I recently tried Whisper to transcribe our local Seattle Fire Department radio scanner -- unfortunately it was not reliable enough for my use case. For example, "adult male hit by car" gets transcribed as "don't mail it by car".

I imagine future models will allow the user to input some context to disambiguate. Like if I could give it the audio along with the context "Seattle Fire Department and EMS radio traffic", it would bias towards the type of things you'd likely hear on such a channel.

Have you tried the --initial_prompt CLI arg? For my use case, I put a bunch of industry jargon and commonly misspelled names in there, and that fixes 1/3 to 1/2 of the errors.
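The same option is exposed in the Python API as the initial_prompt argument to transcribe(). A minimal sketch -- the jargon string, model size, and file name are just placeholders, not my actual setup:

```python
import whisper

# Model size is a guess; pick whatever fits your hardware.
model = whisper.load_model("medium")

result = model.transcribe(
    "scanner_clip.wav",  # hypothetical audio file
    initial_prompt=(
        "Seattle Fire Department and EMS radio traffic: "
        "adult male, aid response, medic unit, ladder truck, BLS, ALS"
    ),
)
print(result["text"])
```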

I was initially going to use Azure Cognitive Services and train it on a small amount of test data. After Whisper was released for free, I switched to Whisper + OpenAI GPT-3, trained to fix the transcription errors by 1) taking a sample of transcripts produced by Whisper, 2) fixing the errors by hand, and 3) fine-tuning GPT-3 using the unfixed transcriptions as the prompt and the corrected transcripts as the completion.
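Step 3 basically amounts to writing prompt/completion pairs into a JSONL file for the (legacy) GPT-3 fine-tuning API. A rough sketch, assuming you already have the raw and hand-corrected transcripts paired up -- the separator and stop-token conventions follow the old fine-tuning docs, and the file names are illustrative:

```python
import json

# (raw_whisper_transcript, hand_corrected_transcript) pairs
pairs = [
    ("don't mail it by car", "adult male hit by car"),
    # ... more examples
]

with open("fix_transcripts.jsonl", "w") as f:
    for raw, corrected in pairs:
        record = {
            "prompt": raw + "\n\n###\n\n",            # separator between prompt and completion
            "completion": " " + corrected + " END",   # leading space + stop sequence
        }
        f.write(json.dumps(record) + "\n")

# The file was then uploaded with the legacy OpenAI CLI, e.g.
#   openai api fine_tunes.create -t fix_transcripts.jsonl -m davinci
# (the fine-tuning interface has since changed, so check the current docs).
```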

Whisper with the --initial_prompt containing industry jargon, plus GPT-3 trained to fix the transcription errors, should be nearly as accurate as a custom-trained model in Azure Cognitive Services but at 5-10% of the cost. The biggest downsides are the amount of labor to set that up and the snail's pace of Whisper transcriptions.
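The correction pass is then just calling the fine-tuned model on each raw Whisper transcript. A sketch using the legacy (pre-1.0) openai Python client -- the fine-tuned model id is hypothetical, and the separator/stop sequence match whatever you used in training:

```python
import openai

def fix_transcript(raw_text: str) -> str:
    # Legacy Completion API; model id below is a made-up example.
    response = openai.Completion.create(
        model="davinci:ft-yourorg-2023-01-01",
        prompt=raw_text + "\n\n###\n\n",
        max_tokens=256,
        temperature=0,
        stop=[" END"],
    )
    return response["choices"][0]["text"].strip()

print(fix_transcript("don't mail it by car"))
```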

There have been a lot of hacks to speed up Whisper inference.

Sweet! Do you have any links to resources on how to speed it up? I couldn't find any while searching Google or the Whisper discussion forums.

Not a hack per se but a complete reimplementation.

https://github.com/ggerganov/whisper.cpp

This is a C/C++ version of Whisper that runs on the CPU. It's astoundingly fast. Maybe it won't work for your use case, but you should try it!
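If it helps, here's a hedged sketch of driving it from Python after building the project. The binary name, model path, and flags follow the repo's README examples at the time of writing, so verify them against the current docs; the input needs to be a 16 kHz WAV:

```python
import subprocess

# Shell out to whisper.cpp's example binary (built via `make`).
result = subprocess.run(
    [
        "./main",
        "-m", "models/ggml-base.en.bin",  # ggml-format model from the repo's download script
        "-f", "scanner_clip.wav",         # hypothetical 16 kHz WAV input
    ],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)  # transcript with timestamps is printed to stdout
```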