This is really cool. I really enjoyed having this built into my Pixel a few years ago (super useful when you want to watch a video in public but don't have headphones). The implementation in Chrome doesn't work that well.

It would be great to support both OpenAI's Whisper model and a custom vocabulary (I find that many transcription errors occur because the words I'm exposed to don't necessarily fall within the most common 50k or 100k words).

Is Whisper fast enough on a local machine to give live captions? I tried the Python module and it takes a long time to process audio files.

The models come in several sizes; the smaller ones run faster at the cost of accuracy.
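For reference, a minimal sketch of the trade-off using the parameter counts listed in the openai/whisper README (these figures come from that README; nothing here actually loads a model):

```python
# Whisper model sizes from the openai/whisper README: smaller models
# run faster (and need less memory) but transcribe less accurately.
MODEL_PARAMS = {
    "tiny": "39M",
    "base": "74M",
    "small": "244M",
    "medium": "769M",
    "large": "1550M",
}

for name, params in MODEL_PARAMS.items():
    print(f"{name:7s} {params:>6s} parameters")
```

With the `openai-whisper` Python package the size is picked by name, e.g. `model = whisper.load_model("base")`; per the README there are also English-only `.en` variants of the smaller models, which tend to do better if you only need English captions.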

There is also a C++ re-implementation, whisper.cpp, that performs well and can definitely transcribe in real time on many machines: https://github.com/ggerganov/whisper.cpp