What does HackerNews think of whisper-asr-webservice?
OpenAI Whisper ASR Webservice API
On the end user application side, I wish there was something that let me pick a podcast of my choosing, get it fully transcribed, and get embeddings search plus question-answering on top of that podcast or set of chosen podcasts. I've seen ones for specific podcasts, but I'd like one where I can choose the podcast. (Probably won't build it)
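For anyone who does want a starting point, a rough sketch of that pipeline could look like the following. Everything here is an assumption for illustration: the episode file name, the naive fixed-size chunking, and the embedding model are placeholders, and the retrieved chunks would still need to be fed to an LLM for the actual answering step.

```python
# Sketch: transcribe one episode, embed transcript chunks, retrieve relevant chunks for a question.
import whisper
from sentence_transformers import SentenceTransformer, util

# 1. Transcribe the chosen episode (hypothetical local file).
asr = whisper.load_model("base")
text = asr.transcribe("episode.mp3")["text"]

# 2. Chunk the transcript naively and embed each chunk.
chunks = [text[i:i + 1000] for i in range(0, len(text), 1000)]
embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vecs = embedder.encode(chunks, convert_to_tensor=True)

# 3. Retrieve the closest chunks for a question; pass these to an LLM for the Q&A step.
question = "What did the guest say about funding?"
hits = util.semantic_search(
    embedder.encode(question, convert_to_tensor=True), chunk_vecs, top_k=3
)[0]
for hit in hits:
    print(chunks[hit["corpus_id"]][:200])
```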
Also on the end user side, I wish there was an Otter alternative (still paid, $30/mo, but with unlimited minutes per month) that had longer transcription limits. (Started building this, but not much interest from users)
Things I've seen on the dev tool side:
Gladia (API call version of Whisper)
Whisper.cpp
Whisper webservice (https://github.com/ahmetoner/whisper-asr-webservice) - via this thread
Live microphone demo (not real time, it still does it in chunks) https://github.com/mallorbc/whisper_mic
Streamlit UI https://github.com/hayabhay/whisper-ui
Whisper playground https://github.com/saharmor/whisper-playground
Real time whisper https://github.com/shirayu/whispering
Whisper as a service https://github.com/schibsted/WAAS
Improved timestamps and speaker identification https://github.com/m-bain/whisperX
MacWhisper https://goodsnooze.gumroad.com/l/macwhisper
Crossplatform desktop Whisper that supports semi-realtime https://github.com/chidiwilliams/buzz
But I don't know what you consider "in production". If it's for internal use, then it is enough.
Here are some comparisons of running it on GPU vs CPU. According to https://github.com/MiscellaneousStuff/openai-whisper-cpu, the medium model needs 1.7 seconds to transcribe 30 seconds of audio when run on a GPU.
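If you want to reproduce that kind of number on your own hardware, a minimal timing sketch with the openai-whisper Python package looks roughly like this (the 30-second clip name is a placeholder):

```python
# Minimal timing sketch: transcribe a ~30-second clip and report the wall time.
# Install with `pip install -U openai-whisper`; torch comes along as a dependency.
import time

import torch
import whisper

device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("medium", device=device)

start = time.perf_counter()
result = model.transcribe("sample_30s.wav")  # hypothetical 30-second test clip
elapsed = time.perf_counter() - start

print(f"device={device}, elapsed={elapsed:.1f}s")
print(result["text"])
```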
Hah, I love that - "benchmark by fan speed".
Good to know - I've tried large and it works, but in my case I'm using whisper-asr-webservice[0], which loads the configured model for each of the workers on startup. I have some prior experience with Gunicorn and other WSGI implementations, so there's some playing around and benchmarking to be done on the configured number of workers, as Whisper's GPU utilization is a little spiky and whisper-asr-webservice does file format conversion on CPU via ffmpeg. The default was two workers and is now one, but I've found that as many as four with base can really improve overall utilization, response time, and scale (which certainly won't be possible with large).
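For that worker benchmarking, a minimal load-test sketch could look like the one below. It assumes the service is listening on its default port 9000 and exposes `POST /asr` with an `audio_file` upload; check the repo's README for the exact query parameters before relying on them.

```python
# Sketch: fire N parallel transcription requests and compare wall time across worker counts.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:9000/asr"  # default port of the service; adjust for your deployment
AUDIO = "sample.wav"               # hypothetical test clip
CONCURRENCY = 4                    # vary alongside the worker count you're testing

def transcribe_once(_):
    # One synchronous transcription request, timed wall-clock.
    start = time.perf_counter()
    with open(AUDIO, "rb") as f:
        r = requests.post(
            URL,
            params={"task": "transcribe", "output": "json"},
            files={"audio_file": f},
            timeout=600,
        )
    r.raise_for_status()
    return time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(transcribe_once, range(CONCURRENCY)))
total = time.perf_counter() - start

print(f"wall time for {CONCURRENCY} parallel requests: {total:.1f}s")
print("per-request latencies:", [f"{t:.1f}s" for t in latencies])
```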
OP's Node + Express implementation shells out to Whisper, which gives more control (like specifying the model at runtime) but almost certainly ends up slower and less efficient in the long run, since the model is loaded from scratch on each invocation. I'm front-ending whisper-asr-webservice with Traefik, so I could certainly do something like run two separate instances (one for base, another for large) at different URL paths, but like I said, I need to do some playing around with it. The other issue is that if this is being made available to the public, I doubt I'd be comfortable without front-ending the entire thing with Cloudflare (or similar), and Cloudflare (and others) have things like 100-second timeouts for the final HTTP response (WebSockets could get around this).
Thanks for providing the Slim Shady examples; as a life-long hip hop enthusiast, I'm not offended by the content in the slightest.
Whisper is great, but at the point where we're kludging various things together, it might start to make more sense to use something like Nvidia NeMo[1], which was built with all of this in mind and more.
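For comparison, plain transcription through NeMo's pretrained-model path is roughly the sketch below; the checkpoint name is just an example and the file name is a placeholder, so see the NeMo docs for current pretrained models.

```python
# Sketch: transcribe local audio with a pretrained NeMo ASR checkpoint.
import nemo.collections.asr as nemo_asr

# Downloads the pretrained checkpoint on first use (example model name).
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="stt_en_conformer_ctc_large")

# Batch transcription of local WAV files.
transcripts = asr_model.transcribe(["sample.wav"])
print(transcripts[0])
```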
Anyway, I'll be making an issue soon!