What does HackerNews think of whisper-asr-webservice?
OpenAI Whisper ASR Webservice API
On the end user application side, I wish there was something that let me pick a podcast of my choosing, get it fully transcribed, and get embeddings search plus question-answering on top of that podcast or set of chosen podcasts. I've seen ones for specific podcasts, but I'd like one where I can choose the podcast. (Probably won't build it)
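For anyone who does want a starting point, a rough sketch of that pipeline could look like the following. Everything here is an assumption for illustration: the episode file name, the naive fixed-size chunking, and the embedding model are placeholders, and the retrieved chunks would still need to be fed to an LLM for the actual answering step.

```python
# Sketch: transcribe one episode, embed transcript chunks, retrieve relevant chunks for a question.
import whisper
from sentence_transformers import SentenceTransformer, util

# 1. Transcribe the chosen episode (hypothetical local file).
asr = whisper.load_model("base")
text = asr.transcribe("episode.mp3")["text"]

# 2. Chunk the transcript naively and embed each chunk.
chunks = [text[i:i + 1000] for i in range(0, len(text), 1000)]
embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vecs = embedder.encode(chunks, convert_to_tensor=True)

# 3. Retrieve the closest chunks for a question; pass these to an LLM for the Q&A step.
question = "What did the guest say about funding?"
hits = util.semantic_search(
    embedder.encode(question, convert_to_tensor=True), chunk_vecs, top_k=3
)[0]
for hit in hits:
    print(chunks[hit["corpus_id"]][:200])
```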
Also on the end user side, I wish there was an Otter alternative (still paid, $30/mo, but with unlimited minutes per month) that had longer transcription limits. (Started building this, but not much interest from users)
Things I've seen on the dev tool side:
Gladia (API call version of Whisper)
Whisper.cpp
Whisper webservice (https://github.com/ahmetoner/whisper-asr-webservice) - via this thread
Live microphone demo (not real time, it still does it in chunks) https://github.com/mallorbc/whisper_mic
Streamlit UI https://github.com/hayabhay/whisper-ui
Whisper playground https://github.com/saharmor/whisper-playground
Real time whisper https://github.com/shirayu/whispering
Whisper as a service https://github.com/schibsted/WAAS
Improved timestamps and speaker identification https://github.com/m-bain/whisperX
MacWhisper https://goodsnooze.gumroad.com/l/macwhisper
Crossplatform desktop Whisper that supports semi-realtime https://github.com/chidiwilliams/buzz
But I don't know what you consider "in production". If it's for internal use, then it is enough.
Here are some comparisons of running it on GPU vs CPU. According to https://github.com/MiscellaneousStuff/openai-whisper-cpu, the medium model needs 1.7 seconds to transcribe 30 seconds of audio when run on a GPU.
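If you want to reproduce that kind of number on your own hardware, a minimal timing sketch with the openai-whisper Python package looks roughly like this (the 30-second clip name is a placeholder):

```python
# Minimal timing sketch: transcribe a ~30-second clip and report the wall time.
# Install with `pip install -U openai-whisper`; torch comes along as a dependency.
import time

import torch
import whisper

device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("medium", device=device)

start = time.perf_counter()
result = model.transcribe("sample_30s.wav")  # hypothetical 30-second test clip
elapsed = time.perf_counter() - start

print(f"device={device}, elapsed={elapsed:.1f}s")
print(result["text"])
```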
Hah, I love that - "benchmark by fan speed".
Good to know - I've tried large and it works, but in my case I'm using whisper-asr-webservice[0], which loads the configured model for each of the workers on startup. I have some prior experience with Gunicorn and other WSGI implementations, so there's some playing around and benchmarking to be done on the configured number of workers, as Whisper's GPU utilization is a little spiky and whisper-asr-webservice does file format conversion on CPU via ffmpeg. The default was two workers and is now one, but I've found that as many as four with base can really improve overall utilization, response time, and scale (which certainly won't be possible with large).
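For that worker benchmarking, a minimal load-test sketch could look like the one below. It assumes the service is listening on its default port 9000 and exposes `POST /asr` with an `audio_file` upload; check the repo's README for the exact query parameters before relying on them.

```python
# Sketch: fire N parallel transcription requests and compare wall time across worker counts.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:9000/asr"  # default port of the service; adjust for your deployment
AUDIO = "sample.wav"               # hypothetical test clip
CONCURRENCY = 4                    # vary alongside the worker count you're testing

def transcribe_once(_):
    # One synchronous transcription request, timed wall-clock.
    start = time.perf_counter()
    with open(AUDIO, "rb") as f:
        r = requests.post(
            URL,
            params={"task": "transcribe", "output": "json"},
            files={"audio_file": f},
            timeout=600,
        )
    r.raise_for_status()
    return time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(transcribe_once, range(CONCURRENCY)))
total = time.perf_counter() - start

print(f"wall time for {CONCURRENCY} parallel requests: {total:.1f}s")
print("per-request latencies:", [f"{t:.1f}s" for t in latencies])
```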
OP's Node + Express implementation shells out to Whisper, which gives more control (like specifying the model at runtime) but almost certainly ends up slower and less efficient in the long run, since the model is loaded from scratch on each invocation. I'm front-ending whisper-asr-webservice with Traefik, so I could certainly do something like run two separate instances (one for base, another for large) at different URL paths, but like I said, I need to do some playing around with it. The other issue is that if this is being made available to the public, I doubt I'd be comfortable without front-ending the entire thing with Cloudflare (or similar), and Cloudflare (and others) have things like 100-second timeouts for the final HTTP response (WebSockets could get around this).
Thanks for providing the Slim Shady examples; as a life-long hip hop enthusiast, I'm not offended by the content in the slightest.
Whisper is great, but at the point where we're kludging various things together, it might start to make more sense to use something like Nvidia NeMo[1], which was built with all of this in mind and more.
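For comparison, plain transcription through NeMo's pretrained-model path is roughly the sketch below; the checkpoint name is just an example and the file name is a placeholder, so see the NeMo docs for current pretrained models.

```python
# Sketch: transcribe local audio with a pretrained NeMo ASR checkpoint.
import nemo.collections.asr as nemo_asr

# Downloads the pretrained checkpoint on first use (example model name).
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="stt_en_conformer_ctc_large")

# Batch transcription of local WAV files.
transcripts = asr_model.transcribe(["sample.wav"])
print(transcripts[0])
```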
Anyway, I'll be making an issue soon!