Was there a big difference in accuracy depending on which model you used?

Yes, large was by far the best, but still not accurate enough that I'd be willing to put it into a fully automated pipeline. It got things right maybe 75% of the time. Anything other than the large model was far too inaccurate to even think about using.

What was the performance, resource usage, etc of doing this with large? What's the speed like?

I'm still getting spun up on this, but base delivers a pretty impressive 5-20x realtime on my RTX 3090. I haven't gotten around to trying the larger models, and with only 24GB of VRAM I'm not sure how much success I'll have anyway...

In my case the goal was to actually generate tweets based on XYZ. As I've already said, there were serious technical challenges, so I abandoned the project, but I was also a little concerned about the privacy, safety, etc. issues of realtime or near-realtime reporting on public safety activity. I also streamed to Broadcastify, and it really seems like they insert artificial delay because of these concerns.

You can run the larger models just fine on a 3090. Large takes about 10GB of VRAM for transcribing English.

For a 1:17 file it takes:

6s for base.en, of which I think 2s is loading the model (based on the sound of my power supply).

33s for large, I think 11s of which is loading the model.

This varies a lot with how dense the audio is; this was me giving a talk, so not the fastest speech, and the audio is quite clean.

While I saw near-perfect or perfect performance on many things with smaller models, the large model really is better. I'll upload a gist in a bit with Rap God passed through base.en and large.
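If you want to sanity-check the ~10GB figure on your own card, something like this should do it (a minimal sketch assuming the openai-whisper package and PyTorch with CUDA; "talk.wav" is just a stand-in for your own file):

```python
# Load Whisper large, transcribe once, and report peak GPU memory.
# Assumes openai-whisper and PyTorch with CUDA; "talk.wav" is a placeholder.
import torch
import whisper

torch.cuda.reset_peak_memory_stats()
model = whisper.load_model("large")  # loads onto the GPU when CUDA is available
result = model.transcribe("talk.wav", language="en", task="transcribe")

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak GPU memory: {peak_gb:.1f} GB")
print(result["text"][:200])
```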

edit -

Timings (explicitly marked as language en and task transcribe):

base.en => 23s

large => 2m50

Audio length 6m10

Results (nsfw, it's Rap God by Eminem): https://gist.github.com/IanCal/c3f9bcf91a79c43223ec59a56569c...

The base model does well, given that it's a rap. The large model just does incredibly well, imo. The audio is very clear, but it does have music too.
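For anyone wanting to reproduce this, a rough harness along these lines is all it takes (a sketch using openai-whisper; "rap_god.mp3" stands in for whatever file you test with, and the exact numbers will obviously vary with hardware and audio density):

```python
# Time model load and transcription separately, with language/task pinned,
# for a couple of model sizes. Assumes the openai-whisper package and ffmpeg.
import time
import whisper

def bench(model_name: str, audio_path: str) -> None:
    t0 = time.perf_counter()
    model = whisper.load_model(model_name)
    t1 = time.perf_counter()
    result = model.transcribe(audio_path, language="en", task="transcribe")
    t2 = time.perf_counter()
    print(f"{model_name}: load {t1 - t0:.1f}s, transcribe {t2 - t1:.1f}s")
    print(result["text"][:120])

for name in ("base.en", "large"):
    bench(name, "rap_god.mp3")  # placeholder path
```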

> based on the sound of my power supply

Hah, I love that - "benchmark by fan speed".

Good to know - I've tried large and it works, but in my case I'm using whisper-asr-webservice[0], which loads the configured model for each of the workers on startup. I have some prior experience with Gunicorn and other WSGI implementations, so there's some playing around and benchmarking to be done on the configured number of workers: Whisper's GPU utilization is a little spiky, and whisper-asr-webservice does file format conversion on CPU via ffmpeg. The default was two workers and mine is now one, but I've found that as many as four with base can really improve overall utilization, response time, and scale (which certainly won't be possible with large).
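To settle on a worker count I'll probably just throw concurrent requests at the service and watch GPU utilization on the side. Roughly this kind of load test, as a sketch (the /asr path, the audio_file field, and port 9000 are placeholders on my part; check the repo for the exact endpoint and adjust to your deployment):

```python
# Rough load test for comparing worker counts; not the service's own code.
# Endpoint, form field, and port below are assumptions, not gospel.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:9000/asr"   # assumed endpoint/port
AUDIO = "sample.wav"                # placeholder test file
CONCURRENCY = 4

def one_request(_):
    with open(AUDIO, "rb") as f:
        r = requests.post(URL, files={"audio_file": f}, timeout=600)
    r.raise_for_status()
    return r.elapsed.total_seconds()

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = list(pool.map(one_request, range(CONCURRENCY * 4)))
total = time.perf_counter() - t0
print(f"{len(latencies)} requests in {total:.1f}s, "
      f"avg latency {sum(latencies) / len(latencies):.1f}s")
```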

OP's node+express implementation shells out to Whisper, which gives more control (like specifying the model at runtime) but is almost certainly slower and less efficient in the long run, since the model is loaded from scratch on each invocation. I'm front-ending whisper-asr-webservice with Traefik, so I could certainly do something like running two separate instances (one for base, another for large) at different URL paths, but like I said, I need to do some playing around with it. The other issue is that if this is being made available to the public, I doubt I'd be comfortable without front-ending the entire thing with Cloudflare (or similar), and Cloudflare (and others) have things like 100s timeouts for the final HTTP response (WebSockets could get around this).
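For contrast, the "load the model once and keep it resident" shape is roughly this (a bare sketch, not whisper-asr-webservice's actual code; the endpoint and field names are made up):

```python
# Sketch of a load-once service, as opposed to shelling out to the whisper
# CLI per request. NOT whisper-asr-webservice's implementation; endpoint and
# field names are invented. Assumes fastapi, uvicorn, python-multipart, and
# openai-whisper are installed.
import os
import tempfile

import whisper
from fastapi import FastAPI, UploadFile

app = FastAPI()
model = whisper.load_model("base.en")  # loaded once at startup, stays in VRAM

@app.post("/transcribe")
def transcribe_endpoint(audio: UploadFile):
    # Sync route: FastAPI runs it in a threadpool, so the blocking
    # model.transcribe() call doesn't stall the event loop.
    suffix = os.path.splitext(audio.filename or "")[1] or ".wav"
    with tempfile.NamedTemporaryFile(suffix=suffix) as tmp:
        tmp.write(audio.file.read())  # spool the upload so ffmpeg can read it
        tmp.flush()
        result = model.transcribe(tmp.name, language="en", task="transcribe")
    return {"text": result["text"]}

# Run with something like: uvicorn main:app --workers 1
# (module name depends on your filename; one worker per resident model copy)
```

Given that model load alone is a noticeable chunk of each run (see the timings upthread), paying that cost on every invocation adds up quickly, which is the main argument for keeping the model resident.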

Thanks for providing the Slim Shady examples; as a life-long hip hop enthusiast, I'm not offended by the content in the slightest.

[0] - https://github.com/ahmetoner/whisper-asr-webservice