Probably not realistic. On an M1 Pro MBP, Whisper runs far slower than real time. Think on the order of days for a 2 hour recording.

I’ve been doing transcription work for public meetings. Whisper is truly incredible in terms of error rate, even in extremely challenging circumstances (obscure acronyms, unusual terms, unusual names, poor recording quality). I was seeing only a few errors per hour; most things that look like errors are in fact accurate representations of humans saying weird things. But I have to run it on my desktop with CUDA enabled. With the medium model it is, iirc, barely faster than real time. I only have a 1070, so it may be better on more modern hardware.

Whisper does also have some slightly strange behavior with silence and very long recordings. I might do a blog post once I’ve got more experience.

On an M1 Pro, with the greedy decoder and the medium model, I can transcribe 1 hour of audio in just 10 minutes (~6x real time) [0].

[0] https://github.com/ggerganov/whisper.cpp
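
For anyone comparing numbers across setups, the "~6x real time" figure above is just audio duration divided by processing time. A minimal sketch of that arithmetic follows; the function names are hypothetical helpers for illustration, not part of Whisper or whisper.cpp, and the 2-hour estimate assumes the speed factor stays constant for longer recordings, which may not hold in practice.

```python
def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Return how many times faster than real time a transcription ran.

    A value above 1.0 means faster than real time, below 1.0 means slower.
    """
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def estimated_processing_seconds(audio_seconds: float, factor: float) -> float:
    """Estimate processing time, assuming the speed factor stays constant."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return audio_seconds / factor


# 1 hour of audio transcribed in 10 minutes, as reported above.
factor = speed_factor(60 * 60, 10 * 60)
print(factor)  # 6.0

# Rough estimate for a 2 hour recording at the same factor, in minutes.
print(estimated_processing_seconds(2 * 60 * 60, factor) / 60)  # 20.0
```

By the same measure, "barely faster than real time" on the 1070 corresponds to a factor only slightly above 1.0.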