Free startup idea: Use Whisper with pyannote-audio[0]’s speaker diarization. Upload a recording, get back a multi-speaker annotated transcription.

Make a JSON API and I’ll be your first customer.

[0] https://github.com/pyannote/pyannote-audio

I think there's been talk to do speaker diarization with whisper-asr-webservice[0] which is also written in python and should be able to make use of goodies such as pyannote-audio, py-webrtcvad, etc.

Whisper is great but at the point we get to kludging various things together it might start to make more sense to use something like Nvidia NeMo[1] which was built with all of this in mind and more.

[0] - https://github.com/ahmetoner/whisper-asr-webservice

[1] - https://github.com/NVIDIA/NeMo