Whisper as an API is great, but having to send the whole payload upfront is a bummer. Most use cases I can build for would want streaming support.
Something like establishing a WebRTC connection, streaming audio to OpenAI, and getting back a live transcription until the audio channel closes.
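Even without real streaming support, you can approximate it client-side by framing the audio into short fixed-duration chunks and sending each one as it fills up. A minimal sketch, assuming 16 kHz mono 16-bit PCM; `send_chunk` here is a hypothetical stand-in for whatever call would push a chunk to a transcription endpoint:

```python
SAMPLE_RATE = 16_000
BYTES_PER_SAMPLE = 2  # 16-bit mono PCM

def frame_audio(pcm: bytes, chunk_seconds: float = 5.0):
    """Yield fixed-duration chunks of raw PCM; the last chunk may be shorter."""
    chunk_bytes = int(SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_seconds)
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]

def stream_transcribe(pcm: bytes, send_chunk):
    """Push each chunk as it becomes available; collect the partial transcripts."""
    return [send_chunk(chunk) for chunk in frame_audio(pcm)]
```

This is the crude version (transcribing each chunk independently loses cross-chunk context); a real streaming protocol would keep server-side state across chunks.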
I've run Whisper locally via [1] with one of the medium-sized models and it was damn good at transcribing audio from a video of two people having a conversation.
I don't know exactly what the use case is where people would need to run this via the API: the compute demands aren't huge (I used CPU only, on an M1) and the memory requirements are modest.