Great job! I must say the speech synthesis sounds pretty realistic. I talked with Jobs, Musk, and Obama and liked how they sounded and, more importantly, how they handled the questions. Do you mind sharing the entire stack you used to build this? Very well done!

Thanks, much appreciated! It was a mixture of some of the latest TTS models, Azure speech-to-text, GPT of course, and some other tools for handling conversational behavior (like interruptions).
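
At a high level the loop is: transcribe the user, send the transcript to the model, then synthesize the reply. A heavily simplified sketch of that loop (not our actual production code; it assumes the Azure Speech SDK and the OpenAI Python client, with placeholder credentials, and leaves out the persona prompt and interruption handling entirely):

    import azure.cognitiveservices.speech as speechsdk
    from openai import OpenAI

    # Placeholder key/region; real code would stream and handle barge-in.
    speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="eastus")
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
    client = OpenAI()

    while True:
        heard = recognizer.recognize_once().text          # speech -> text
        reply = client.chat.completions.create(           # text -> text
            model="gpt-4",
            messages=[{"role": "user", "content": heard}],
        ).choices[0].message.content
        synthesizer.speak_text_async(reply).get()         # text -> speech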

Nicely done. Does Azure Speech to Text also handle speech synthesis and provide out-of-the-box voices for different characters, or did you have to build your own model for that? It's impressive if their service can do it all: speech recognition (speech to text) and text to speech, in near real time. I should take a closer look at the Azure ML stack :)

I've been using the Azure Cognitive Services speech recognition and text-to-speech for my own locally run 'speech-to-speech' GPT assistant application.

I found the Azure speech recognition to be fantastic, almost never making mistakes. The latency is also at a level that only the big cloud providers can reach. A locally run alternative I use is Vosk [0], but it is nowhere near as polished as Azure speech recognition and limits conversation to simple topics. (Running whisper.cpp locally is not an option for me; it's too heavy and slow on my machine for a proper conversation.)
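
For anyone who wants to try it, the basic Azure recognition setup with the official Python SDK is only a few lines (the key and region below are placeholders):

    import azure.cognitiveservices.speech as speechsdk

    # Placeholder credentials; pick the region closest to you for the best latency.
    speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="westeurope")
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

    # Blocks until one utterance is heard on the default microphone.
    result = recognizer.recognize_once()
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        print(result.text)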

The default Azure voices available for text-to-speech are great too. There are around 500 of them in a wide variety of languages. Using SSML [1] can also really improve the quality of interactions, and a subset of the voices support extra capabilities, like responding with emotions (see 'Speaking styles and roles' in the docs).
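
As an illustration, here is roughly how a speaking style is applied via SSML through the same Python SDK (the voice and style below are just examples; only a subset of the neural voices support express-as):

    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="westeurope")
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

    # express-as wraps the text in a speaking style; see 'Speaking styles and roles' [1].
    ssml = """
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
           xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
      <voice name="en-US-JennyNeural">
        <mstts:express-as style="cheerful">
          Speech synthesis sure has come a long way!
        </mstts:express-as>
      </voice>
    </speak>
    """
    synthesizer.speak_ssml_async(ssml).get()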

Though in my opinion the default Azure voice models have nothing on what OP is providing. The Scarlett Johansson voice is really, really good, especially combined with the personality they have given it. I would love to be able to run this model locally on my machine if OP is willing to share some information about it!

Maybe OP could improve the latency of Banterai by dynamically setting the Azure region for speech recognition based on the incoming IP. I see that 'eastus' is used even though I'm in West Europe.
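
Something along these lines (the region table and the continent code are hypothetical; any geo-IP lookup on the incoming request would do):

    # Hypothetical mapping from a coarse client location to a nearby Azure region;
    # a real table would cover more geographies.
    NEAREST_REGION = {
        "NA": "eastus",
        "EU": "westeurope",
        "AS": "southeastasia",
    }

    def pick_region(client_continent: str) -> str:
        # Fall back to the current default when the client can't be placed.
        return NEAREST_REGION.get(client_continent, "eastus")

    region = pick_region("EU")  # -> "westeurope"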

But other than that I think this is the best 'speech-to-speech AI' demo I've seen so far. Fantastic job!

[0] https://github.com/alphacep/vosk-api/

[1] https://learn.microsoft.com/en-us/azure/cognitive-services/s...