Great job! I must say the speech synthesis sounds pretty realistic. I talked with Jobs, Musk, and Obama and liked how they sounded and, more importantly, how they handled the questions. Do you mind sharing the entire stack you used to build this? Very well done!

Thanks, much appreciated! It was a mixture of some of the latest TTS models, Azure speech-to-text, GPT of course, and some other tools for handling conversational behavior (like interruptions).
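
At a high level the loop is: transcribe the user, send the transcript to the model, then synthesize the reply. A heavily simplified sketch of that loop (not our actual production code; it assumes the Azure Speech SDK and the OpenAI Python client, with placeholder credentials, and leaves out the persona prompt and interruption handling entirely):

    import azure.cognitiveservices.speech as speechsdk
    from openai import OpenAI

    # Placeholder key/region; real code would stream and handle barge-in.
    speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="eastus")
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
    client = OpenAI()

    while True:
        heard = recognizer.recognize_once().text          # speech -> text
        reply = client.chat.completions.create(           # text -> text
            model="gpt-4",
            messages=[{"role": "user", "content": heard}],
        ).choices[0].message.content
        synthesizer.speak_text_async(reply).get()         # text -> speech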

Nicely done. Does Azure Speech to Text also handle speech synthesis and provide out-of-the-box voices for different characters, or did you have to build your own model for that? It's impressive if their service can do it all: speech recognition (speech to text) and text to speech, in near real time. I should take a closer look at the Azure ML stack :)

I've been using the Azure Cognitive Services speech recognition and text-to-speech for my own locally run 'speech-to-speech' GPT assistant application.

I found the Azure speech recognition to be fantastic, almost never making mistakes. The latency is also at a level that only the big cloud providers can reach. A locally run alternative I use is Vosk [0], but it is nowhere near as polished as Azure speech recognition and limits conversation to simple topics. (Running whisper.cpp locally is not an option for me; it's too heavy and slow on my machine for a proper conversation.)
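
For anyone who wants to try it, the basic Azure recognition setup with the official Python SDK is only a few lines (the key and region below are placeholders):

    import azure.cognitiveservices.speech as speechsdk

    # Placeholder credentials; pick the region closest to you for the best latency.
    speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="westeurope")
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

    # Blocks until one utterance is heard on the default microphone.
    result = recognizer.recognize_once()
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        print(result.text)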

The default Azure voices available for text-to-speech are great too. There are around 500 of them in a wide variety of languages. Using SSML [1] can also really improve the quality of interactions, and a subset of the voices support extra capabilities, like responding with emotions (see 'Speaking styles and roles' in the docs).
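
As an illustration, here is roughly how a speaking style is applied via SSML through the same Python SDK (the voice and style below are just examples; only a subset of the neural voices support express-as):

    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="westeurope")
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

    # express-as wraps the text in a speaking style; see 'Speaking styles and roles' [1].
    ssml = """
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
           xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
      <voice name="en-US-JennyNeural">
        <mstts:express-as style="cheerful">
          Speech synthesis sure has come a long way!
        </mstts:express-as>
      </voice>
    </speak>
    """
    synthesizer.speak_ssml_async(ssml).get()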

Though in my opinion the default Azure voice models have nothing on what OP is providing. The Scarlett Johansson voice is really, really good, especially combined with the personality they have given it. I would love to be able to run this model locally on my machine if OP is willing to share some information about it!

Maybe OP could improve the latency of Banterai by dynamically setting the Azure region for speech recognition based on the incoming IP. I see that 'eastus' is used even though I'm in West Europe.
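
Something along these lines (the region table and the continent code are hypothetical; any geo-IP lookup on the incoming request would do):

    # Hypothetical mapping from a coarse client location to a nearby Azure region;
    # a real table would cover more geographies.
    NEAREST_REGION = {
        "NA": "eastus",
        "EU": "westeurope",
        "AS": "southeastasia",
    }

    def pick_region(client_continent: str) -> str:
        # Fall back to the current default when the client can't be placed.
        return NEAREST_REGION.get(client_continent, "eastus")

    region = pick_region("EU")  # -> "westeurope"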

But other than that I think this is the best 'speech-to-speech AI' demo I've seen so far. Fantastic job!

[0] https://github.com/alphacep/vosk-api/

[1] https://learn.microsoft.com/en-us/azure/cognitive-services/s...