"Once your request is submitted, it takes one to two months to process the book and conduct quality checks."

My guess is that these generated voices are far from perfect, and someone has to go in and crank the algorithm to keep a fair number of passages from sounding strange.

Even in the Helena example, there is a word at the end of a sentence that sounds like it should be in the middle and has a bit of weirdness to it. Still, very impressive; I think it sounds better than I remember Amazon Polly sounding.

Why is it that we still can't have perfect or near-perfect text-to-speech, given all the astonishing advances in ML taking place? Is TTS an area nobody is really interested in, or is it harder than generating beautiful pictures and sophisticated writing?

This thing by Apple already sounds way better than the best I'd heard previously (NextUp Ivona), but it is not an instant-result offline tool yet, and that's sad.

I wanted to make a human-like reading feature for our language-learning software. Training a model isn't too hard using something like https://github.com/coqui-ai/TTS.
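To give a sense of how low the barrier is, here is a minimal inference sketch with one of Coqui's pretrained English models. The model name and the `TTS.api` interface are from memory and may differ between library versions, so treat it as an illustration rather than our actual code:

```python
# Minimal sketch: synthesize speech with a pretrained Coqui TTS model.
# Requires `pip install TTS`; the model downloads on first use.
from TTS.api import TTS

# Load a pretrained single-speaker English model (name may vary by version).
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

# Synthesize a sentence to a WAV file.
tts.tts_to_file(
    text="Reading practice sentences aloud helps learners link spelling to sound.",
    file_path="sample.wav",
)
```

Fine-tuning on your own speaker follows the same library's training recipes; the hard part, as below, is the data.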

The weak link was the available free/open datasets. You needed a single speaker with a pleasant voice, 20+ hours of material from varied sources, recorded in a good recording environment with a good mic, etc. For English, the go-to was LJSpeech, which doesn't fulfill all these requirements. I say 'was', as I haven't followed developments recently.
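For anyone rolling their own: the usual layout (what LJSpeech uses and what most training recipes expect) is a wavs/ folder plus a pipe-delimited metadata.csv, and the first sanity check is just summing durations to see whether you've actually hit the 20-hour mark. A stdlib-only sketch, with hypothetical paths:

```python
# Sketch: check total duration of an LJSpeech-style dataset.
# Layout assumed: my_dataset/metadata.csv ("file_id|transcript" lines)
#                 my_dataset/wavs/<file_id>.wav
import csv
import wave
from pathlib import Path

dataset = Path("my_dataset")  # hypothetical dataset root
total_seconds = 0.0

with open(dataset / "metadata.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="|"):
        file_id = row[0]
        with wave.open(str(dataset / "wavs" / f"{file_id}.wav"), "rb") as w:
            total_seconds += w.getnframes() / w.getframerate()

print(f"Total audio: {total_seconds / 3600:.1f} hours")
```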

Last year we decided to make our own dataset with an Irish woman, Jenny. She has a soft Irish lilt.

Never got around to training the model, but I will upload the raw audio and prompts here in a few hours (need to pay my internet bill in town...):

https://github.com/dioco-group/jenny-tts-dataset/blob/main/R...