Mozilla TTS is a great open-source speech synthesizer built on state-of-the-art ML models: https://github.com/mozilla/TTS

If you need good TTS, this is it. Given a good dataset, it sounds as good as the Google, Apple, or Amazon products. The pretrained models are decent enough for real use cases.

Datasets are the problem. Quality data is expensive: it takes about 24 hours of aligned, annotated, clean voice samples to train a model.
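To get a feel for how far a pile of recordings is from that ~24 hour mark, here is a rough sketch that adds up the length of every WAV file in a folder. It only uses the Python standard library and assumes uncompressed PCM .wav files; the function name `total_hours` is just something I made up for illustration, not part of Mozilla TTS.

```python
import wave
from pathlib import Path


def total_hours(wav_dir):
    """Sum the duration of every .wav file under wav_dir, in hours."""
    seconds = 0.0
    for path in sorted(Path(wav_dir).rglob("*.wav")):
        # The wave module only reads uncompressed PCM WAV files.
        with wave.open(str(path), "rb") as w:
            # Duration of one clip = number of frames / frames per second.
            seconds += w.getnframes() / w.getframerate()
    return seconds / 3600.0
```

Calling `total_hours("my_voice/wavs")` (a hypothetical path) returns a float you can compare against 24.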

The most common "open" dataset is LJSpeech. It's kind of noisy and was recorded with a bad mic, which noticeably degrades the generated audio. The result is still miles better than anything from 5+ years ago, but it has a metallic quality.
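For reference, LJSpeech is laid out as a wavs/ folder plus a metadata.csv where each line is pipe-separated: clip id, transcript, normalized transcript. Below is a minimal sketch of a sanity check for a dataset in that layout. I'm assuming your own recordings would follow the same structure; `check_dataset` is a hypothetical helper of mine, not something Mozilla TTS ships.

```python
import csv
from pathlib import Path


def check_dataset(root):
    """Return a list of problems found in an LJSpeech-style dataset folder."""
    root = Path(root)
    problems = []
    with open(root / "metadata.csv", encoding="utf-8", newline="") as f:
        # QUOTE_NONE: transcripts may contain literal quote characters.
        reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
        for line_no, row in enumerate(reader, start=1):
            # Each row needs at least an id and a non-empty transcript.
            if len(row) < 2 or not row[1].strip():
                problems.append(f"line {line_no}: missing transcript")
                continue
            # Each id should have a matching audio clip in wavs/.
            if not (root / "wavs" / f"{row[0]}.wav").is_file():
                problems.append(f"line {line_no}: no wav for {row[0]}")
    return problems
```

An empty list means every line has a transcript and a matching clip. It says nothing about audio quality or alignment, which is the expensive part.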

If anyone is interested, you can record your own voice to make a model for Mozilla TTS. What isn't talked about much is that modern ML voice models aren't synthetic voices in the old sense; they're audio deepfakes. Clones. We're talking about the point where your friends couldn't tell whether it was you or the model on the phone.

Some people might find that scary, but I think it's incredibly cool. If you build a high-quality model for Mozilla TTS, your voice can outlive you.

I think the creator of the project has moved on and it isn't maintained anymore, but they now have https://github.com/coqui-ai/TTS