Could it generate speech with enough training?

And what is the SOTA for TTS these days?

TortoiseTTS is pretty impressive: https://github.com/neonbjb/tortoise-tts