It's interesting that TTS is getting better and better while consumer access to it is more and more restricted. A decade ago there were a half dozen totally separate TTS engines I could install on my phone, and my Kindle came with its own that worked on any book.

Your average mobile processor doesn't have anywhere near enough processing power to run a state-of-the-art text-to-speech network in real time. Most text-to-speech on mobile hardware is streamed from the cloud.

I had a lot of success using FastSpeech2 + MB MelGAN via TensorFlowTTS (https://github.com/TensorSpeech/TensorFlowTTS). There are demos for iOS and Android that let you run pretty convincing, modern TTS models with only a few hundred milliseconds of processing latency.
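
If you want to check the "real-time" question on your own device, the usual yardstick is the real-time factor: processing time divided by the length of the audio produced, where anything under 1.0 means synthesis keeps up with playback. Here's a rough sketch of how I'd measure it. `fake_synthesize` is a made-up stand-in for whatever model call you actually use (it is not part of TensorFlowTTS), and the 22050 Hz sample rate is an assumption you should swap for your model's real output rate.

```python
import time
from typing import Callable, Sequence, Tuple


def real_time_factor(processing_seconds: float, num_samples: int, sample_rate: int) -> float:
    """Processing time divided by the duration of the generated audio.

    Values below 1.0 mean the audio was produced faster than it plays back.
    """
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_seconds = num_samples / sample_rate
    return processing_seconds / audio_seconds


def measure(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate: int,
) -> Tuple[float, float]:
    """Time one synthesis call; return (latency in seconds, real-time factor)."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed, real_time_factor(elapsed, len(samples), sample_rate)


def fake_synthesize(text: str) -> Sequence[float]:
    """Hypothetical stand-in for a real TTS model.

    Returns 0.1 s of silence per input character at an assumed 22050 Hz.
    """
    return [0.0] * (len(text) * 2205)


if __name__ == "__main__":
    latency, rtf = measure(fake_synthesize, "Hello from the phone.", 22050)
    print(f"latency: {latency * 1000:.1f} ms, real-time factor: {rtf:.4f}")
```

To put numbers on it: 300 ms of processing for a 3 second utterance is a real-time factor of about 0.1, so roughly ten times faster than playback. Replace `fake_synthesize` with your actual model call to get figures for your hardware.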