What models/architecture are they using?
From the descriptions here it sounds a lot like AudioLM / SPEAR TTS / some of Meta's recent multilingual TTS approaches. Those models are not open source, but PlayHT's approach sounds like it is in a similar spirit. The discussion of "mel tokens" is closer to what I would call the classic TTS pipeline in many ways... PlayHT has generally been fairly closed about what they use, so it would be interesting to know more.
If you are interested in recent, openly sample-able work pushing on this kind of spontaneous expressiveness (sometimes at the expense of typical TTS "quality"), Bark is pretty interesting [1][1a]. The audio quality suffers a bit from how they turn token sequences into waveforms, but the prosody and timing are really interesting.
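For a sense of what sampling from Bark looks like, here is a minimal sketch based on my memory of the suno-ai/bark README; the exact speaker preset name is just an illustrative value, so treat the details as approximate rather than a verified recipe:

    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    # downloads/loads the text, coarse, and fine models on first call
    preload_models()

    # history_prompt selects one of the shipped speaker presets;
    # "v2/en_speaker_6" is just an example preset name
    audio = generate_audio(
        "Hello, my name is Suno. And, uh, I like to [laughs] generate speech.",
        history_prompt="v2/en_speaker_6",
    )

    # audio is a float numpy array at Bark's fixed sample rate (24 kHz)
    write_wav("bark_out.wav", SAMPLE_RATE, audio)

The interesting (and sometimes frustrating) part is that the same text prompt can come out with quite different pacing, laughter, and hesitations from run to run.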
I assume the key factor here is high-quality, emotive audio with a good data-cleaning process. Probably not even a lot of data, at least by the standards of "a lot" in speech, e.g. ASR (millions of hours) or TTS (hundreds to thousands of hours). Rather than some radically new architectural piece never before seen in the literature, there are lots of really nice tools for emotive and expressive TTS buried in the last few years of publications.
Tacotron 2 is perfectly capable of this type of thing as well, as shown by Dessa [2] a few years ago (that writeup is a nice intro to TTS concepts). The limit is largely that, at some point, you haven't heard certain phonetic sounds in a given voice, and you need to do something to get plausible outputs for new voices.
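For anyone who hasn't touched the classic two-stage pipeline (text -> mel spectrogram -> waveform), here is a rough sketch along the lines of the PyTorch Hub example for NVIDIA's Tacotron 2 + WaveGlow checkpoints; the entry-point names and details are from memory, so treat them as an assumption rather than gospel:

    import torch
    from scipy.io.wavfile import write as write_wav

    hub = 'NVIDIA/DeepLearningExamples:torchhub'

    # stage 1: character sequence -> mel spectrogram frames
    tacotron2 = torch.hub.load(hub, 'nvidia_tacotron2').to('cuda').eval()
    # stage 2: mel frames -> waveform (the neural vocoder)
    waveglow = torch.hub.load(hub, 'nvidia_waveglow').to('cuda').eval()
    utils = torch.hub.load(hub, 'nvidia_tts_utils')

    sequences, lengths = utils.prepare_input_sequence(
        ["Most of the expressiveness lives in the predicted mel frames."])

    with torch.no_grad():
        mel, _, _ = tacotron2.infer(sequences, lengths)  # text -> mel
        audio = waveglow.infer(mel)                      # mel -> waveform

    # the published checkpoints run at 22050 Hz
    write_wav("tacotron2_out.wav", 22050, audio[0].cpu().numpy())

The prosody you get out of a setup like this is almost entirely determined by the mel frames the first stage predicts, which is why good expressive training data matters more than the vocoder.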
[0] Tortoise TTS discussion: https://github.com/neonbjb/tortoise-tts/issues/182#issuecomm...
[1] Bark demo (TikTok): https://www.tiktok.com/@jonathanflyfly/video/722513498370947...
[1a] Bark GitHub: https://github.com/suno-ai/bark
[2] Dessa RealTalk writeup: https://medium.com/dessa-news/realtalk-how-it-works-94c1afda...