In case anyone knows, what's the defensible moat here?

I can get almost the same quality using open source models. Plus I can fine-tune them to get custom voices. That means any company that needs TTS is better off paying me once to build them a customized open source solution instead of paying this company per minute forever.

Hmmm, I don't know of any open source project that gets similar quality. Can you name one? This one also allows fine-tuning for custom voices on a minute of audio, and it works great.

TorToiSe[0] is pretty good, but I agree 11 is currently state of the art. It won't be long until GP is correct, though. 1.5 years at most is my guess. The next moat will be multiple languages and maybe something like finer control over tone, which is perhaps better suited to a product.
