> Align text to voice (the hardest part) using some private api

That's also the part that would be most interesting to have explained. Is it language-agnostic? After all, the title says "in any language", but I can't think of any text-audio alignment algorithms that don't require a language-specific model. (Unless you just count characters and assume they map linearly to time, which I'd expect to go very badly.)

Having worked for many years in a linguistics research lab where we spent a lot of money paying people to edit and align subtitles and audio transcripts, and having largely written what was at the time the most sophisticated subtitle-and-transcript editing tool available, I can confirm: counting characters and mapping them linearly to timespan, even after isolating vocals, does indeed go very poorly. And much worse when there's singing involved.
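
For reference, the naive baseline under discussion fits in a few lines. This is a minimal sketch of the character-counting approach only, not anything the original project is known to use; `linear_align` is a name made up for illustration, and it assumes you already know the total duration of the audio in seconds.

```python
def linear_align(lines, duration):
    """Give each line a (start, end) span proportional to its character count."""
    total = sum(len(line) for line in lines)
    if total == 0:
        raise ValueError("no characters to align")
    spans, elapsed = [], 0
    for line in lines:
        start = duration * elapsed / total
        elapsed += len(line)
        end = duration * elapsed / total
        spans.append((start, end))
    return spans


if __name__ == "__main__":
    # Two lines of text spread over a 10-second clip.
    for span in linear_align(["Hello there", "General Kenobi"], 10.0):
        print(span)
```

It has no notion of pauses, instrumental breaks, or held notes, so every error accumulates into all the lines that follow.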

So let’s play: if you can guess the alignment method, I’ll open source it :)

Alternately, since you say speech recognition isn't "even close", I might try going the other way--doing text-to-speech on the known text, attempting to align the synthesized speech track with the real one, and then back-porting the timecodes from the audio alignment onto the text.

But that seems a lot more complicated... so, unlikely.
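
To make that idea concrete, here is a rough sketch of the alignment step using plain dynamic time warping (DTW). Everything in it is an assumption for illustration: `dtw_path` and `transfer_boundaries` are made-up names, and the "features" are single numbers per frame, where a real system would use something like MFCC vectors extracted from both tracks. Because you synthesized the second track yourself, you know at which frame each line of text starts, and the warping path carries those boundaries over to the real recording.

```python
def dtw_path(real, synth, dist=lambda x, y: abs(x - y)):
    """Return the DTW warping path as a list of (real_index, synth_index)."""
    n, m = len(real), len(synth)
    inf = float("inf")
    # cost[i][j] = cheapest cost of aligning real[:i] with synth[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(real[i - 1], synth[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Walk back from the end, always taking the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def transfer_boundaries(path, synth_starts):
    """Map known line-start frames in the synthetic track to real-audio frames."""
    first_real = {}
    for real_i, synth_j in path:
        first_real.setdefault(synth_j, real_i)
    return [first_real[j] for j in synth_starts]


if __name__ == "__main__":
    real = [0, 0, 1, 1, 2, 2]   # toy per-frame features of the recording
    synth = [0, 1, 2]           # toy per-frame features of the TTS output
    path = dtw_path(real, synth)
    print(transfer_boundaries(path, [0, 1, 2]))
```

Frame indices become timecodes once you multiply by the hop length used for feature extraction. The quadratic cost table is the reason real implementations usually restrict the search to a band around the diagonal.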

A way to cheat that would probably work well enough most of the time would be to do spectrographic analysis on the audio stream to identify syllables, and then similarly just count syllables in the known text and line those up. That works better the more consistent your spelling system is, though, and still requires language-specific modelling. If you want to do a decent job cross-linguistically, you'd need, in the general case, a dictionary for every supported language listing the syllable count for each word (because not everybody's orthography is transparent enough to make simple models like counting character sequences work).
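
A toy version of that cheat, to show where the language dependence creeps in. This is only a sketch under loud assumptions: `onsets` stands in for syllable onset times (in seconds) that some audio analysis step has already produced, the function names are invented, and the "one syllable per run of vowel letters" rule is exactly the kind of crude orthographic model that only holds for some spelling systems.

```python
import re

# Assumption: a run of Latin vowel letters marks one syllable.
VOWEL_GROUP = re.compile(r"[aeiouy]+", re.IGNORECASE)


def count_syllables(line):
    """Crude estimate: one syllable per run of vowel letters."""
    return len(VOWEL_GROUP.findall(line))


def align_by_syllables(lines, onsets):
    """Give each line the onset time of its first syllable, or None if we run out."""
    starts, used = [], 0
    for line in lines:
        starts.append(onsets[used] if used < len(onsets) else None)
        used += count_syllables(line)
    return starts


if __name__ == "__main__":
    onsets = [0.0, 0.5, 1.0, 2.0, 2.5, 3.0, 3.5]  # hypothetical detector output
    print(align_by_syllables(["la la la", "do re mi fa"], onsets))
```

The rule already miscounts ordinary English: "make" comes out as two syllables because of the silent "e". That is the case for per-language dictionaries in a nutshell.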

If you actually have a fully language-agnostic algorithm for aligning text to audio that's decently accurate, though, that's gotta be worth at least a Master's degree in computational linguistics, 'cause on the face of it, it doesn't seem to me (who has such a Master's degree) that it should even theoretically be possible.

You are close enough, so I have to keep my word. I’m not a genius, just a Lego builder. I’ve tried a lot of methods, from DL to ML, but the aeneas project (with some optimizations) gave me the best results. Amazing project and even better personality. Take a look at https://github.com/readbeyond/aeneas. Together with espeak-ng, you can get good results for line-level alignment in 108 languages.