What does HackerNews think of Real-Time-Voice-Cloning?

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Language: Python

#14 in Deep learning
#102 in Python
#8 in Tensorflow
Man, based on how often Alexa misunderstands me or tries to upsell me on some feature after I get what I want from it, mine is going to sound really pissed off.

I also can’t imagine speaking to my Grandma/pa the way I speak to Alexa either, so that’s food for thought.

Jests aside, this is already possible with ML and a large enough data set. There's nothing state-of-the-art to see here, right? Just implementing existing tech at enterprise scale and telling the masses about it?

For examples, see 15.ai or https://github.com/CorentinJ/Real-Time-Voice-Cloning. There's another commercialized service similar to the latter that I saw recently, but I can't recall the name. I wish the article had some more technical details.

From README:

> This repository is forked from Real-Time-Voice-Cloning which only support English.

https://github.com/CorentinJ/Real-Time-Voice-Cloning

I'm the author of FakeYou.com, so I have a little experience in this area. (We used to train GlowTTS models ourselves before turning it over to our users, which has had mixed results in terms of quality.)

This appears to be a repackaging of Real-Time-Voice-Cloning [1], albeit with a few additions, such as GSTs (global style tokens).

No matter what the repo claims, your results will depend on high-quality data. Lots of it, and with ample fine-tuning. Demo videos are absolutely cherry-picked.

If you're picking this up for a project, HiFi-GAN is pretty much the best vocoder right now. Tacotron still produces great results, though there are lots of other interesting model architectures.

[1] https://github.com/CorentinJ/Real-Time-Voice-Cloning
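For anyone unfamiliar with the repo being discussed: it implements the three-stage SV2TTS design. A speaker encoder (GE2E) turns a few seconds of reference audio into a fixed-size embedding, a synthesizer (Tacotron) generates a mel spectrogram from text conditioned on that embedding, and a vocoder (WaveRNN in the original repo; HiFi-GAN per the recommendation above) turns the spectrogram into a waveform. Here's a minimal stub sketch of that data flow only, with toy arithmetic standing in for the trained networks; none of this is the repo's actual code:

```python
# Stub sketch of the SV2TTS pipeline: encoder -> synthesizer -> vocoder.
# All three stages are toy math; the real project uses trained neural nets
# (GE2E speaker encoder, Tacotron synthesizer, WaveRNN vocoder).
import math
import random

EMBED_DIM = 8   # real speaker embeddings are larger (e.g. 256-dim)
N_MELS = 4      # real mel spectrograms use ~80 mel channels

def encode_speaker(reference_audio: list[float]) -> list[float]:
    """Stub speaker encoder: map a reference clip to a fixed-size embedding."""
    chunk = max(1, len(reference_audio) // EMBED_DIM)
    emb = [sum(reference_audio[i * chunk:(i + 1) * chunk]) for i in range(EMBED_DIM)]
    norm = math.sqrt(sum(x * x for x in emb)) or 1.0
    return [x / norm for x in emb]  # L2-normalized, as real embeddings are

def synthesize(text: str, embedding: list[float]) -> list[list[float]]:
    """Stub synthesizer: emit one mel-spectrogram-shaped frame per character,
    conditioned on the speaker embedding."""
    rng = random.Random(len(text))
    return [[rng.random() + embedding[m % EMBED_DIM] for m in range(N_MELS)]
            for _ in range(max(1, len(text)))]

def vocode(mel: list[list[float]], hop: int = 16) -> list[float]:
    """Stub vocoder: expand each mel frame into `hop` audio samples."""
    return [frame[0] * math.sin(2 * math.pi * i / hop)
            for frame in mel for i in range(hop)]

reference = [math.sin(0.01 * i) for i in range(1600)]  # stand-in for reference audio
emb = encode_speaker(reference)
mel = synthesize("hello world", emb)   # 11 characters -> 11 frames
wav = vocode(mel)                      # 11 frames * 16 samples each
```

The point of the structure is the one the parent comments make: the vocoder is a swappable final stage, which is why you can keep the encoder/synthesizer and drop in HiFi-GAN.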