What does HackerNews think of tacotron?

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)

Language: Python

#102 in Python
#8 in TensorFlow
That is interesting! Can you describe how that transcription process works? If I have an audiobook and the corresponding ebook/PDF, isn't that already transcribed? Or does transcription here mean something else?

I'd also be happy to use an existing voice (the English one from Keith Ito sounds pleasant enough), but I am confused about how to use it to read a book. There is code for a model that learns to synthesize speech from the data: https://github.com/keithito/tacotron, but I don't see how to get at the end result, which I would hope is also available somewhere so I can just use it to read something.
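If I understand the repo correctly, something like this sketch is what I'm after. This assumes the Synthesizer class that the repo's demo_server.py uses, with load() taking a checkpoint path and synthesize() returning WAV bytes; the file paths here are hypothetical:

```python
# Sketch only: assumes the Synthesizer class from keithito/tacotron
# (imported the way demo_server.py does) and a downloaded pre-trained
# checkpoint; paths are hypothetical.
from synthesizer import Synthesizer

synth = Synthesizer()
synth.load('logs-tacotron/model.ckpt')  # hypothetical checkpoint path

# Write one WAV per non-empty line of text; synthesize() is assumed to
# return WAV bytes, as the demo server streams them back over HTTP.
with open('book.txt') as f:
    for i, line in enumerate(l.strip() for l in f):
        if line:
            with open('chunk_%04d.wav' % i, 'wb') as out:
                out.write(synth.synthesize(line))
```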

Worth noting that a big chunk of the core TTS code here builds on tools from other researchers like Ryuichi Yamamoto and Keith Ito [0], and they have great implementations worth checking out as well.

The best quality I have heard in OSS is probably [1], from Ryuichi, using the Tacotron 2 implementation of Rayhane Mamah [2], which is loosely what NVIDIA based some of their recent baseline code on as well [3][4].

There's also a Colab notebook for this, so you can try it directly without any pain: https://colab.research.google.com/github/r9y9/Colaboratory/b...

I also have my own pipeline for this (using some utilities from the above authors plus a lot of my own hacks) for a forthcoming paper release: https://github.com/kastnerkyle/representation_mixing/tree/ma... (see the minimal demo). It has pretty fast sampling, but the audio quality is not as high as WaveNet's. I'd really like to tie in WaveGlow [3], but that is still work in progress for me.
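If I do wire it in, the WaveGlow side is conceptually simple; a rough sketch, assuming a pre-trained checkpoint in the format the NVIDIA/waveglow repo [3] ships (its inference script loads the model this way), with hypothetical file paths:

```python
# Sketch: vocode a mel spectrogram with a pre-trained WaveGlow checkpoint,
# assuming the checkpoint format from the NVIDIA/waveglow repo [3].
import torch

waveglow = torch.load('waveglow_checkpoint.pt')['model']  # hypothetical path
waveglow = waveglow.remove_weightnorm(waveglow)  # as in the repo's inference.py
waveglow.cuda().eval()

# The mel must match the features the model was trained on (e.g. 80 mel
# bins), shaped (batch, n_mels, n_frames); file path is hypothetical.
mel = torch.load('mel.pt').cuda()
with torch.no_grad():
    audio = waveglow.infer(mel, sigma=0.666)  # (batch, n_samples) float audio
```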

NOTE: None of these have voice adaptivity per se, but given a model that already trains well plus a multi-speaker dataset with speaker IDs, such as VCTK, a lot of things become possible; getting a baseline model and data pipeline for TTS is the genuinely difficult part.
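The usual trick for the multi-speaker case is a learned speaker embedding broadcast-concatenated onto the encoder outputs before the attention/decoder stack; a generic PyTorch sketch, not taken from any of the repos above (names and sizes are made up):

```python
# Generic sketch of speaker-ID conditioning: look up a learned embedding
# per speaker and broadcast-concat it onto the text encoder outputs.
import torch
import torch.nn as nn

class SpeakerConditioner(nn.Module):
    def __init__(self, num_speakers, speaker_dim):
        super().__init__()
        self.embed = nn.Embedding(num_speakers, speaker_dim)

    def forward(self, encoder_out, speaker_ids):
        # encoder_out: (batch, time, channels); speaker_ids: (batch,)
        spk = self.embed(speaker_ids)                      # (batch, speaker_dim)
        spk = spk.unsqueeze(1).expand(-1, encoder_out.size(1), -1)
        return torch.cat([encoder_out, spk], dim=-1)       # channels + speaker_dim

# e.g. VCTK has ~109 speakers:
cond = SpeakerConditioner(num_speakers=109, speaker_dim=64)
out = cond(torch.randn(2, 50, 256), torch.tensor([3, 17]))
```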

[0] https://github.com/keithito/tacotron

[1] https://r9y9.github.io/blog/2018/05/20/tacotron2/

[2] https://github.com/Rayhane-mamah/Tacotron-2

[3] https://github.com/NVIDIA/waveglow

[4] https://github.com/NVIDIA/tacotron2

Thank you so much for this link; that is the best text-to-speech with an open architecture I've heard so far. Under https://github.com/keithito/tacotron you can find a pre-trained model based on this paper, although it doesn't match that quality yet. Maybe I can get some cluster time to train a new model using multiple datasets.

Edit: Another interesting one: http://research.baidu.com/deep-voice-3-2000-speaker-neural-t...