I ran into some minor glitches trying to install and use DeepSpeech a couple of days ago. I'm sure they'll be fixed soon enough, but meanwhile I hope this helps: https://www.phpied.com/taking-mozillas-deepspeech-for-a-spin...

It only works on "short" audio clips, about 5 seconds or so. (We should have documented this better; I've just put in a PR adding it to the documentation.)

However, you can use voice activity detection (VAD), for example webrtcvad from PyPI, to chop long audio into smaller chunks it can digest.
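For what it's worth, here's a minimal sketch of that approach. It assumes 16-bit mono PCM WAV input at one of the sample rates webrtcvad accepts (8/16/32/48 kHz); the 30 ms frame size and 300 ms silence threshold are just illustrative defaults I picked, not anything DeepSpeech requires:

    # pip install webrtcvad
    import wave
    import webrtcvad

    def split_on_silence(path, aggressiveness=2, frame_ms=30, min_silence_ms=300):
        with wave.open(path, "rb") as wf:
            assert wf.getnchannels() == 1 and wf.getsampwidth() == 2
            rate = wf.getframerate()  # webrtcvad needs 8000/16000/32000/48000 Hz
            pcm = wf.readframes(wf.getnframes())

        vad = webrtcvad.Vad(aggressiveness)  # 0 = least aggressive, 3 = most
        frame_bytes = int(rate * frame_ms / 1000) * 2  # 2 bytes per 16-bit sample
        silence_frames_needed = min_silence_ms // frame_ms

        chunks, current, silent_run = [], bytearray(), 0
        for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
            frame = pcm[i:i + frame_bytes]
            if vad.is_speech(frame, rate):
                current.extend(frame)
                silent_run = 0
            else:
                silent_run += 1
                # Close out a chunk once we've seen enough consecutive silence.
                if silent_run >= silence_frames_needed and current:
                    chunks.append(bytes(current))
                    current = bytearray()
        if current:
            chunks.append(bytes(current))
        return chunks

Each returned chunk is raw 16-bit PCM, so you can write one to a temp WAV (or pass the buffer directly, depending on your client version) and run the model over the chunks one at a time.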

Maybe we should just put VAD in the client and have this happen automatically?

Out of interest, are you also working on the reverse, text-to-speech? Sadly, most open source engines still can't compete with commercial alternatives.

Maybe Tacotron will interest you? It's an end-to-end model that's reasonably close to the state of the art:

https://google.github.io/tacotron/publications/tacotron/inde...

There are some open source implementations.

Thank you so much for this link; that is the best text-to-speech with an open architecture I've heard so far. At https://github.com/keithito/tacotron you can find a pre-trained model based on this paper, although it doesn't match the paper's quality yet. Maybe I can get some cluster time to train a new model using multiple datasets.

Edit: Another interesting one: http://research.baidu.com/deep-voice-3-2000-speaker-neural-t...