What does HackerNews think of gentle?

gentle forced aligner

Language: Python

Thank you!

Yes, exactly. We do forced alignment when you edit your transcript. The new words don't have any timestamps, so we need to align them. For short sections we use interpolation. If we need to align whole sections, we use Gentle[^1].

[^1]: https://github.com/lowerquality/gentle
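The interpolation part for short edits can be sketched in a few lines. This is a naive illustration (mine, not the commenter's actual code), assuming the new words spread evenly across the gap between the surrounding aligned words:

```python
def interpolate_timestamps(start, end, new_words):
    """Spread freshly edited words evenly across a known time gap.

    `start`/`end` come from the aligned words on either side of the edit.
    Even spacing is a simplification; a real system might weight by
    word length or syllable count.
    """
    step = (end - start) / len(new_words)
    return [
        {"word": w, "start": start + i * step, "end": start + (i + 1) * step}
        for i, w in enumerate(new_words)
    ]

# Three new words squeezed into a 1.2 s gap between aligned neighbours.
print(interpolate_timestamps(10.0, 11.2, ["a", "quick", "edit"]))
```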

Transcribe does “Recognize Multiple Speakers” — it’s on the page that you linked to, in the list of features.

I would also recommend comparing it to Google’s and maybe MS/Azure’s services.

Aside from the fidelity of the transcription itself, and the accuracy of disambiguating the different speakers, I’m also not convinced that all of these services will give you an end timestamp (start timestamps are sometimes there, but not necessarily for every word or sentence).

You could try a multi-pronged approach: run the transcription to get text and speakers only (using the services mentioned above), then use an aligner such as “gentle” [0] to find the start/end times (a rough sketch follows below).

You will have some gaps and wrongly transcribed words, but it may be a start!

[0] https://github.com/lowerquality/gentle
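To make the gentle step concrete: the project ships an HTTP server (the curl equivalent is in its README), and calling it from Python looks roughly like this. The filenames are placeholders for your own audio and the transcript from your STT service:

```python
import requests

# Assumes a local gentle server, e.g. `docker run -p 8765:8765 lowerquality/gentle`.
with open("interview.wav", "rb") as audio, open("transcript.txt", "rb") as text:
    resp = requests.post(
        "http://localhost:8765/transcriptions?async=false",
        files={"audio": audio, "transcript": text},
    )

for w in resp.json()["words"]:
    if w["case"] == "success":  # gentle flags words it could not align
        print(f"{w['word']}: {w['start']:.2f}s to {w['end']:.2f}s")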

For people who want a simple, out-of-the-box tool (not necessarily in Python) for just getting phonemes, I can also recommend [0]. The recognition quality isn't amazing, but setup is dead simple, and it is possible to integrate a language model as well (I never needed one for my task). The author showed it in [1] as well, but kind of skimmed right by it; to me, if you want to get to know speech recognition in detail, pocketsphinx-python is one of the best ways to do it. Customizing the language model is a huge boost for domain-specific recognition.
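As an illustration, phoneme decoding with pocketsphinx-python [1] looks roughly like this: the `allphone` option replaces the word language model with a phone-level one, so the decoder emits phonemes instead of words. The tuning values are adapted from the project README, so treat them as a starting point rather than gospel:

```python
import os
from pocketsphinx import Pocketsphinx, get_model_path

model_path = get_model_path()

ps = Pocketsphinx(
    hmm=os.path.join(model_path, "en-us"),
    allphone=os.path.join(model_path, "en-us-phone.lm.bin"),
    lw=2.0,
    beam=1e-20,
    pbeam=1e-20,
)
ps.decode(audio_file="speech.wav")  # 16 kHz mono audio
print(ps.segments())  # e.g. ['SIL', 'HH', 'AH', 'L', 'OW', ...]
```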

Large company APIs will usually be better at generic-speaker, generic-language recognition, but if you can do speaker adaptation and customize the language model, some insane gains are possible, since you prune out a lot of uncertainty and complexity.
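Swapping in a domain language model with pocketsphinx is just a config change. The model and dictionary files below are hypothetical stand-ins for ones trained on in-domain text (e.g. with the CMU LM toolkit):

```python
from pocketsphinx import Pocketsphinx

# Hypothetical in-domain files: an n-gram LM trained on domain text and a
# dictionary restricted to the vocabulary you actually expect to hear.
ps = Pocketsphinx(
    lm="domain.lm.bin",
    dict="domain.dict",
)
ps.decode(audio_file="dictation.wav")
print(ps.hypothesis())
```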

If you are more interested in recognition and alignment to a script, "gentle" is great [2][3]. The guts also have raw Kaldi recognition, which is pretty good for a generic speech recognizer, but you would need to do some coding to pull that part out on its own.

For a decent-performing deep model, check out Mozilla's implementation of Baidu's DeepSpeech [4].
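A minimal DeepSpeech transcription call looks roughly like this. The filenames follow the 0.9.x release artifacts; adjust them to whichever checkpoint you download:

```python
import wave
import numpy as np
from deepspeech import Model

model = Model("deepspeech-0.9.3-models.pbmm")
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

with wave.open("speech.wav", "rb") as w:  # expects 16 kHz mono 16-bit PCM
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(model.stt(audio))
```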

If doing full-on development, my colleague has been using a bridge between PyTorch (for training) and Kaldi (to use their decoders) with good success [5].

[0] how I use pocketsphinx to get phonemes, https://github.com/kastnerkyle/ez-phones

[1] https://github.com/cmusphinx/pocketsphinx-python

[2] https://github.com/lowerquality/gentle

[3] how I use gentle for forced alignment, https://github.com/kastnerkyle/raw_voice_cleanup/tree/master...

[4] https://github.com/mozilla/DeepSpeech

[5] https://github.com/mravanelli/pytorch-kaldi