What does HackerNews think of wer_are_we?
An attempt at tracking the state of the art and recent results (bibliography) in speech recognition.
Are there any open benchmarks like this for models that are actually runnable, like the data exposed in https://github.com/syhw/wer_are_we, but with some of your additional metrics?
    Dataset                  Whisper    SoTA
    LibriSpeech test-clean    2.7%      1.8%
    LibriSpeech test-other    5.6%      2.9%
    Switchboard              13.1%      4.9%
    CallHome                 15.8%      9.5%
The authors do explicitly state that they're trying to do a lot of fancy new stuff here, like being multilingual, rather than pursuing accuracy alone.

You could of course find cases where this term makes sense in other ways (or not at all), since English is a flexible language, but I think that in areas where we are obviously discussing AI/ML, let's just use the de facto term and make everyone's lives easier.
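If you want to try Whisper on your own audio and compare against the numbers above, here is a minimal sketch, assuming the openai-whisper package is installed and ffmpeg is on PATH for decoding; "audio.wav" is a placeholder filename:

    # Minimal Whisper transcription sketch.
    # Assumes: pip install openai-whisper, plus ffmpeg available for audio decoding.
    import whisper

    model = whisper.load_model("base")      # larger checkpoints score closer to the table above
    result = model.transcribe("audio.wav")  # placeholder input file
    print(result["text"])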
I've been working with Facebook's wav2letter project, and the results (speed on CPU, command accuracy) are extremely good in my experience. They also hold the state of the art for LibriSpeech (a common benchmark) on wer_are_we [1]. Granted, that's with a 2GB model that doesn't run very well on CPU, but I think most of the fully state-of-the-art models are computationally expensive and expected to run on a GPU. Wav2letter has other models that are very fast on CPU and still extremely accurate.
You can run their "Streaming ConvNets" model on CPU to transcribe multiple live audio streams in parallel; see their wav2letter@anywhere post for more info [2].
I am getting very good accuracy on the in-progress model I am training for command recognition (3.7% word error rate on LibriSpeech clean, about 8% WER on LibriSpeech other, 20% WER on Common Voice, and 3% WER on Speech Commands). I plan to release it alongside my other models here [5] once I'm done working on it.
There's a simple WER comparison between some of the command engines here [3]. Between this and wer_are_we [1], it should give you a general idea of what to expect when talking about Word Error Rate (WER). (Note: the wav2letter-talonweb entry in [3] is a rather old model I trained, known to have worse accuracy; it's not even the same NN architecture.)
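Since WER comes up in all of these comparisons, here is a self-contained sketch of how it is typically computed: word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. This is an illustration, not any particular toolkit's scorer; real evaluations also normalize casing and punctuation first, which can shift the numbers noticeably:

    def wer(reference: str, hypothesis: str) -> float:
        """Word Error Rate: word-level edit distance / reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edit distance between ref[:i] and hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution or match
        return dp[len(ref)][len(hyp)] / len(ref)

    print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ~ 0.167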
----
As far as constraining the vocabulary, you can try training a KenLM language model for Kaldi, DeepSpeech, or wav2letter by grabbing KenLM and piping normalized text (probably lowercase it and remove everything but ASCII letters and quotes; a Python sketch of this normalization follows the commands below) into lmplz:
cat corpus.txt | kenlm/build/bin/lmplz -o 4 > model.arpa
And you can turn it into a compressed binary model for wav2letter like this:

    kenlm/build/bin/build_binary -a 22 -q 8 -b 8 trie model.arpa model.bin
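For illustration, here is a sketch of the normalization step mentioned above, plus scoring a sentence with KenLM's Python bindings. The filenames match the commands above; the query sentence is a placeholder, and the kenlm module is assumed to be installed (pip install from the KenLM repo):

    # Sketch: normalize a corpus for lmplz, then score text with the resulting model.
    # Assumes the kenlm Python bindings are installed.
    import re
    import kenlm

    def normalize(line: str) -> str:
        # Lowercase and drop everything except ASCII letters, apostrophes, and spaces.
        return re.sub(r"[^a-z' ]+", " ", line.lower())

    # Write a normalized corpus; point lmplz at this file instead of the raw text.
    with open("corpus.txt") as src, open("corpus.norm.txt", "w") as dst:
        for line in src:
            dst.write(normalize(line).strip() + "\n")

    # After running lmplz and build_binary as above:
    model = kenlm.Model("model.bin")
    print(model.score("turn the lights off", bos=True, eos=True))  # log10 probability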
There are other options, like using a "strict command grammar", but I don't have enough context as to how you want to program this to guide you there.

I also have tooling I wrote around wav2letter, such as wav2train [4], which builds wav2letter training and runtime data files for you.
I'm generally happy to talk more and answer any questions.
----
[1] https://github.com/syhw/wer_are_we
[2] https://ai.facebook.com/blog/online-speech-recognition-with-...
[3] https://github.com/daanzu/kaldi-active-grammar/blob/master/d...
It's also too bad this doesn't mention any traditional HMM-based ASR techniques, as HMMs continue to be used in many SOTA systems, particularly those that can be reproduced publicly: https://github.com/syhw/wer_are_we
Here's something similar for speech recognition: https://github.com/syhw/wer_are_we