What does HackerNews think of wer_are_we?

Attempt at tracking the state of the art and recent results (bibliography) on speech recognition.

Great breakdown… with some interesting results and a ton of effort.

Are there any open benchmarks like this for all the models that are actually runnable, like the data exposed in https://github.com/syhw/wer_are_we, but with some of your additional metrics?

Comparing this model's word error rates to the state of the art [1] on a few common test sets:

                           Whisper    SoTA
  LibriSpeech test-clean      2.7%     1.8%
  LibriSpeech test-other      5.6%     2.9%
  Switchboard                13.1%     4.9%
  CallHome                   15.8%     9.5%
The authors do explicitly state that they're trying to do a lot of fancy new stuff here, like being multilingual, rather than just pursuing accuracy.

[1] https://github.com/syhw/wer_are_we

"Human-level intelligence/performance" is a term that is often used in many ML tasks to indicate a top-level performance by a very performant human, to compare the performance to, performance at that specific task which is being discussed. Perhaps not world's best human (but sometimes, like in AlphaStar), but at least someone competent at the task (for example in https://github.com/syhw/wer_are_we). It is just a term to use, to gauge and compare how well a network operates.

You could of course find cases where the term means something else (or nothing at all), since English is a flexible language, but in contexts where we are obviously discussing AI/ML, let's just use the de facto term and make everyone's lives easier.

Hi, I'm the dev behind https://talonvoice.com

I've been working with Facebook's wav2letter project and the results (speed on CPU, command accuracy) are extremely good in my experience. They also hold the "state of the art" for librispeech (a common benchmark) on wer_are_we [1]. Granted, that's with a 2GB model that doesn't run very well on CPU, but I think most of the fully "state of the art" models are computationally expensive and expected to run on GPU. Wav2letter has other models that are very fast on CPU and still extremely accurate.

You can run their "Streaming ConvNets" model on CPU to transcribe multiple live audio streams in parallel; see their wav2letter@anywhere post for more info [2].
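
As a rough illustration of what "multiple streams in parallel" can look like, here is a minimal Python sketch that fans audio sources out to a thread pool; transcribe_stream is a hypothetical placeholder for however you actually invoke the streaming recognizer (binding, RPC, or subprocess), not a wav2letter API.

    # Sketch only: fan several audio sources out to worker threads.
    # transcribe_stream is a hypothetical stand-in for a real call into
    # the streaming recognizer; replace it with your own integration.
    from concurrent.futures import ThreadPoolExecutor

    def transcribe_stream(source: str) -> str:
        # Placeholder: run the streaming model over one audio source
        # and return its transcript.
        return f"<transcript of {source}>"

    sources = ["mic0.wav", "mic1.wav", "call.wav"]  # example inputs

    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        for source, text in zip(sources, pool.map(transcribe_stream, sources)):
            print(source, "->", text)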

I am getting very good accuracy on the in-progress model I am training for command recognition (3.7% word error rate on librispeech clean, about 8% WER on librispeech other, 20% WER on common voice, 3% WER on "speech commands"). I plan to release it alongside my other models here [5] once I'm done working on it.

There's a simple WER comparison between some of the command engines here [3]. Between this and wer_are_we [1] it should give you a general idea of what to expect when talking about Word Error Rate (WER). (Note the wav2letter-talonweb entry in [3] is a rather old model I trained, known to have worse accuracy; it's not even the same NN architecture.)
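
If you're new to the metric: WER is just the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. A minimal, self-contained sketch, not tied to any of the toolkits above:

    # Word Error Rate = (substitutions + deletions + insertions) / reference length,
    # computed via a standard word-level Levenshtein distance.
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edit distance between ref[:i] and hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution
        return dp[len(ref)][len(hyp)] / len(ref)

    # One substitution plus one insertion against a 3-word reference:
    # 2 / 3 ~ 0.67, i.e. roughly 67% WER.
    print(wer("see spot run", "see spot ran today"))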

----

As far as constraining the vocabulary goes, you can try training a KenLM language model for Kaldi, DeepSpeech, or wav2letter by grabbing KenLM and piping normalized text (probably lowercased, with everything except ASCII and quotes removed; see the sketch further down) into lmplz:

    cat corpus.txt | kenlm/build/bin/lmplz -o 4 > model.arpa
And you can turn it into a compressed binary model for wav2letter like this:

    kenlm/build/bin/build_binary -a 22 -q 8 -b 8 trie model.arpa model.bin
There are other options, like using a "strict command grammar", but I don't have enough context as to how you want to program this to guide you there.
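
For reference, the normalization step mentioned above might look something like the following; the exact filtering rules (keep lowercase ASCII letters, digits, and apostrophes) are an assumption on my part, so adjust them to whatever your lexicon expects.

    # normalize.py -- lowercase text and keep only ASCII letters, digits,
    # apostrophes, and spaces before piping it into lmplz, e.g.:
    #   python3 normalize.py < corpus.txt | kenlm/build/bin/lmplz -o 4 > model.arpa
    import re
    import sys

    DROP = re.compile(r"[^a-z0-9' ]+")

    for line in sys.stdin:
        cleaned = DROP.sub(" ", line.lower())
        cleaned = " ".join(cleaned.split())  # collapse repeated whitespace
        if cleaned:
            print(cleaned)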

I also have tooling I wrote around wav2letter, such as wav2train [4] which builds wav2letter training and runtime data files for you.

I'm generally happy to talk more and answer any questions.

----

[1] https://github.com/syhw/wer_are_we

[2] https://ai.facebook.com/blog/online-speech-recognition-with-...

[3] https://github.com/daanzu/kaldi-active-grammar/blob/master/d...

[4] https://github.com/talonvoice/wav2train

[5] https://talonvoice.com/research/

Mozilla recently released Common Voice, a huge free English dataset (1,500 hours and growing), and wer_are_we [1] has shown really impressive accuracy gains in published research over the past few years. Exciting times.

[1] https://github.com/syhw/wer_are_we

I'm a little confused about the title because the first paper is from 2014.

It's also too bad this doesn't mention any traditional HMM-based ASR techniques, as HMMs continue to be used in many SOTA systems, particularly those that can be reproduced publicly: https://github.com/syhw/wer_are_we

This looks like it's mostly/only about images.

Here's something similar for speech recognition: https://github.com/syhw/wer_are_we

Since Kaldi is a toolkit, it can be used to build nearly any ASR architecture. See here [0] for a comprehensive comparison of the Word Error Rate of various architectures.

[0]: https://github.com/syhw/wer_are_we