What does HackerNews think of wer_are_we?

Attempt at tracking the state of the art and recent results (bibliography) on speech recognition.

Great breakdown… with some interesting results and a ton of effort.

Are there any open benchmarks like this for all the models that are actually runnable, like the data exposed in https://github.com/syhw/wer_are_we, but with some of your additional metrics?

Comparing this model's word error rates to the state of the art [1] on a few common test sets:

                           Whisper    SoTA
  LibriSpeech test-clean      2.7%     1.8%
  LibriSpeech test-other      5.6%     2.9%
  Switchboard                13.1%     4.9%
  CallHome                   15.8%     9.5%
The authors do explicitly state that they're trying to do a lot of fancy new stuff here, like being multilingual, rather than just pursuing accuracy.

[1] https://github.com/syhw/wer_are_we

"Human-level intelligence/performance" is a term that is often used in many ML tasks to indicate a top-level performance by a very performant human, to compare the performance to, performance at that specific task which is being discussed. Perhaps not world's best human (but sometimes, like in AlphaStar), but at least someone competent at the task (for example in https://github.com/syhw/wer_are_we). It is just a term to use, to gauge and compare how well a network operates.

You could of course find cases where the term means something else (or nothing at all), since English is a flexible language, but in contexts where we are obviously discussing AI/ML, let's just use the de facto term and make everyone's lives easier.

Hi, I'm the dev behind https://talonvoice.com

I've been working with Facebook's wav2letter project and the results (speed on CPU, command accuracy) are extremely good in my experience. They also hold the "state of the art" for librispeech (a common benchmark) on wer_are_we [1]. Granted, that's with a 2GB model that doesn't run very well on CPU, but I think most of the fully "state of the art" models are computationally expensive and expected to run on GPU. Wav2letter has other models that are very fast on CPU and still extremely accurate.

You can run their "Streaming ConvNets" model on CPU to transcribe multiple live audio streams in parallel; see their wav2letter@anywhere post for more info [2].
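
As a rough illustration of what "multiple streams in parallel" can look like, here is a minimal Python sketch that fans audio sources out to a thread pool; transcribe_stream is a hypothetical placeholder for however you actually invoke the streaming recognizer (binding, RPC, or subprocess), not a wav2letter API.

    # Sketch only: fan several audio sources out to worker threads.
    # transcribe_stream is a hypothetical stand-in for a real call into
    # the streaming recognizer; replace it with your own integration.
    from concurrent.futures import ThreadPoolExecutor

    def transcribe_stream(source: str) -> str:
        # Placeholder: run the streaming model over one audio source
        # and return its transcript.
        return f"<transcript of {source}>"

    sources = ["mic0.wav", "mic1.wav", "call.wav"]  # example inputs

    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        for source, text in zip(sources, pool.map(transcribe_stream, sources)):
            print(source, "->", text)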

I am getting very good accuracy on the in-progress model I am training for command recognition (3.7% word error rate on librispeech clean, about 8% WER on librispeech other, 20% WER on common voice, 3% WER on "speech commands"). I plan to release it alongside my other models here [5] once I'm done working on it.

There's a simple WER comparison between some of the command engines here [3]. Between this and wer_are_we [1] it should give you a general idea of what to expect when talking about Word Error Rate (WER). (Note the wav2letter-talonweb entry in [3] is a rather old model I trained, known to have worse accuracy; it's not even the same NN architecture.)
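
If you're new to the metric: WER is just the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. A minimal, self-contained sketch, not tied to any of the toolkits above:

    # Word Error Rate = (substitutions + deletions + insertions) / reference length,
    # computed via a standard word-level Levenshtein distance.
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edit distance between ref[:i] and hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution
        return dp[len(ref)][len(hyp)] / len(ref)

    # One substitution plus one insertion against a 3-word reference:
    # 2 / 3 ~ 0.67, i.e. roughly 67% WER.
    print(wer("see spot run", "see spot ran today"))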

----

As far as constraining the vocabulary goes, you can try training a KenLM language model for Kaldi, DeepSpeech, or wav2letter by grabbing KenLM and piping normalized text (probably lowercased, with everything except ASCII and quotes removed; see the sketch further down) into lmplz:

    cat corpus.txt | kenlm/build/bin/lmplz -o 4 > model.arpa
And you can turn it into a compressed binary model for wav2letter like this:

    kenlm/build/bin/build_binary -a 22 -q 8 -b 8 trie model.arpa model.bin
There are other options, like using a "strict command grammar", but I don't have enough context as to how you want to program this to guide you there.
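
For reference, the normalization step mentioned above might look something like the following; the exact filtering rules (keep lowercase ASCII letters, digits, and apostrophes) are an assumption on my part, so adjust them to whatever your lexicon expects.

    # normalize.py -- lowercase text and keep only ASCII letters, digits,
    # apostrophes, and spaces before piping it into lmplz, e.g.:
    #   python3 normalize.py < corpus.txt | kenlm/build/bin/lmplz -o 4 > model.arpa
    import re
    import sys

    DROP = re.compile(r"[^a-z0-9' ]+")

    for line in sys.stdin:
        cleaned = DROP.sub(" ", line.lower())
        cleaned = " ".join(cleaned.split())  # collapse repeated whitespace
        if cleaned:
            print(cleaned)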

I also have tooling I wrote around wav2letter, such as wav2train [4] which builds wav2letter training and runtime data files for you.

I'm generally happy to talk more and answer any questions.

----

[1] https://github.com/syhw/wer_are_we

[2] https://ai.facebook.com/blog/online-speech-recognition-with-...

[3] https://github.com/daanzu/kaldi-active-grammar/blob/master/d...

[4] https://github.com/talonvoice/wav2train

[5] https://talonvoice.com/research/

Mozilla recently released Common Voice, a huge free English dataset (1,500 hours and growing), and wer_are_we [1] has shown really impressive accuracy gains in published research over the past few years. Exciting times.

[1] https://github.com/syhw/wer_are_we

I'm a little confused about the title because the first paper is from 2014.

It's also too bad this doesn't mention any traditional HMM-based ASR techniques, as HMMs continue to be used in many SOTA systems, particularly those that can be reproduced publicly: https://github.com/syhw/wer_are_we

This looks like it's mostly/only about images.

Here's something similar for speech recognition: https://github.com/syhw/wer_are_we

Since Kaldi is a toolkit, it can be used to build nearly any ASR architecture. See here [0] for a comprehensive comparison of the Word Error Rate of various architectures.

[0]: https://github.com/syhw/wer_are_we