Author here. I would add to what the sibling comments have mentioned that SotA results should be taken with a grain of salt. Our engine is capable of streaming (processing the audio as it is being recorded), which is not possible with architectures that have bidirectional decoders or attention mechanisms requiring the whole encoder input ahead of time.
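
To make the architectural point concrete, here is a minimal sketch (a toy model, not our engine; shapes and vocab size are made up) of why a unidirectional encoder can stream: it carries its state across chunks and emits partial output immediately, whereas a bidirectional or full-attention encoder needs the entire utterance before it can produce anything.

    # Toy illustration of streaming ASR inference (hypothetical model, not our engine).
    import torch
    import torch.nn as nn

    class StreamingEncoder(nn.Module):
        def __init__(self, n_mels=80, hidden=256, vocab=29):
            super().__init__()
            self.rnn = nn.LSTM(n_mels, hidden, batch_first=True)  # unidirectional: no future context needed
            self.out = nn.Linear(hidden, vocab)                    # per-frame logits (CTC-style)

        def forward(self, feats, state=None):
            enc, state = self.rnn(feats, state)   # state carries context across chunks
            return self.out(enc), state

    model = StreamingEncoder().eval()
    state = None
    with torch.no_grad():
        # Pretend these feature chunks arrive from the microphone in real time.
        for chunk in torch.randn(5, 1, 10, 80):      # 5 chunks of 10 frames x 80 mel bins
            logits, state = model(chunk, state)      # partial output is available per chunk
            print(logits.argmax(-1))                 # greedy labels, emitted while audio keeps arriving

A bidirectional encoder (or full self-attention over the utterance) cannot do this per-chunk loop, because each output frame depends on audio that has not been recorded yet.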

For real-world applications this is absolutely crucial: users want latency on the order of milliseconds, not seconds. This is why, if you run a standard test set like LibriSpeech against, say, a commercial offering from Google, it will perform considerably worse than the state of the art reported in Google's papers.

This repository [0] has a benchmark of some commercial offerings. Our model beats all of them on LibriSpeech clean and other (except for Speechmatics on LibriSpeech clean), as well as on Common Voice. Note, however, that the Common Voice corpus used in that benchmark is very old.

In sum, I would compare this against solutions that target the same space: fast, client-side ASR, rather than against the state of the art.

[0] https://github.com/Franck-Dernoncourt/ASR_benchmark#benchmar...

NVIDIA has QuartzNet, which contains only ~19M weights and achieves 3.9% WER on LibriSpeech test-clean without a language model and less than 3% with an LM.
Code (PyTorch): https://github.com/NVIDIA/NeMo
Paper: https://arxiv.org/pdf/1910.10261.pdf
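
If you want to try it, recent NeMo releases expose pretrained QuartzNet checkpoints roughly like this (the exact class, method, and model name can differ between NeMo versions, so treat this as a sketch and check their docs):

    # Rough sketch of loading a pretrained QuartzNet checkpoint via NeMo.
    import nemo.collections.asr as nemo_asr

    model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")
    print(model.transcribe(["sample.wav"]))  # path to a 16 kHz mono WAV file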