I created a toolkit to evaluate many different speech recognition engines.
https://github.com/robmsmt/SpeechLoop
Comparing speech systems can take a long time, especially for a dev who doesn't have a background in audio/ML. How do you know which one will work best? Will the shiny new transformer model perform well enough? Most people end up throwing their data at one of the big tech companies' existing APIs. Whilst this is convenient, I think it's a travesty that open-source speech systems are not as easy to use. I'm hoping to change that by making it easy to evaluate and compare them!
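To give a rough idea of what "compare" can mean in practice: one common yardstick is word error rate (WER), the word-level edit distance between a reference transcript and an engine's output, divided by the number of reference words. The sketch below is only an illustration and is not taken from the SpeechLoop code; the function name, the engine names, and the example transcripts are all made up, and the lowercase-and-split normalisation is a simplifying assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name, not part of any toolkit).
    Normalisation here is just lowercasing and whitespace splitting.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up transcripts from two made-up engines, for illustration only.
    reference = "the cat sat on the mat"
    outputs = {
        "engine_a": "the cat sat on the mat",
        "engine_b": "the cat sat on mat",
    }
    for name, text in outputs.items():
        print(name, round(word_error_rate(reference, text), 3))
```

Lower is better, and a single number like this hides a lot (latency, cost, how errors are distributed), so it is a starting point for comparison rather than the whole picture.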
I'm likely to lose the use of my hands in the next few years, so for a few years now I've been trying to figure this out from the user's perspective (on Linux), to get set up with and used to the tools I'll need later in life.
I've been using Almond, but it's really not good. I don't know how I might help but I'm definitely interested in the results... if I could use a high quality microphone to open a program, select menus, and type accurately (and have commands to press arrow keys) I think I'd be all set. I would be able to do anything I wanted, even if it was a bunch of steps.
I remember Dragon NaturallySpeaking in like 1995 being basically capable of doing all of this. I was able to completely control a computer with speech back then, and now I can't. That's extremely strange after 26 years of development.
It is as if all the tools try to be so clever instead of assuming the user can learn new tricks. To me it should be the same as learning to type or use a mouse. Yeah, I used to have to say "backspace backspace period space capital while" to get fine details, but at least it was possible. I could even select things with voice commands. I just hope that we don't lose sight of the value of voice recognition as a general input device in the search for which model performs best on accuracy alone.
I've not heard of Almond, but I have seen the following projects which might be helpful:
- Dragonfly: https://github.com/dictation-toolbox/dragonfly
- Demo: https://www.youtube.com/watch?v=Qk1mGbIJx3s / Software: https://github.com/daanzu/kaldi-active-grammar
Far-field audio is usually harder for any speech system to get right, so having a good-quality mic and using it close up will _usually_ help with transcription quality. As a long-time Linux user, I would love to see it get some more powerful voice tools, and I really hope that this opens up over the next few years. Feel free to drop me an email (it's on my profile); I'm happy to help with setup on any of the above.