What does HackerNews think of vosk-api?
Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Checking the weather in Sofia, Bulgaria requires current information from the cloud; it's not "random speech". The ESP's speech-recognition capability issues don't mean that you cannot process the speech locally.
The comment was about "voice processing", i.e. sending speech to the cloud, not about sending an API request to fetch the weather information.
Besides local intent detection beyond 400 commands, there are great local STT options that work better than most cloud STTs for "random speech":
https://github.com/alphacep/vosk-api https://picovoice.ai/platform/cheetah/
I found Azure speech recognition to be fantastic, almost never making mistakes. The latency is also at a level that only the big cloud providers can reach. A locally run alternative I use is Vosk [0], but it is nowhere near as polished as Azure speech recognition and limits conversation to simple topics. (Running whisper.cpp locally is not an option for me: too heavy and slow on my machine for a proper conversation.)
The default Azure models available for text-to-speech are great too. There are around 500 models in a wide variety of languages. Using SSML [1] can also really improve the quality of interactions. A subset of these voices supports extra capabilities (like responding with emotions; see 'Speaking styles and roles').
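For illustration, a minimal SSML sketch along the lines of what the Azure docs describe (the voice name and style here are examples, not a recommendation; availability varies by voice and region):

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <mstts:express-as style="cheerful">
      Great, your order is on its way!
    </mstts:express-as>
  </voice>
</speak>
```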
Though in my opinion the default Azure voice models have nothing on what OP is providing. The Scarlett Johansson voice is really really good, especially combined with the personality they have given it. I would love to be able to run this model locally on my machine if OP is willing to share some information about it!
Maybe OP could improve the latency of Banterai by dynamically setting the Azure region for speech recognition based on the incoming IP. I see that 'eastus' is used even though I'm in West Europe.
But other than that I think this is the best 'speech-to-speech AI' demo I've seen so far. Fantastic job!
[0] https://github.com/alphacep/vosk-api/
[1] https://learn.microsoft.com/en-us/azure/cognitive-services/s...
Your app is neat in that it can record from the Lock Screen. I was curious to try out the new OpenAI model.
Too often, iOS has a problem of too many clicks to do the most basic of things.
https://github.com/alphacep/vosk-api
Still, I've found that the big players have much better recognition models, and the post-processing that I assume they do (grammatical, maybe syntactical inferences that improve the end result) is probably much more powerful too.
https://github.com/ideasman42/nerd-dictation
Get speech models here:
https://github.com/alphacep/vosk-api
Here's another project that builds on top of VOSK to provide a tighter integration with Linux: https://github.com/ideasman42/nerd-dictation
Does anyone have experience using it? Is it any good?
E.g. some very active projects are:
* Kaldi (https://github.com/kaldi-asr/kaldi/) obviously, probably the most famous and most mature one. For standard hybrid NN-HMM models and also their more recent lattice-free MMI (LF-MMI) models and training procedure. This is also heavily used in industry (not just research).
* ESPnet (https://github.com/espnet/espnet), for all kind of end-to-end models, like CTC, attention-based encoder-decoder (including Transformer), and transducer models.
* Espresso (https://github.com/freewym/espresso).
* Google Lingvo (https://github.com/tensorflow/lingvo). This is the open source release of Google's internal ASR system, and used by Google in production (their internal version of it, which is not too different).
* NVIDIA OpenSeq2Seq (https://github.com/NVIDIA/OpenSeq2Seq).
* Facebook Fairseq (https://github.com/pytorch/fairseq). Attention-based encoder-decoder models mostly.
* Facebook wav2letter (https://github.com/facebookresearch/wav2letter). ASG model/training.
* Vosk (https://github.com/alphacep/vosk-api). Offline lightweight speech recognition API with support for 10 languages.
And there are many more.
Other good ones are https://github.com/daanzu/kaldi-active-grammar and https://talonvoice.com/
There are toolkits for research like https://github.com/kaldi-asr/kaldi, https://github.com/espnet/espnet, wav2letter, Espresso, Nvidia/Nemo, https://github.com/didi/athena. You can try them too if you want to go deep. Some of them have interesting capabilities.
However, kaldi-active-grammar specializes in real time command and control, with advanced features that don't really apply to your use case. Vosk [1] is probably a simpler, better fit for you. It likewise uses Kaldi and can use my models, and offers some others of its own as well.
Neither is particularly focused on transcription per se, but they are open.
There are a few good OSS offline deep speech libraries including Mozilla DeepSpeech [3], but their resource footprint is too high. We settled on the currently less mature vosk [4], which is based on Kaldi [5] (a more popular deep speech pipeline), and includes a number of low-footprint, pretrained language models for real-time streaming inference. Research has shown how to deploy efficient deep speech models on CPUs [6], so we're hoping those gains will translate to faster performance on commodity laptops soon. You can follow this issue [7] for updates on our progress. Contributions are welcome!
[1]: https://github.com/OpenASR/idear/
[2]: https://cmusphinx.github.io/
[3]: https://github.com/mozilla/DeepSpeech
[4]: https://github.com/alphacep/vosk-api
[5]: https://github.com/kaldi-asr/kaldi
[6]: https://ai.facebook.com/blog/a-highly-efficient-real-time-te...
https://github.com/alphacep/vosk-api
Advantages are:
1) Supports 7 languages - English, German, French, Spanish, Portuguese, Chinese, Russian
2) Works offline even on lightweight devices - Raspberry Pi, Android, iOS
3) Install it with a simple `pip install vosk`
4) Model size per language is just 50 MB
5) Provides a streaming API for the best user experience (unlike the popular SpeechRecognition Python package)
6) There are APIs for other languages too: Java, C#, etc.
7) Allows quick reconfiguration of the vocabulary for best accuracy.
8) Supports speaker identification besides plain speech recognition
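As a sketch of what that streaming API looks like in Python: the helper below and its chunk handling are illustrative, but `KaldiRecognizer`'s `AcceptWaveform`/`Result`/`FinalResult` are the real vosk calls.

```python
import json

def stream_transcribe(recognizer, chunks):
    # Feed raw 16-bit PCM audio chunks to a Vosk-style recognizer and
    # collect the finalized utterances from its streaming API.
    texts = []
    for chunk in chunks:
        if recognizer.AcceptWaveform(chunk):  # True once an utterance ends
            texts.append(json.loads(recognizer.Result())["text"])
    texts.append(json.loads(recognizer.FinalResult())["text"])
    return " ".join(t for t in texts if t)

# Real use (requires `pip install vosk` and a downloaded model):
#   from vosk import Model, KaldiRecognizer
#   rec = KaldiRecognizer(Model("model"), 16000)
#   # Point 7's vocabulary restriction is done by passing a grammar,
#   # a JSON list of phrases plus "[unk]" for everything else:
#   # rec = KaldiRecognizer(Model("model"), 16000,
#   #                       '["turn on", "turn off", "[unk]"]')
```

The recognizer keeps state across chunks, which is what makes partial, low-latency results possible compared to batch-style APIs.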