What does HackerNews think of vosk-api?

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node

Language: Jupyter Notebook

#6 in Android
#4 in iOS
#19 in Python
#2 in Raspberry Pi
First, good initiative! Thanks for sharing. I think you have to be more diligent and careful with the problem statement, though.

Checking the weather in Sofia, Bulgaria requires the cloud because it needs current information; it's not "random speech". Capability issues in ESP SR don't mean that you cannot process it locally.

The comment was about "voice processing", i.e. sending speech to the cloud, not about sending a request to fetch the weather information.

Besides local intent detection (beyond 400 commands), there are great local STT options that work better than most cloud STTs for "random speech":

https://github.com/alphacep/vosk-api https://picovoice.ai/platform/cheetah/

I've been using the Azure Cognitive Services speech recognition and text-to-speech for my own locally run 'speech-to-speech' GPT assistant application.

I found the Azure speech recognition to be fantastic, almost never making mistakes. The latency is also at a level that only the big cloud providers can reach. A locally run alternative I use is Vosk [0], but it is nowhere near as polished as Azure speech recognition and limits conversation to simple topics. (Running whisper.cpp locally is not an option for me; it's too heavy and slow on my machine for a proper conversation.)

The default Azure models available for text-to-speech are great too. There are around 500 models in a wide variety of languages. Using SSML [1] can also really improve the quality of interactions. A subset of these voices has certain capabilities (like responding with emotions; see 'Speaking styles and roles').

Though in my opinion the default Azure voice models have nothing on what OP is providing. The Scarlett Johansson voice is really really good, especially combined with the personality they have given it. I would love to be able to run this model locally on my machine if OP is willing to share some information about it!

Maybe OP could improve the latency of Banterai by dynamically setting the Azure region for speech recognition based on the incoming IP. I see that 'eastus' is used even though I'm in West Europe.

But other than that I think this is the best 'speech-to-speech AI' demo I've seen so far. Fantastic job!

[0] https://github.com/alphacep/vosk-api/

[1] https://learn.microsoft.com/en-us/azure/cognitive-services/s...
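For anyone curious what that loop roughly looks like, here is a minimal sketch using the azure-cognitiveservices-speech Python SDK. The key, region, voice name, and the stubbed assistant reply are placeholders, and error handling is omitted:

    import azure.cognitiveservices.speech as speechsdk  # pip install azure-cognitiveservices-speech

    # Placeholders: use your own key and pick a region close to the user (see the latency point above).
    speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="westeurope")
    speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"  # one of the prebuilt neural voices

    # Speech-to-text from the default microphone.
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
    result = recognizer.recognize_once()
    print("Heard:", result.text)

    # ...send result.text to the language model of your choice, then speak the reply...
    reply = "This is where the assistant's answer would go."
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
    synthesizer.speak_text_async(reply).get()  # use speak_ssml_async for SSML-tuned output [1]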

DeepSpeech is very old software. Vosk works just fine: https://github.com/alphacep/vosk-api. People even run tiny Whisper on a Pi, though they have to wait ages.

Your app is neat in that it can record from the Lock Screen. I was curious to try out the new OpenAI model.

Too often, iOS has a problem of too many clicks to do the most basic of things.

I wonder if you could train some machine learning model using the data from SponsorBlock and achieve good results on podcasts as well. That way you wouldn't be dependent on a crowdsourced online database for your offline listening. Alternatively, even creating a transcript using something like [1] and scanning for words like "sponsor", "ad" or specific company names might already be a good enough heuristic.

[1] https://github.com/alphacep/vosk-api
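As a rough illustration of that keyword heuristic, here is a sketch using the Vosk Python bindings from [1]. The model path, WAV file, and keyword list are placeholders, and the audio is assumed to be 16 kHz mono PCM:

    import json
    import wave

    from vosk import Model, KaldiRecognizer  # pip install vosk

    AD_KEYWORDS = {"sponsor", "sponsored", "ad", "advertisement", "promo"}  # hypothetical list

    wf = wave.open("podcast_episode.wav", "rb")                 # placeholder file
    rec = KaldiRecognizer(Model("path/to/vosk-model"), wf.getframerate())
    rec.SetWords(True)                                          # include per-word timestamps in results

    hits = []
    while True:
        chunk = wf.readframes(4000)
        if not chunk:
            break
        if rec.AcceptWaveform(chunk):
            for w in json.loads(rec.Result()).get("result", []):
                if w["word"].lower() in AD_KEYWORDS:
                    hits.append((w["start"], w["word"]))        # start time in seconds

    for start, word in hits:
        print(f"possible sponsor segment near {start:.1f}s (keyword: {word})")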

They might be interested in integrating Vosk, a speech-to-text engine that is just a shared library (a .so file on Linux) and comes with API support for a variety of languages:

https://alphacephei.com/vosk/

https://github.com/alphacep/vosk-api
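A minimal sketch of what calling the Python binding looks like; the model path and WAV file are placeholders, and 16 kHz mono PCM audio is assumed (everything runs locally):

    import json
    import wave

    from vosk import Model, KaldiRecognizer  # pip install vosk

    wf = wave.open("recording.wav", "rb")     # placeholder file
    rec = KaldiRecognizer(Model("path/to/vosk-model"), wf.getframerate())

    pieces = []
    while True:
        chunk = wf.readframes(4000)
        if not chunk:
            break
        if rec.AcceptWaveform(chunk):                       # a full utterance was decoded
            pieces.append(json.loads(rec.Result())["text"])
    pieces.append(json.loads(rec.FinalResult())["text"])    # flush the last partial utterance
    print(" ".join(pieces))

The Java, C# and Node APIs follow the same pattern around the shared library.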

Still, I've found that the big players have much better recognition models, and the post-processing that I assume they do (grammatical, maybe syntactic inference that improves the end result) is probably much more powerful too.

Nerd-dictation is a purely on-device speech-to-text program that works pretty well if your computer is fast enough.

https://github.com/ideasman42/nerd-dictation

get speech models here:

https://github.com/alphacep/vosk-api

HN discussion:

https://news.ycombinator.com/item?id=29972579

In case it's of interest, when I last explored this topic in terms of the Free/Open Source ecosystem I was very impressed with how well VOSK-API performed: https://github.com/alphacep/vosk-api

Here's another project that builds on top of VOSK to provide a tighter integration with Linux: https://github.com/ideasman42/nerd-dictation

I've never even heard of VOSK-API [0], the underlying offline speech-to-text engine that this project uses.

Does anyone have experience using it? Is it any good?

[0] https://github.com/alphacep/vosk-api

You could integrate vosk for local on-device private transcription. https://github.com/alphacep/vosk-api
Jasper is very old and not very accurate. For offline recognition on RPi, try something like Vosk: https://github.com/alphacep/vosk-api
Not a big problem given the many alternatives around.

E.g., some very active projects are:

* Kaldi (https://github.com/kaldi-asr/kaldi/), obviously; probably the most famous and most mature one. It covers standard hybrid NN-HMM models and also all their more recent lattice-free MMI (LF-MMI) models and training procedure. This is also heavily used in industry (not just research).

* ESPnet (https://github.com/espnet/espnet), for all kind of end-to-end models, like CTC, attention-based encoder-decoder (including Transformer), and transducer models.

* Espresso (https://github.com/freewym/espresso).

* Google Lingvo (https://github.com/tensorflow/lingvo). This is the open source release of Google's internal ASR system, and it is used by Google in production (their internal version of it, which is not too different).

* NVIDIA OpenSeq2Seq (https://github.com/NVIDIA/OpenSeq2Seq).

* Facebook Fairseq (https://github.com/pytorch/fairseq). Attention-based encoder-decoder models mostly.

* Facebook wav2letter (https://github.com/facebookresearch/wav2letter). ASG model/training.

* Vosk (https://github.com/alphacep/vosk-api). Offline lightweight speech recognition API with support for 10 languages.

And there are many more.

Try https://github.com/alphacep/vosk-api. It supports 10 languages, works on Android and RPi, and also has bigger, more accurate server models.

Other good ones are https://github.com/daanzu/kaldi-active-grammar and https://talonvoice.com/

There are toolkits for research like https://github.com/kaldi-asr/kaldi, https://github.com/espnet/espnet, wav2letter, Espresso, Nvidia/Nemo, https://github.com/didi/athena. You can try them too if you want to go deep. Some of them have interesting capabilities.

I develop kaldi-active-grammar [0]. The Kaldi engine itself is state of the art and open source, but is focused on research rather than usability. My project has a simple interface and comes with a pretty good open source speech model.

However, kaldi-active-grammar specializes in real time command and control, with advanced features that don't really apply to your use case. Vosk [1] is probably a simpler, better fit for you. It likewise uses Kaldi and can use my models, and offers some others of its own as well.

Neither are particularly focused on transcription per se, but they are open.

[0] https://github.com/daanzu/kaldi-active-grammar

[1] https://github.com/alphacep/vosk-api

Shameless plug, but I have been working on an open source IDE plugin [1] for the IntelliJ Platform which attempts to do this. Previously, we used an older HMM-based speech toolkit called CMUSphinx [2], but are currently transitioning to a deep speech recognition system. We also tried a number of cloud APIs including Amazon Lex and Google Cloud Speech, but they were too slow -- offline STT is really important for low latency UX applications. For navigation and voice typing, we need something customizable and fairly responsive. Custom grammars would be nice for various contexts and programming languages.

There are a few good OSS offline deep speech libraries including Mozilla DeepSpeech [3], but their resource footprint is too high. We settled on the currently less mature vosk [4], which is based on Kaldi [5] (a more popular deep speech pipeline), and includes a number of low-footprint, pretrained language models for real-time streaming inference. Research has shown how to deploy efficient deep speech models on CPUs [6], so we're hoping those gains will translate to faster performance on commodity laptops soon. You can follow this issue [7] for updates on our progress. Contributions are welcome!

[1]: https://github.com/OpenASR/idear/

[2]: https://cmusphinx.github.io/

[3]: https://github.com/mozilla/DeepSpeech

[4]: https://github.com/alphacep/vosk-api

[5]: https://github.com/kaldi-asr/kaldi

[6]: https://ai.facebook.com/blog/a-highly-efficient-real-time-te...

[7]: https://github.com/OpenASR/idear/issues/52

Would be nice to test something open source alongside, like https://github.com/alphacep/vosk-api, which runs on Android and iPhone offline.
You are welcome to try Vosk

https://github.com/alphacep/vosk-api

Advantages are:

1) Supports 7 languages - English, German, French, Spanish, Portuguese, Chinese, Russian

2) Works offline even on lightweight devices - Raspberry Pi, Android, iOS

3) Install it with simple `pip install vosk`

4) Model size per language is just 50 MB

5) Provides a streaming API for the best user experience (unlike the popular SpeechRecognition Python package); see the sketch after this list

6) There are APIs for other programming languages too (Java, C#, etc.)

7) Allows quick reconfiguration of vocabulary for best accuracy.

8) Supports speaker identification besides plain speech recognition
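A minimal sketch of points 5 and 7, assuming the sounddevice package for microphone capture (not part of Vosk) and placeholder values for the model path and command list:

    import json
    import queue

    import sounddevice as sd                   # pip install sounddevice (assumed; not part of vosk)
    from vosk import Model, KaldiRecognizer    # pip install vosk

    audio_q = queue.Queue()

    def callback(indata, frames, time, status):
        audio_q.put(bytes(indata))             # hand raw 16-bit PCM chunks to the recognizer

    model = Model("path/to/vosk-model")        # placeholder model path
    # Optional third argument restricts the vocabulary to a small grammar (point 7).
    grammar = json.dumps(["turn on the light", "turn off the light", "[unk]"])
    rec = KaldiRecognizer(model, 16000, grammar)

    with sd.RawInputStream(samplerate=16000, blocksize=8000,
                           dtype="int16", channels=1, callback=callback):
        while True:
            data = audio_q.get()
            if rec.AcceptWaveform(data):
                print(json.loads(rec.Result())["text"])                      # final utterance result
            else:
                print(json.loads(rec.PartialResult())["partial"], end="\r")  # streaming partials (point 5)

The partial results are what make the interaction feel responsive; drop the grammar argument for open-vocabulary recognition.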

OK, if you want to start with Kaldi, it is probably easier to check kaldi-active-grammar, mentioned above, or https://github.com/alphacep/vosk-api