I love seeing lots of practical refutations of the "we have to do the voice processing in the cloud for performance" rationales peddled by the various home 1984 surveillance box vendors.

It's actually faster to do it locally. They want it tethered to the cloud for surveillance.

We can do either.

For "basic" command recognition the ESP SR (speech recognition) library supports up to 400 defined speech commands that run completely on the device. For most people this is plenty to control devices around the home, etc. Because it is all local it's extremely fast - as I said in another comment pushing "Did that really just happen?" fast.

However, for cases where someone wants to throw any kind of random speech at it ("Hey Willow, what is the weather in Sofia, Bulgaria?"), that's probably beyond the fundamental capabilities of a device with enclosure, display, mics, etc. that sells for $50.

That's why we plan to support any of the STT/TTS modules provided by Home Assistant, running on local Raspberry Pis or wherever people host HA. Additionally, we're open sourcing our extremely fast, highly optimized Whisper/LLM/TTS inference server next week so people can self-host it wherever they want.
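For a rough idea of what self-hosted STT looks like, here's a minimal sketch using the open-source openai-whisper package (to be clear, this is not our inference server; the model size and file path are just placeholders):

```python
# Minimal self-hosted Whisper STT sketch using the openai-whisper package.
# pip install openai-whisper  (ffmpeg must also be installed on the system)
import whisper

# "base" is a placeholder; larger models trade speed for accuracy.
model = whisper.load_model("base")

# Transcribe a local audio file; the audio never leaves the machine.
result = model.transcribe("command.wav")
print(result["text"])
```

The idea with our server is the same kind of pipeline, self-hosted on your own hardware, so a $50 device doesn't have to do the heavy lifting itself.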

first, good initiative! thanks for sharing. i think you gotta be more diligent and careful with the problem statement.

checking the weather in Sofia, Bulgaria requires the cloud because the information is current, not because it's "random speech". the capability limits of ESP SR don't mean that you cannot process that speech locally.

the original comment was about "voice processing", i.e. sending the speech itself to the cloud, not about sending an API request to fetch the weather data.
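to make the split concrete, here's a rough sketch (open-meteo is used just as an example data source, and the intent matching is deliberately naive): the speech is transcribed and understood entirely locally, and the only thing that leaves the network is a tiny structured request for the current weather data.

```python
# rough sketch: speech processing stays local, only a small data request goes out
import json
import urllib.request

def handle_utterance(text: str) -> str:
    # naive local "intent detection" on already-transcribed text
    if "weather" in text.lower() and "sofia" in text.lower():
        # the only outbound traffic: a small HTTP request for current data
        url = ("https://api.open-meteo.com/v1/forecast"
               "?latitude=42.70&longitude=23.32&current_weather=true")
        with urllib.request.urlopen(url) as resp:
            current = json.load(resp)["current_weather"]
        return f"it is {current['temperature']} degrees in sofia right now"
    return "sorry, i don't know that one"

print(handle_utterance("what is the weather in Sofia, Bulgaria"))
```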

besides, for local intent detection beyond 400 commands, there are great local STT options that work better than most cloud STTs on "random speech":

https://github.com/alphacep/vosk-api

https://picovoice.ai/platform/cheetah/
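e.g. a quick vosk sketch, following the vosk-api python examples (the wav file and model directory are placeholders; grab a model from the vosk site):

```python
# fully offline "random speech" recognition with vosk
# pip install vosk  (and download a model, e.g. vosk-model-small-en-us-0.15)
import json
import wave

from vosk import Model, KaldiRecognizer

wf = wave.open("command.wav", "rb")           # 16 kHz mono PCM WAV
model = Model("vosk-model-small-en-us-0.15")  # path to the downloaded model
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)                  # feed audio chunk by chunk

print(json.loads(rec.FinalResult())["text"])  # final transcript, all local
```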