What does HackerNews think of willow?
Open source, local, and self-hosted Amazon Echo/Google Home competitive Voice Assistant alternative
Purchased one of the ESP32 boxes, but I haven't had the time to set everything up yet.
First of all, we love seeing efforts like this and we'd love to work together with other open source voice user interface projects! There's plenty of work to do in the space...
I have roughly two decades of experience with voice, and one thing to keep in mind is how latency-sensitive voice tasks are. Generally speaking, when it comes to conversational audio, people have very high expectations regarding interactivity. For example, in the VoIP world we know that conversation between people starts getting annoying at around 300ms of latency. Higher latencies for voice assistant tasks are more-or-less "tolerated", but latency still needs to be extremely low. Alexa/Echo (with all of its problems) is at least a decent benchmark for what people expect in terms of interactivity, and all things considered it does pretty well.
I know you're early (we are too!) but in your demo I counted roughly six seconds of latency between the initial hello and the response (and nearly 10 seconds for "tell me a joke"). In terms of conversational voice this feels like an eternity. Again, no shade at all (believe me, I understand more than most), just something I thought I'd add from my decades of experience with humans and voice. This is why we place such heavy emphasis on reducing latency as much as possible.
For an idea of just how much we emphasize this, you can try our WebRTC demo[1], which can do end-to-end (from clicking stop record in the browser to ASR response) in a few hundred milliseconds (with Whisper large-v2 and beam size 5 - medium/beam size 1 is a fraction of that), including internet latency (it's hosted in Chicago, FYI).
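To make that model/beam size tradeoff concrete, here is a rough sketch of what such an ASR call looks like locally with faster-whisper. This is purely illustrative - it is not the WIS code; the model names and beam_size parameter are standard faster-whisper, and the clip.wav file is assumed:

    # Illustrative sketch only, not the actual WIS implementation.
    # Compares Whisper large-v2 / beam 5 against medium / beam 1 on the same clip.
    import time
    from faster_whisper import WhisperModel

    for model_name, beam in (("large-v2", 5), ("medium", 1)):
        model = WhisperModel(model_name, device="cuda", compute_type="float16")
        start = time.perf_counter()
        segments, _info = model.transcribe("clip.wav", beam_size=beam)
        text = " ".join(seg.text for seg in segments)  # segments is a generator; consuming it runs inference
        print(f"{model_name} / beam {beam}: {time.perf_counter() - start:.2f}s -> {text.strip()}")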
Running locally with WIS and Willow we see less than 500ms from end of speech (with on-device VAD) to command execution completion and TTS response with platforms like Home Assistant. Granted, this is with a GPU, so you could call it cheating, but a $100 six-year-old Nvidia Pascal series GPU runs circles around the fastest CPUs for these tasks (STT and TTS - see benchmarks here[2]). Again, kind of cheating, but my RTX 3090 at home drops this down to around 200ms - roughly half of that time is Home Assistant. It's my (somewhat controversial) personal opinion that GPUs are more-or-less a requirement (today) for Alexa/Echo-competitive responsiveness.
Speaking of latency, I've been noticing a trend with Willow users regarding LLMs - they are very neat, cool, and interesting (our inference server[3] supports LLaMA-based LLMs), but they really aren't the right tool for these kinds of tasks. They have very high memory requirements (relatively speaking), require a lot of compute, and are very slow (again, relatively speaking). They also don't natively support the kind of API call/response you need for most voice tasks. There are efforts out there to support this with LLMs, but frankly I find the overall approach very strange. It seems that LLMs have sucked a lot of oxygen out of the room and people have forgotten (or never heard of) "good old fashioned" NLU/NLP approaches.
Have you considered an NLU/NLP engine like Rasa[4]? This is the approach we will be taking to implement this kind of functionality in a flexible, assistant platform/integration-agnostic way. By the time you stack up VAD, STT, understanding user intent (while allowing flexible grammar), calling an API, execution, and the TTS response, latency starts to add up very, very quickly.
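For anyone who hasn't used an NLU engine, the core idea is intent classification plus slot extraction. Here's a deliberately tiny, hypothetical Python stand-in for that idea (Rasa itself trains a real classifier from example utterances rather than matching regexes, and none of this is Willow code):

    # Toy illustration of intent matching - not Rasa, not Willow.
    import re

    INTENTS = {
        "tell_joke": re.compile(r"\b(tell me a joke|make me laugh)\b", re.I),
        "turn_on":   re.compile(r"\bturn on (?P<entity>.+)", re.I),
        "set_timer": re.compile(r"\bset a timer for (?P<duration>.+)", re.I),
    }

    def classify(utterance: str):
        for intent, pattern in INTENTS.items():
            match = pattern.search(utterance)
            if match:
                return intent, match.groupdict()  # intent name plus any extracted slots
        return "fallback", {}  # e.g. hand off to an LLM or a canned "I didn't get that"

    print(classify("turn on the kitchen lights"))  # ('turn_on', {'entity': 'the kitchen lights'})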
As one example, for "tell me a joke" Alexa responds in a few hundred milliseconds, and I guarantee they're not using an LLM for this task - you can have a couple of hundred jokes to randomly select from, with pre-generated TTS responses cached (as one path). Again, this is the approach we are taking to "catch up" with Alexa for all kinds of things, from jokes to creating calendar entries, etc. Of course you can still have a catch-all to hand off to an LLM for "conversation", but I'm not sure users actually want this for voice.
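As a sketch of that cached-response path (the file layout and names here are hypothetical, just to show the shape of it):

    # Hypothetical sketch of the "canned jokes + pre-generated TTS" path.
    import random
    from pathlib import Path

    JOKE_AUDIO_DIR = Path("tts_cache/jokes")  # one pre-generated TTS clip per joke

    def handle_tell_joke() -> Path:
        # No LLM and no TTS synthesis at request time - just pick a cached clip.
        clips = list(JOKE_AUDIO_DIR.glob("*.wav"))
        return random.choice(clips)

    # The selected clip is then streamed straight back to the device for playback.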
I may be misunderstanding your goals but just a few things I thought I would mention.
[0] - https://github.com/toverainc/willow
[1] - https://wisng.tovera.io/rtc/
[2] - https://github.com/toverainc/willow-inference-server/tree/wi...
[3] - https://github.com/toverainc/willow-inference-server
[4] - https://rasa.com/
What the ESP BOX is capable of (especially considering the price point) is nothing short of incredible: wake word activation, audio processing (AGC, AEC, etc.), audio streaming, even on-device speech recognition for up to 400 commands with Multinet. All remarkably performant, easily besting Alexa/Echo in terms of interactivity and response time (even when using an inference server across the internet for speech recognition).
Sure, we're down in ESP-IDF land and managing everything we have going on in FreeRTOS is a pain, but that's nothing you wouldn't have on any other microcontroller. We're also doing a lot, all things considered; generally speaking we "just" throw/pin audio tasks (with varying priority) on core 1 while more-or-less dumping everything else on core 0. Seems to be working well so far!
In short you can:
1) Run a local Willow Inference Server[1]. It supports CPU or CUDA and is just about the fastest implementation of Whisper out there for "real time" speech.
2) Run local command detection on device. We pull your Home Assistant entities on setup and define a basic grammar for them, but any English commands (up to 400) are supported. They are recognized directly on the $50 ESP BOX device and sent to Home Assistant (or openHAB, or a REST endpoint, etc.) for processing (see the sketch after this list).
Whether WIS or local, our performance target is 500ms from end of speech to command executed.
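To give a feel for item 2, here is a rough, hypothetical sketch of pulling entities from Home Assistant's REST API and turning them into simple on/off command phrases. The /api/states endpoint and Bearer-token auth are standard Home Assistant; the rest (URL, token, entity filtering) is illustrative and not Willow's actual code:

    # Hypothetical sketch: build a simple command grammar from Home Assistant entities.
    import requests

    HA_URL = "http://homeassistant.local:8123"   # assumed address
    TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"       # created under your HA user profile

    resp = requests.get(
        f"{HA_URL}/api/states",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()

    commands = []
    for state in resp.json():
        if state["entity_id"].startswith(("light.", "switch.")):
            name = state["attributes"].get("friendly_name", state["entity_id"])
            commands.append(f"turn on {name}")
            commands.append(f"turn off {name}")

    # Multinet handles up to 400 on-device commands, so cap the list there.
    print(commands[:400])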
People complain about the “Nvidia tax”, but the hardware is superior (untouchable at datacenter scale), and the “tax” turns into a dividend as soon as your (very expensive) team spends hour after hour (week after week) dealing with issues on other platforms, while anything based on CUDA is often a Docker pull away with absolutely first-class support in every ML framework.
Nvidia gets a lot of shade on HN and elsewhere but if you’ve spent any time in this field you completely understand why they have 80-90% market share of GPGPU. With Willow[0] and the Willow Inference Server[1] I'm often asked by users with no experience in the space why we don't target AMD, Coral TPUs (don't even get me started), etc. It's almost impossible to understand "why CUDA" unless you've fought these battles and spent time with "alternatives".
I’ve been active in the space for roughly half a decade, and when I look back to my early days I’m amazed at what a beginner like me was able to do because of CUDA. I still routinely am. What you’re able to actually accomplish with a $1000 Nvidia card and a few lines with transformers and/or a Docker container is incredible.
That said I am really looking forward to Apple stepping it up here - I’ve given up on AMD ever getting it together on GPGPU and Intel (with Arc) is even further behind. The space needs some real competition somewhere.
Paired with projects like https://github.com/toverainc/willow, I think you could recreate the Enterprise at home if you have enough IoT gear.