Picovoice here solves both problems: hotword/wake-word detection and intent extraction. It looks like something you could build on top of ARM's [keyword spotting program](https://github.com/ARM-software/ML-KWS-for-MCU) and the wake word services listed in [Rhasspy's docs](https://rhasspy.readthedocs.io/en/latest/wake-word/#raven).
But implementing something like this from scratch would take a good while.
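For a sense of what the Picovoice route looks like in practice, here is a rough sketch using their Porcupine Python SDK (`pvporcupine`) plus `pvrecorder` for microphone capture. The access key and keyword are placeholders, and this is only an illustration of the API shape, not a drop-in solution.

```python
import pvporcupine
from pvrecorder import PvRecorder

# Placeholder access key; Picovoice issues these from their console.
porcupine = pvporcupine.create(
    access_key="YOUR_ACCESS_KEY",
    keywords=["porcupine"],  # one of the built-in keywords
)

recorder = PvRecorder(frame_length=porcupine.frame_length)
recorder.start()

try:
    while True:
        pcm = recorder.read()            # one frame of 16-bit PCM samples
        if porcupine.process(pcm) >= 0:  # >= 0 means the wake word was spotted
            print("Wake word detected")
finally:
    recorder.stop()
    recorder.delete()
    porcupine.delete()
```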
Also, here are some Automatic Speech Recognition toolkits out there (which won't run offline on a microcontroller). These are useful for piping transcripts into a program that deals with intents, something like [RASA](https://rasa.com); a minimal sketch of that pipeline follows the lists below.
Require Internet:

* [Deepgram](https://deepgram.com) - I believe they build upon OpenAI's Whisper model and have their own custom models too
* Google Cloud / Microsoft Azure / AWS / IBM Watson
Can be run offline:

* [OpenAI's Whisper](https://github.com/openai/whisper)
* [NVIDIA's NeMo](https://github.com/NVIDIA/NeMo)
* [PaddleSpeech](https://github.com/PaddlePaddle/PaddleSpeech)
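To make the "pipe ASR into an intent handler" idea concrete, here is a minimal sketch: Whisper transcribes a local file, and the transcript is posted to a locally running Rasa server's `/model/parse` endpoint. The model size, file name, and server address are assumptions for illustration only.

```python
import whisper
import requests

model = whisper.load_model("base")         # runs locally, downloads weights once
result = model.transcribe("command.wav")   # local audio file
text = result["text"].strip()
print("Heard:", text)

# Hand the transcript to an intent extractor, e.g. a Rasa server running locally.
resp = requests.post("http://localhost:5005/model/parse", json={"text": text})
print(resp.json())  # intent name, confidence, entities
```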
When you see how complicated the space is and how many ways you can actually shoot yourself in the foot, this post starts to look a wee bit better.
For local TTS, Mozilla TTS was the best from a quality standpoint, but GPU inference support was a bit dicey and (possibly) not really supported at all.
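For what it's worth, the Mozilla TTS codebase lives on as Coqui TTS, and its Python API looks roughly like the sketch below. The model name and constructor arguments vary by release, so treat this as an assumption-laden illustration rather than gospel.

```python
from TTS.api import TTS

# Pretrained model name as published by Coqui; GPU inference is opt-in and
# was exactly the flaky part mentioned above.
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC", gpu=False)
tts.tts_to_file(text="The kettle has boiled.", file_path="out.wav")
```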
For more complex and bespoke applications, the Nvidia (I know, I know) NeMo toolkit [0] is very powerful but requires more effort than most to get up and running. In exchange, it lets you do very interesting things with additional training and all things speech.
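As a rough idea of the effort/payoff trade-off, loading one of NVIDIA's published pretrained checkpoints in NeMo looks something like this. The model name and argument details differ across NeMo releases; this is a sketch, not a recipe from the toolkit's docs.

```python
import nemo.collections.asr as nemo_asr

# Pull a published pretrained CTC checkpoint and transcribe a local file.
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="QuartzNet15x5Base-En"
)
transcripts = asr_model.transcribe(["sample.wav"])
print(transcripts[0])
```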
In the Nvidia world there's also their Riva [1] (formerly Jarvis) solution that works with Triton [2] to build out an architecture for extremely performant and high-scale speech applications with things like model management, revision control, deployment, etc.
[0] https://github.com/NVIDIA/NeMo
[1] https://developer.nvidia.com/riva
[2] https://developer.nvidia.com/nvidia-triton-inference-server