What does HackerNews think of whisper.cpp?

Port of OpenAI's Whisper model in C/C++

Language: C

https://github.com/ggerganov/whisper.cpp

https://github.com/Const-me/Whisper

I had fun with both of these. They will both do realtime transcription. But you will have to download the model weights…

It runs locally, using Whisper.cpp[1], a Whisper implementation optimized to run on CPU, especially Apple Silicon.

Whisper itself is open source, and so is this implementation; the OpenAI endpoint is merely a convenience for those who don't wish to host a Whisper server themselves, deal with batching, rent GPUs, etc. If you're building a commercial service based on Whisper, the API might be worth it for the convenience, but if you're running it personally and have a good enough machine (an M1 MacBook Air will do), running it locally is usually better.

[1] https://github.com/ggerganov/whisper.cpp

You can use Whisper to transcribe the audio to text locally on the Mac.

There is a great open-source implementation named whisper.cpp, along with a few graphical user interfaces for it:

https://github.com/ggerganov/whisper.cpp

https://sindresorhus.com/aiko

https://goodsnooze.gumroad.com/l/macwhisper

Personally I use MacWhisper Pro because it’s very convenient.

whisper.ai is apparently something completely different. I’m pretty sure OP meant OpenAI’s Whisper [0], which I think is mainly used via whisper.cpp [1].

[0]: https://github.com/openai/whisper

[1]: https://github.com/ggerganov/whisper.cpp

Thanks!

Not sure if it'll work on an Arduino, but maybe take a look at https://github.com/ggerganov/whisper.cpp -- it works on a Raspberry Pi at least, so resource requirements are fairly minimal

This looks awesome. My only nitpick: I would suggest integrating transcription with whisper.cpp [1], which in my simple CPU-based tests (likely representative of most of your user base) runs much, much faster than OpenAI's Whisper.

[1] https://github.com/ggerganov/whisper.cpp

> Some people have already had success porting Whisper to the Neural Engine, and as of 14 hours ago GGerganov (the guy who made this port of LLaMA to the Neural Engine and who made the port of Whisper to C++) posted a GitHub comment indicating he will be working on that in the next few weeks.

He has already done great work here: https://github.com/ggerganov/whisper.cpp

I'm a huge fan of Georgi (the author)! You should also check out his other work, bringing Apple Silicon support to OpenAI's Whisper (speech-to-text model): https://github.com/ggerganov/whisper.cpp
Super cool project. This is from the author of whisper.cpp, which enables highly accurate real-time audio transcription on the M1/M2:

https://github.com/ggerganov/whisper.cpp

Just download whisper ....

If you own a GPU, use this one: https://github.com/openai/whisper

If you don't own a GPU, use this one: https://github.com/ggerganov/whisper.cpp (though it is very, very slow by comparison)

I've run Whisper locally via [1] with one of the medium-sized models and it was damn good at transcribing audio from a video of two people having a conversation.

I don't know exactly what the use case is where people would need to run this via the API; the compute demands aren't huge (I used CPU only, on an M1) and the memory requirements aren't much either.

[1] https://github.com/ggerganov/whisper.cpp

I recently tried a number of options for streaming STT. Because my use case was very sensitive to latency, I ultimately went with https://deepgram.com/ - but https://github.com/ggerganov/whisper.cpp provided a great stepping stone while prototyping a streaming use case locally on a laptop.
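For anyone else prototyping streaming locally: if I recall correctly, the whisper.cpp repo ships a `stream` example that transcribes microphone input in near-real-time. A rough sketch of the invocation, with the model path and parameters as shown in the repo's README at the time I tried it, so double-check them there:

```shell
# Build the streaming example (microphone capture requires SDL2)
make stream

# Transcribe the microphone in near-real-time with the English base model:
# -t 8           use 8 CPU threads
# --step 500     process new audio every 500 ms
# --length 5000  keep a 5-second sliding context window
./stream -m ./models/ggml-base.en.bin -t 8 --step 500 --length 5000
```

Shrinking `--step` lowers latency at the cost of more repeated re-decoding of the sliding window.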
You can run Whisper in WASM (locally) so no need to pay for the API, plus the bandwidth. It actually works surprisingly well: https://github.com/ggerganov/whisper.cpp
While doing my PhD some years ago (it wasn't a PhD on AI, but very much related) I trained several models with the usual stack back then (PyTorch, plus some in TensorFlow). I realized that a lot of this stack could be rewritten in much simpler terms without sacrificing much fidelity or performance in the end.

Submissions like yours, and other projects like this one (recently featured here as well) -> https://github.com/ggerganov/whisper.cpp, make it pretty clear to me that this intuition is correct.

There are a couple of tools I created back then that could push things further in this direction. Unfortunately they're not mature enough to warrant a release, but the ideas behind them are worth a look (IMHO) and I'll be happy to share them. If there's interest on your side (or from anyone reading this thread), I'd love to talk more about it.

> vs the serverside systems

I believe this runs client side, but whether it counts as open source is likely open for debate:

https://github.com/ggerganov/whisper.cpp

A pet hate of mine is voice notes in WhatsApp or Telegram. Quite often they remain unheard for hours, because the call to action (the notification) doesn't let me see what I need to react to, or because I'm in meetings and can't listen for a period of time.

There are paid services that can transcribe speech to text, but I couldn't find any free ones. With the release of Whisper, this became something I thought could be solved with some minimal coding.

While Whisper relies on GPUs, whisper.cpp does not and can run on a CPU with 1 GB of RAM (about 500 MB of which goes to the model). Enter the Pi 4.

I wrote a Telegram bot in Python, using python-telegram-bot, which calls whisper.cpp to transcribe speech to text. My bot is open to all, but you could start your own: with a Pi 4 and an always-up connection, you can leave it running for whenever you need it.

Due to the constraints of the Pi 4, it only runs the English model and may produce errors for other languages.
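For anyone wanting to build a similar bot, the core shell pipeline is roughly this (filenames are hypothetical; Telegram delivers voice notes as Opus audio, while whisper.cpp expects 16-bit, 16 kHz, mono WAV):

```shell
# Hypothetical filenames. Telegram voice notes arrive as Opus (.oga),
# but whisper.cpp wants 16-bit, 16 kHz, mono WAV, so convert first:
ffmpeg -i note.oga -ar 16000 -ac 1 -c:a pcm_s16le note.wav

# Transcribe with the English base model; -otxt writes the transcript
# to note.wav.txt alongside the input file
./main -m models/ggml-base.en.bin -f note.wav -otxt
cat note.wav.txt
```

The bot then just needs to download the voice note, run this pipeline via subprocess, and reply with the contents of the .txt file.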

Check out my bot here: https://web.telegram.org/k/#@shhhhhhhhhhhhhhhhh_bot

Check out Whisper here: https://openai.com/blog/whisper/

Check out whisper.cpp here: https://github.com/ggerganov/whisper.cpp

There are various size options for the models; the smaller ones trade accuracy for higher performance.
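As a sketch (model names and approximate sizes as listed in the whisper.cpp README, so double-check them there), the repo's helper script downloads any of the converted ggml models by name:

```shell
# Models range from tiny (~75 MB on disk) up to large (a few GB);
# larger models are more accurate but slower and need more RAM.
bash ./models/download-ggml-model.sh base.en

# Transcribe the sample audio bundled with the repo
./main -m models/ggml-base.en.bin -f samples/jfk.wav
```

Swapping `base.en` for `tiny.en`, `small`, `medium`, or `large` moves along the accuracy/performance trade-off.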

There is also a C++ re-implementation that performs well and can definitely transcribe in realtime on many machines: https://github.com/ggerganov/whisper.cpp

Read all the leading papers, many times, to get a deep understanding. The writing quality is usually pretty low, but the information density can be very high, and you'll probably miss the important details the first time.

Most medium and low-quality papers are full of errors and noise, but you can still learn from them.

Get your hands dirty with real code.

I would take a look at those:

https://github.com/geohot/tinygrad

https://github.com/ggerganov/whisper.cpp

It's one of the reasons I recently ported the Whisper model to plain C/C++. You just clone the repo, run `make [model]` and you are ready to go. No Python, no frameworks, no packages - plain and simple.

https://github.com/ggerganov/whisper.cpp
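Concretely, that quick start looks something like this (assuming the Makefile still provides the per-model convenience targets):

```shell
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp

# Convenience target: fetches the base.en model and transcribes
# the sample audio shipped with the repo
make base.en
```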

Not a hack per se but a complete reimplementation.

https://github.com/ggerganov/whisper.cpp

This is a C/C++ version of Whisper which uses the CPU. It's astoundingly fast. Maybe it won't work in your use case, but you should try!

This is a cool project. I’ve been very happy with whisper as an alternative to otter; it works better and solves real problems for me.

I feel compelled to point out whisper.cpp. It may be cheaper for the author to run, and it's relevant for others as well.

I was running Whisper on a GTX 1070 to get decent performance; it was terribly slow on an M1 Mac. whisper.cpp has performance comparable to the 1070 while running on the M1's CPU. It is easy to build and run, and well documented.

https://github.com/ggerganov/whisper.cpp

I hope this doesn’t come off the wrong way, I love this project and I’m glad to see the technology democratized. Easily accessible high-quality transcription will be a game changer for many people and organizations.

You might find my inference implementation of Whisper useful [0]. It has a C-style API that allows for easy integration in other projects and you can control how many CPU threads to be used during the processing.

[0] https://github.com/ggerganov/whisper.cpp
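As an illustration of the thread control mentioned above, from the command line it is the `-t` flag (the same knob is, as I understand it, exposed as a parameter of the C-style API):

```shell
# Pick the number of CPU threads with -t; more threads generally
# means faster transcription, up to the number of physical cores
./main -m models/ggml-base.en.bin -f audio.wav -t 8
```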

On M1 Pro, with the Greedy decoder and the medium model, I can transcribe 1 hour of audio in just 10 minutes (~6x real-time) [0].

[0] https://github.com/ggerganov/whisper.cpp

You can try my C/C++ port of Whisper:

https://github.com/ggerganov/whisper.cpp

No dependencies, no Python, runs efficiently on the CPU.