What does HackerNews think of faster-whisper?

Faster Whisper transcription with CTranslate2

Language: Python

#14 in Deep learning
One caveat here is that whisper.cpp does not offer any CUDA support at all; acceleration is only available on Apple Silicon.

If you have Nvidia hardware, the CTranslate2-based faster-whisper is very, very fast: https://github.com/guillaumekln/faster-whisper
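For reference, a minimal faster-whisper invocation on an Nvidia GPU looks roughly like this (the model size, file name, and compute type are illustrative choices, not anything prescribed by the comment above):

```python
# Minimal sketch of GPU transcription with faster-whisper.
# "medium.en", the file name, and compute_type are illustrative choices.
from faster_whisper import WhisperModel

# float16 keeps the model small enough for most consumer Nvidia cards.
model = WhisperModel("medium.en", device="cuda", compute_type="float16")

segments, info = model.transcribe("episode.mp3", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")

for segment in segments:
    # segments is a generator; transcription happens lazily as you iterate.
    print(f"[{segment.start:7.2f} -> {segment.end:7.2f}] {segment.text}")
```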

https://transcript.fish

I have been working on this podcast transcription project for a couple months and it's been super rewarding.

I listen to a podcast called No Such Thing As A Fish[0], where some researchers talk about their favorite facts they learned that week. Then they riff on it and are generally smart and funny. I listened to the series so many times that I decided I wanted to listen to the show on shuffle, not at the episode level, but at the fact level.

Since I have been playing around with whisper.cpp in Python, this seemed like a perfect way to combine some technologies I've been wanting to play with.

I ran whisper[1] over the entire podcast and transcribed all the episodes. I had to do this multiple times because I kept messing up. It eventually took like 7 straight days of my M1 processing to get through ~490 episodes using the medium.en model.

Four million words and an 800 MB SQLite database later, I got the transcriptions done and have put up a nice site for searching through the data.
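A rough sketch of that kind of pipeline, transcribing episode files with faster-whisper and dumping the segments into SQLite (the paths, table layout, and model settings here are made up for illustration, not the actual transcript.fish schema):

```python
# Hypothetical batch pipeline: transcribe every episode and store segments in SQLite.
# Paths, table names, and model settings are illustrative, not the real project's.
import sqlite3
from pathlib import Path

from faster_whisper import WhisperModel

conn = sqlite3.connect("transcripts.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS segments (
           episode TEXT,
           start_time REAL,
           end_time REAL,
           text TEXT
       )"""
)

# int8 keeps memory use reasonable when running on a laptop CPU such as an M1.
model = WhisperModel("medium.en", device="cpu", compute_type="int8")

for audio_file in sorted(Path("episodes").glob("*.mp3")):
    segments, _ = model.transcribe(str(audio_file))
    conn.executemany(
        "INSERT INTO segments VALUES (?, ?, ?, ?)",
        ((audio_file.stem, s.start, s.end, s.text) for s in segments),
    )
    conn.commit()

conn.close()
```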

Now I just need to figure out the rest. Breaking it up into facts. Getting the audio working. Highlighting and linking to words, phrases, etc.

Some cool info about the process so far:

1. The SQLite database is chunked up and stored as static files, and the frontend queries the static files directly using HTTP range requests, so it only downloads a couple hundred KB per query (see the sketch after this list).

2. I've been making heavy use of the free ChatGPT 3.5 to help me write Python and SQL. It's been pretty game-changing, as I feel basically no pain from not knowing what I'm doing.
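The range-request trick in point 1 can be illustrated in a few lines of Python. The real frontend presumably does this from the browser; the URL and page size below are hypothetical and only sketch the mechanism:

```python
# Sketch of the HTTP range-request idea: fetch only the bytes you need from a
# statically hosted SQLite file instead of downloading the whole database.
# The URL and page size are hypothetical.
import requests

DB_URL = "https://example.com/static/transcripts.db"  # hypothetical static file
PAGE_SIZE = 4096  # SQLite page size assumed by this sketch

def fetch_page(page_number: int) -> bytes:
    """Download a single database page via an HTTP Range request."""
    start = page_number * PAGE_SIZE
    end = start + PAGE_SIZE - 1
    resp = requests.get(DB_URL, headers={"Range": f"bytes={start}-{end}"})
    resp.raise_for_status()  # expect 206 Partial Content from the server
    return resp.content

# Page 0 holds the SQLite header; a query engine layered on top of fetch_page()
# walks B-tree pages this way, pulling only a few KB per query.
header = fetch_page(0)
print(header[:16])  # b"SQLite format 3\x00"
```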

The code is here: https://github.com/noman-land/transcript.fish

Please help if you know how to get Whisper speaker diarization working! I would really appreciate it.
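One common approach to that question (not something the project uses yet) is to run a separate diarization model such as pyannote.audio and assign each Whisper segment to the speaker whose turn overlaps it most. A rough sketch, assuming a pyannote 2.x pipeline and faster-whisper segments; the model names, token, and overlap heuristic are assumptions:

```python
# Rough sketch: combine pyannote.audio diarization with faster-whisper segments.
# Model names, the HF token, and the overlap heuristic are assumptions.
from faster_whisper import WhisperModel
from pyannote.audio import Pipeline

audio_path = "episode.wav"

# Speaker turns from pyannote (requires a Hugging Face access token).
diarization = Pipeline.from_pretrained(
    "pyannote/speaker-diarization", use_auth_token="HF_TOKEN"
)(audio_path)
turns = [
    (turn.start, turn.end, speaker)
    for turn, _, speaker in diarization.itertracks(yield_label=True)
]

# Transcript segments from faster-whisper.
model = WhisperModel("medium.en", device="cpu", compute_type="int8")
segments, _ = model.transcribe(audio_path)

def overlap(a_start, a_end, b_start, b_end):
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

for seg in segments:
    # Pick the speaker whose turn overlaps this segment the most.
    speaker = max(turns, key=lambda t: overlap(seg.start, seg.end, t[0], t[1]))[2]
    print(f"[{speaker}] {seg.text}")
```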

Also, any tips on super-efficient ways to index[2] or search[3] my database would be helpful. Indexing matters a lot when querying the database in ranges like this, I have learned...

[0] https://www.nosuchthingasafish.com/

[1] https://github.com/guillaumekln/faster-whisper

[2] https://github.com/noman-land/transcript.fish/blob/master/db/...

[3] https://github.com/noman-land/transcript.fish/blob/master/sr...
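On the indexing and search question above, a generic SQLite answer is a composite index for the range queries plus an FTS5 table for text search. A sketch with hypothetical table and column names (the real schema lives in the repo linked above):

```python
# Generic SQLite indexing/search sketch; table and column names are hypothetical,
# not the actual transcript.fish schema.
import sqlite3

conn = sqlite3.connect("transcripts.db")

# A composite index turns range queries like
#   WHERE episode = ? AND start_time BETWEEN ? AND ?
# into a single index scan instead of a full table scan.
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_segments_episode_time "
    "ON segments (episode, start_time)"
)

# FTS5 gives fast full-text search over the transcript text.
conn.execute(
    "CREATE VIRTUAL TABLE IF NOT EXISTS segments_fts "
    "USING fts5(text, content='segments', content_rowid='rowid')"
)
conn.execute(
    "INSERT INTO segments_fts(rowid, text) SELECT rowid, text FROM segments"
)
conn.commit()

# Example search: episodes and timestamps where the word "otter" is mentioned.
rows = conn.execute(
    """SELECT s.episode, s.start_time, s.text
       FROM segments_fts f JOIN segments s ON s.rowid = f.rowid
       WHERE segments_fts MATCH 'otter'
       LIMIT 10"""
).fetchall()
print(rows)
```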

A comparison by a competitor, but it's believable IMO. Basically about the same performance as Whisper:

- https://deepgram.com/learn/nova-speech-to-text-whisper-api

Not surprising, though, as at this level all these options are starting to be leveled by inconsistencies in the manual ground truth. Conformer alone also isn't the most powerful architecture out there for speech. This is also slower than, say, running a large k2 zipformer via ONNX on CPU.

Also, if you have a small shop, at this point you can do all of this yourself with Whisper large-v2 on a single 16 GB GPU via some tweaking of https://github.com/guillaumekln/faster-whisper and an OSS LLM.
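One way to run large-v2 comfortably within a 16 GB card is a quantized compute type; a minimal sketch, with illustrative settings rather than anything the commenter specified:

```python
# Sketch: Whisper large-v2 on a single 16 GB GPU via CTranslate2 quantization.
# int8_float16 roughly halves the weight memory versus float16, leaving plenty
# of headroom on a 16 GB card; beam and VAD settings are illustrative.
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="int8_float16")

segments, info = model.transcribe("call_recording.wav", beam_size=5, vad_filter=True)
for segment in segments:
    print(f"[{segment.start:.1f}s] {segment.text}")
```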

Interesting stuff, but I think margins in this space are getting ready to simply vanish.

I've been looking for faster implementations of Whisper; the main drawback with Whisper JAX is that the performance comes from running on Google TPUs, which are much more expensive than GPUs.

On "normal" GPUs the fastest implementation I've found is https://github.com/guillaumekln/faster-whisper. Whisper.cpp works faster on a CPU, especially on Apple Silicon, but still nowhere near the performance you could get on a GPU (understandably).

How does Whisper JAX compare to faster-whisper on a GPU?

Faster Whisper is 8x faster than real time on CPU and even faster on GPU. https://github.com/guillaumekln/faster-whisper
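A quick way to check a claim like that on your own hardware is to time a transcription and divide by the audio duration; a sketch, where the model size, thread count, and file name are arbitrary choices:

```python
# Sketch: measure the real-time factor of faster-whisper on CPU.
# Model size, thread count, and file name are arbitrary choices.
import time

from faster_whisper import WhisperModel

model = WhisperModel("small.en", device="cpu", compute_type="int8", cpu_threads=8)

start = time.perf_counter()
segments, info = model.transcribe("sample.mp3")
text = " ".join(segment.text for segment in segments)  # forces the lazy generator
elapsed = time.perf_counter() - start

# info.duration is the audio length in seconds.
print(f"Audio: {info.duration:.1f}s, transcribed in {elapsed:.1f}s "
      f"-> {info.duration / elapsed:.1f}x real time")
```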

Vocode uses Whisper for real-time, zero-latency voice chat with ChatGPT. Give their demo line a call to see how well it works: +1-650-729-9536