What does HackerNews think of Whisper?

High-performance GPGPU inference of OpenAI's Whisper automatic speech recognition (ASR) model

Language: C++

Gamers don't care about FP64 performance, and it seems nVidia is using that for market segmentation. The FP64 performance of the RTX 4090 is 1.142 TFlops; for the RTX 3090 Ti it's 0.524 TFlops. AMD doesn't do that; FP64 performance is consistently better there, and has been for quite a few years. For example, the figure for the 3090 Ti (a $2000 card from 2022) is similar to that of the Radeon Vega 56, a $400 card from 2017 which can do 0.518 TFlops.

And another thing: nVidia forbids usage of GeForce cards in data centers, while AMD allows it. I don't know how exactly they define "data center", whether the clause is enforceable, or whether it's been tested in the courts of various jurisdictions. I just don't want to find out the answers to those questions at my employer's legal expense, and I believe they would prefer not to cut corners like that.

I think nVidia only beats AMD due to the ecosystem: for GPGPU that's CUDA (and especially the included first-party libraries for BLAS, FFT, DNN and others), plus the support in popular frameworks like TensorFlow. However, it's not that hard to ignore the ecosystem and instead write some compute shaders in HLSL. Here's a non-trivial open-source project unrelated to CAE where I managed to do just that with decent results: https://github.com/Const-me/Whisper. That software even works on Linux, probably thanks to Valve's work on DXVK 2.0 (a compatibility layer which implements D3D11 on top of Vulkan).
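
To give a concrete idea of what that approach looks like on the host side, here's a minimal illustrative sketch of compiling and dispatching an HLSL compute shader through D3D11 from C++. It is not code from the linked project; the toy kernel, buffer size, and thread-group count are placeholders.

    // Minimal D3D11 compute dispatch sketch (illustrative only, not code from the linked repo).
    // The HLSL kernel, buffer size and group count are placeholders.
    #include <d3d11.h>
    #include <d3dcompiler.h>
    #include <wrl/client.h>
    #include <vector>
    #include <cstring>
    #include <cstdio>
    #pragma comment(lib, "d3d11.lib")
    #pragma comment(lib, "d3dcompiler.lib")
    using Microsoft::WRL::ComPtr;

    // Toy kernel: doubles every element of a structured buffer.
    static const char* hlsl = R"(
    RWStructuredBuffer<float> data : register(u0);
    [numthreads(64, 1, 1)]
    void main(uint3 id : SV_DispatchThreadID) { data[id.x] *= 2.0f; }
    )";

    int main()
    {
        // Create a D3D11 device on the default GPU; vendor-agnostic, and works on Linux via DXVK.
        ComPtr<ID3D11Device> dev;
        ComPtr<ID3D11DeviceContext> ctx;
        if (FAILED(D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
                                     nullptr, 0, D3D11_SDK_VERSION, &dev, nullptr, &ctx)))
            return 1;

        // Compile the compute shader from source at runtime.
        ComPtr<ID3DBlob> code, errors;
        if (FAILED(D3DCompile(hlsl, strlen(hlsl), nullptr, nullptr, nullptr,
                              "main", "cs_5_0", 0, 0, &code, &errors)))
            return 1;
        ComPtr<ID3D11ComputeShader> cs;
        dev->CreateComputeShader(code->GetBufferPointer(), code->GetBufferSize(), nullptr, &cs);

        // A 256-element structured buffer with an unordered access view for the shader to write.
        std::vector<float> input(256, 1.0f);
        D3D11_BUFFER_DESC bd = {};
        bd.ByteWidth = UINT(input.size() * sizeof(float));
        bd.Usage = D3D11_USAGE_DEFAULT;
        bd.BindFlags = D3D11_BIND_UNORDERED_ACCESS;
        bd.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
        bd.StructureByteStride = sizeof(float);
        D3D11_SUBRESOURCE_DATA init = { input.data() };
        ComPtr<ID3D11Buffer> buf;
        dev->CreateBuffer(&bd, &init, &buf);
        ComPtr<ID3D11UnorderedAccessView> uav;
        dev->CreateUnorderedAccessView(buf.Get(), nullptr, &uav);

        // Bind and dispatch: 256 elements / 64 threads per group = 4 thread groups.
        ctx->CSSetShader(cs.Get(), nullptr, 0);
        ctx->CSSetUnorderedAccessViews(0, 1, uav.GetAddressOf(), nullptr);
        ctx->Dispatch(4, 1, 1);

        printf("Compute shader dispatched.\n");
        return 0;
    }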

I just used Whisper over the weekend to transcribe 5 hours of meetings; it worked nicely, and it can be run locally on a single GPU. The best part? It's free. https://github.com/ggerganov/whisper.cpp

It took about 5 minutes to process each hour of audio on a 1080 Ti GPU.
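
If you'd rather drive it from code than from the CLI, here's a rough sketch of calling whisper.cpp's C API from C++. The model filename and the pre-decoded 16 kHz mono float PCM buffer are assumptions; real code would decode the recording into that buffer first.

    // Rough sketch of a whisper.cpp transcription, assuming a ggml model file on disk
    // and audio already decoded to 16 kHz mono float PCM (decoding omitted here).
    #include "whisper.h"
    #include <cstdio>
    #include <vector>

    int main()
    {
        // Model path is an assumption; the ggml model file is downloaded separately.
        struct whisper_context* ctx = whisper_init_from_file("ggml-large.bin");
        if (!ctx)
            return 1;

        // Placeholder buffer: one minute of silence at 16 kHz mono.
        std::vector<float> pcm(16000 * 60, 0.0f);

        // Greedy sampling with default parameters.
        whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);

        // Run the full encoder/decoder pipeline over the buffer.
        if (whisper_full(ctx, params, pcm.data(), (int)pcm.size()) != 0)
            return 1;

        // Print the transcribed segments.
        const int n = whisper_full_n_segments(ctx);
        for (int i = 0; i < n; ++i)
            printf("%s\n", whisper_full_get_segment_text(ctx, i));

        whisper_free(ctx);
        return 0;
    }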

There are a few wrappers available with a GUI, like https://github.com/Const-me/Whisper

Yeah, the community always seems to figure out how to do things more effectively.

My girlfriend asked me if I could transcribe some audio files for her with my "programming stuff". I immediately thought of Whisper from OpenAI.

I first used the official CLI tool. With the largest model, it took a long 8 hours to transcribe a 30-minute file. I noticed it was running on the CPU, and I tried switching it to the GPU instead with no luck. Running it on WSL was probably not helping.

Then I found this gem: https://github.com/Const-me/Whisper, a C++ Windows implementation of Whisper. I opened the program and fed it the largest model and the file. The transcript was done in 4 minutes, instead of 8 hours... Downside? The program has a GUI, lol.

Of course, I could probably get the CLI tool to run on the GPU with some tinkering and by installing some Nvidia packages for Whisper to use. But frankly, I have so little experience with that kind of stuff that installing the Windows implementation was a much easier choice.

And there's a fork of it that uses DirectCompute to run on GPUs without CUDA on Windows:

https://github.com/Const-me/Whisper

You could also port that code from nVidia CUDA to something else. If you want to run on servers, Vulkan Compute will probably do the trick. I have ported ML inference to Windows desktops with DirectCompute: https://github.com/Const-me/Whisper

These NVIDIA A40 GPUs mentioned on the page you've linked cost $4000 (Amazon) to $10000 (Dell), yet deliver performance similar to the Intel Arc A770, which costs $350. That's about an order of magnitude difference in cost efficiency ($4000 / $350 ≈ 11×).

> All they will ever offer is a paid prompt.

They sometimes open-source their older models: https://github.com/Const-me/Whisper

I’ve asked ChatGPT about hardware. Here’s the response:

> As for your specific computer, with 64 GB of RAM and a high-performance GPU like the GeForce 1080 Ti, it should have sufficient resources to run a language model like me for many common tasks.

Based on the models open-sourced by OpenAI, they are using PyTorch and CUDA. This means their stack requires nVidia GPUs. I think the main reason for their high costs is a single sentence in the EULA of the GeForce drivers: https://www.datacenterdynamics.com/en/news/nvidia-updates-ge...

It's technically possible to port their GPGPU code from CUDA to something else. Here's a vendor-agnostic DirectCompute re-implementation of their Whisper model: https://github.com/Const-me/Whisper

On servers, DirectCompute is not great because Windows Server licenses are expensive. Still, I did that port alone and spent a couple of weeks on it.

OpenAI probably has the resources to port their inference to vendor-agnostic Vulkan Compute, running on Linux servers equipped with reasonably priced AMD or Intel GPUs. For instance, the Intel Arc A770 16GB costs only $350 but delivers performance similar to the nVidia A30, which costs $16000. The Intel card consumes more electricity, but not by much: 225W versus 165W. That's roughly a 40× difference in the cost efficiency of running that chat ($16000 / $350 ≈ 46×).

My implementation of Whisper uses slightly over 4 GB of VRAM when running their large multilingual model: https://github.com/Const-me/Whisper