What does HackerNews think of whisper?

Robust Speech Recognition via Large-Scale Weak Supervision

Language: Jupyter Notebook

The current voice transcription engine at OpenAI is using Whisper-1[0] which is open source and runnable locally, if you wanted to keep it all on-device. I run it locally for various things and it works pretty damn well.

[0] https://github.com/openai/whisper

Check out Whisper; it is surprisingly good, several of the output formats include time codes, and it works with multiple languages.
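Those time-coded output formats (SRT, VTT) come from per-segment timestamps: in the Python API, `result["segments"]` carries `start`/`end` times in seconds. A minimal sketch of turning such segments into SRT blocks (the helper and the sample segments are my own illustration, not part of Whisper):

```python
# Convert Whisper-style segments (start/end in seconds) into SRT text.
# The sample segments below are fabricated for illustration.

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timecode: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Number each segment and pair it with its timecode range."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 5.0, "text": " This is a test."},
]
print(segments_to_srt(segments))
```

In practice you would feed `model.transcribe(...)["segments"]` straight into a helper like this, or just pass an output-format flag to the CLI.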


Is there any indication that USM will be open-sourced, though? It seems to be competing with Whisper more than anything else.


Yes. According to the OpenAI Whisper repo [0], when you use the 'whisper' command line tool:

"Adding --task translate will translate the speech into English."

I tried translating a conference with a German speaker. The transcription was superb, but the translation not so much.

[0] https://github.com/openai/whisper
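Put together with the rest of the command, a full invocation looks something like this (a sketch: the audio file name is a placeholder, and the model choice is just a reasonable default):

```shell
# Translate German speech into English text with the openai-whisper CLI.
# Larger models generally translate better; "medium" is a common middle ground.
whisper talk_de.mp3 --model medium --language German --task translate
```

Omitting `--task translate` gives you a plain transcription in the source language instead.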

> Machine translation is hilarious at best and dangerously wrong at worst.

I picked a random passage from a novel in French I am currently reading. ChatGPT translated the three paragraphs I ran it on correctly; there are no major quibbles to be had. It is good, coherent English, a correct translation, which closely follows the French original, even capturing some of the poetic imagery effectively.

I'm sure after another paragraph or two there will be a weird screw-up. And there's no consistency in a running translation of any length. Etc. Yes, it's not perfect. Not fully human-equivalent.

Still. I remember when machine translation like what I just did was the realm of science fiction. And I thought it would remain science fiction for a long time. The fact that such a thing isn't mind-blowing anymore shows how far things have come, doesn't it?

> Speech recognition - Siri still has major issues understanding me.

I am using speech-to-text AI transcription every day. It's been revolutionary for me. I am hard of hearing. The cutting edge is Whisper, and it is leaps and bounds beyond the state of the art of just a year ago: https://github.com/openai/whisper

Whisper. Speech recognition and translation. Probably the best general-purpose speech recognition available right now. Certainly the best you can run on your own hardware. Open-sourced code and publicly free and available models. https://github.com/openai/whisper
Is there any chance you could expose a pathway to use a local instance of Whisper? I ask primarily because OpenAI completely open-sourced Whisper in September 2022[0]. It seems odd to me to default to, or encourage the use of, a paid service for something that appears to be available for free under the MIT license, models included[1].

My understanding is that the only reason OpenAI even set up the paid API is because it "can also be hard to run [sic]". Personally, I'm skeptical. I'm not knocking them for it, but I could see how this is just capitalizing on the brand.

[0]: https://openai.com/blog/introducing-chatgpt-and-whisper-apis...

[1]: https://github.com/openai/whisper

Yes, you need to fine-tune the model with your data. This might be easy or hard, depending on your experience level, the complexity of the model, and the available tooling.

For this model specifically (https://github.com/openai/whisper) it would be a significant challenge for a newcomer. Luckily Huggingface has a blog post that will get you started: https://huggingface.co/blog/fine-tune-whisper
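At a high level, that recipe boils down to a few steps (pseudocode only; the real class and argument names are in the blog post):

```
load dataset of (audio, transcript) pairs      # e.g. Common Voice
processor = feature extractor + tokenizer      # WhisperProcessor in the post
model     = pretrained Whisper checkpoint      # e.g. whisper-small
for each batch:
    features = processor(audio)                # log-Mel spectrograms
    labels   = processor(transcript)           # token ids
    loss     = model(features, labels)         # cross-entropy on the tokens
    update model weights                       # Seq2SeqTrainer handles this
evaluate with word error rate (WER) on held-out data
```

Most of the newcomer-facing difficulty is in data preparation and GPU memory management, not the training loop itself.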

They are pre-trained. This project is running a port of the original OpenAI release [0] to C++.

From the OpenAI paper and release notes: "We are releasing models and inference code to serve as a foundation for further work on robust speech processing." So I guess they are either truly altruistic in this, or they are planning on monetising whatever they build on top of it.

Also, OpenAI is a startup (if we can call it that), so their value right now is more about being impressive and looking like future value than about showing an immediate route to profit.

[0] https://github.com/openai/whisper

I avoid needing to rely on my human memory (recall) when possible - if my phone is around, I make voice notes on it that are automatically synced to desktop which are then transcribed using Whisper[1].

When I do need to recall, I just use the strategy of "hold the items in my head as intently as possible, for as little time as possible until I'm able to get to my phone or some paper".

On the rare occasions when neither my phone nor paper is immediately available, I try to visualize the ideas, projecting them onto a mental canvas, and use connections between them or mnemonics to remember them as best I can.

I avoid the problem of "I sit down in front of my computer in order to write down or do a thing" by building the discipline necessary to prevent myself from getting distracted in that manner.

[1] https://github.com/openai/whisper

The magic is still there! In this case, the model for OpenAI's Whisper, which is arguably doing the bulk of the work here, is Open Source (under the MIT licence), and freely available for download at https://github.com/openai/whisper. You can run it wherever you want, though something with a GPU will let you do 5x realtime (or better!) transcription.
At this point in the SOTA, Whisper [0] can probably be a drop-in replacement for for-profit products like Nuance's Dragon. It's even open source.

[0] https://github.com/openai/whisper

I would guess that they're using OpenAI's Whisper, which is open source: https://github.com/openai/whisper

It does speech-to-text, then you can use the full force of all the text analysis tools that are out there.
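Once the audio is text, the downstream step is ordinary text processing. As a toy illustration (the transcript string here is invented, standing in for `model.transcribe(...)["text"]`):

```python
from collections import Counter

# Stand-in for text produced by Whisper's speech-to-text step.
transcript = "the meeting covered the budget and the budget timeline"

# Any off-the-shelf text tooling applies from here; a trivial word count:
counts = Counter(transcript.lower().split())
print(counts.most_common(2))  # -> [('the', 3), ('budget', 2)]
```

The same handoff works for search indexing, summarization, sentiment analysis, or anything else that consumes plain text.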

I've been thinking about wiring up whisper[0], mozilla's tts[1], and gpt-3 together to make a voice assistant of sorts. It wouldn't have access to device hardware, and there'd be no guarantee of correct answers, but it should blow Siri etc. out of the water in terms of understanding context.

[0] https://github.com/openai/whisper [1] https://github.com/mozilla/TTS
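The loop such an assistant would run is conceptually simple (pseudocode; the component names just follow the links above, nothing here is a real API):

```
loop:
    audio = record microphone until silence
    text  = whisper(audio)           # speech -> text
    reply = gpt3(history + text)     # text -> answer, with conversational context
    play(tts(reply))                 # text -> speech via mozilla/TTS
```

The hard parts in practice are latency (each stage adds seconds) and deciding when the user has finished speaking.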

I mentioned Whisper because it works with a lot of languages. But I understand your confusion, because there are additional lightweight models that are only available for English. Its accuracy is worse for Hebrew, but instructional materials are likely close to optimal input.


> And it only works with English.

Depending on how you define "work", Whisper also works with Hebrew. Not sure if the word error rate is acceptable, though: https://github.com/openai/whisper/#available-models-and-lang...

They released CLIP (both model and code[1]), which is very broadly used in Dall-E alternatives. For example Stable Diffusion uses it.

They also released the Whisper model and code[2].

[1] https://github.com/openai/CLIP

[2] https://github.com/openai/whisper

Truly proves the saying, "Get Woke, Go Broke". All this pearl-clutching over safety really did a disservice to them.

In all fairness, their release of Whisper[0] last week is actually really amazing. Like CLIP, it has the ability to spawn a lot of further research and work thanks to the open source aspect of it. I hope OpenAI learns from this, downgrades the "safety" shills, and focuses on producing more high-quality open source work, both code and models, which will move the field forward.

[0]: https://github.com/openai/whisper

> Neat, https://github.com/openai/whisper - they have open-sourced it, even the model weights, so they are living up to their name in this instance.

Perhaps it will encourage people to add voice commands to their apps, which could then be sent to GPT-3.