Given that OpenAI Whisper is open source now, and pretty near SOTA, I think creating an audio-only open-source version of this shouldn't be difficult. However, I don't know how to easily contextualise the audio: how would I search for 'name of the movie I was discussing with Zeynep last week'?

I got OpenAI Whisper running locally on my Mac, but the plumbing to keep it from taxing system resources (like CPU) and to wire it up to search isn't trivial. It's on our roadmap.

You might find my inference implementation of Whisper useful [0]. It has a C-style API that makes it easy to integrate into other projects, and you can control how many CPU threads are used during processing; a minimal usage sketch follows the link below.

[0] https://github.com/ggerganov/whisper.cpp
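For anyone curious what that C-style API looks like in practice, here is a minimal sketch of transcribing a file with whisper.cpp. Function names reflect one version of the API and may have changed since; load_pcm_16khz_mono() is a hypothetical helper (whisper.cpp expects 16 kHz mono float PCM as input, which you'd produce with your own audio-decoding code), and the model path is just an example.

    // Minimal sketch: transcribe a file with whisper.cpp's C API.
    // NOTE: load_pcm_16khz_mono() is a hypothetical helper, not part of
    // whisper.cpp -- the library expects 16 kHz mono float PCM as input.
    #include <stdio.h>
    #include "whisper.h"

    // Hypothetical: decode an audio file to 16 kHz mono float samples.
    extern float * load_pcm_16khz_mono(const char * path, int * n_samples);

    int main(void) {
        // Load a ggml-converted Whisper model (example path).
        struct whisper_context * ctx = whisper_init_from_file("models/ggml-base.en.bin");
        if (ctx == NULL) {
            fprintf(stderr, "failed to load model\n");
            return 1;
        }

        // Greedy decoding with default parameters; n_threads caps CPU usage.
        struct whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
        wparams.n_threads = 4; // e.g. leave cores free for the rest of the system

        int n_samples = 0;
        float * pcm = load_pcm_16khz_mono("recording.wav", &n_samples);

        // Run the full encoder/decoder pipeline over the audio.
        if (whisper_full(ctx, wparams, pcm, n_samples) != 0) {
            fprintf(stderr, "transcription failed\n");
            whisper_free(ctx);
            return 1;
        }

        // Print each transcribed segment.
        for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
            printf("%s\n", whisper_full_get_segment_text(ctx, i));
        }

        whisper_free(ctx);
        return 0;
    }

Capping n_threads is what keeps the transcription from saturating every core, which speaks to the resource-taxing concern raised above.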