A quick survey of the thread seems to indicate the 7B-parameter LLaMA model does about 20 tokens per second (~4 words per second) on a base-model M1 Pro, by taking advantage of Apple Silicon’s Neural Engine.

Note that the latest iPhones ship with a Neural Engine in the same performance class as the one in current M-series MacBooks. Apple quotes Neural Engine throughput in trillions of operations per second rather than teraflops: nearly 17 for the iPhone 14 Pro’s A16, 15.8 for the M2 (and for the A15 in the non-Pro iPhone 14), and 11 for the M1 family. It might be much the same component in each chip. All iPhone 14 models have 6GB of integrated RAM; the MacBook starts at 8GB. On paper, the specs suggest an iPhone 14 Pro could achieve similar throughput to an M1 MacBook Pro.
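As a back-of-the-envelope check on the RAM side, here is a rough sketch of how much memory the weights alone would need. The bit widths are my assumptions, not figures from the thread, and the estimate ignores activations, the context cache, and whatever the OS keeps for itself, so treat it as a lower bound.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9        # 7B-parameter LLaMA, as discussed in the thread
IPHONE_RAM_GB = 6.0   # iPhone 14 lineup
MACBOOK_RAM_GB = 8.0  # base MacBook

# Assumed weight precisions (hypothetical, for illustration only).
for bits in (16, 8, 4):
    size = weight_memory_gb(N_PARAMS, bits)
    print(
        f"{bits:>2}-bit weights: {size:4.1f} GB | "
        f"under 6 GB: {size < IPHONE_RAM_GB} | under 8 GB: {size < MACBOOK_RAM_GB}"
    )
```

Under these assumptions only the 4-bit case (about 3.5 GB) leaves any headroom on a 6GB phone, so the memory argument depends heavily on how aggressively the weights are quantized.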

Some people have already had success porting Whisper to the Neural Engine, and as of 14 hours ago GGerganov (who wrote both this LLaMA port and the C++ port of Whisper) posted a GitHub comment saying he plans to work on that in the next few weeks.

So. With Whisper and LLaMA both showing better-than-real-time performance on the Neural Engine, plus Apple’s existing Siri Neural TTS, it looks like we have all the pieces needed to build a ChatGPT-level assistant that operates entirely through voice and runs entirely on your phone. This is absolutely extraordinary stuff!
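To put a number on "better than real-time" for the text generation half, here is a small sketch comparing the thread’s rough generation rate with a speaking pace. The 150 words-per-minute speaking rate is my assumption for conversational speech, not something from the thread.

```python
def realtime_factor(words_per_second: float, speaking_wpm: float = 150.0) -> float:
    """How many times faster than speech the model produces words."""
    speaking_words_per_second = speaking_wpm / 60.0
    return words_per_second / speaking_words_per_second


TOKENS_PER_SECOND = 20.0  # figure reported in the thread
WORDS_PER_SECOND = 4.0    # the thread's rough conversion of that figure

# Words-per-token ratio implied by the two thread figures above.
implied_words_per_token = WORDS_PER_SECOND / TOKENS_PER_SECOND

factor = realtime_factor(WORDS_PER_SECOND)
print(f"implied words per token: {implied_words_per_token:.2f}")
print(f"generation runs at {factor:.1f}x an assumed 150 wpm speaking pace")
```

Anything above 1.0x means the model can stay ahead of the TTS voice reading its output, at least under this assumed speaking rate.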