What does HackerNews think of llama-mps?

Experimental fork of Facebook's LLaMA model which runs it with GPU acceleration on Apple Silicon M1/M2

Language: Python

#60 in macOS
The perf results I was referring to were the ability to run an LLM locally (like llama.cpp) that uses a huge amount of RAM on the GPU, like 40 GB. Without this unified memory model you end up paging endlessly, so for this workload the Mac is actually much faster. Unlike on a PC with a discrete graphics card, you can use your entire RAM for the GPU. This isn't possible on the Xbox because, as far as I know, it doesn't have unified memory. So having incredible throughput still won't make up for having to page.
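For what it's worth, the unified-memory point is easy to see with a few lines of stock PyTorch (this is not code from the fork): a tensor placed on the mps device is carved out of the same system RAM the CPU uses, so "GPU memory" is bounded only by total RAM.

```python
import torch

# Minimal sketch (not from the linked fork): on Apple Silicon the "mps"
# backend allocates GPU buffers from the same unified memory pool as the
# CPU, so model weights far larger than any discrete card's VRAM can stay
# resident on the GPU as long as they fit in system RAM.
assert torch.backends.mps.is_available(), "needs an Apple Silicon Mac with an MPS-enabled PyTorch build"

device = torch.device("mps")

# ~1 GB of fp16 values as a stand-in for one shard of a large model;
# scale the element count up only if your machine has the RAM to spare.
weights = torch.empty(500_000_000, dtype=torch.float16, device=device)
print(f"allocated {weights.element_size() * weights.nelement() / 1e9:.1f} GB on {device}")
```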

Edit: I found an example from HN user anentropic, pointing at https://github.com/remixer-dec/llama-mps. "The goal of this fork is to use GPU acceleration on Apple M1/M2 devices. ... After the model is loaded, inference for max_gen_len=20 takes about 3 seconds on a 24-core M1 Max vs 12+ minutes on a CPU (running on a single core)."
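That CPU-vs-MPS gap can be probed with stock PyTorch and Hugging Face transformers rather than the fork's own loader. The sketch below is illustrative only: it uses gpt2 as a small stand-in model (an assumption on my part, since the LLaMA weights are gated) and times 20 new tokens on each device, loosely mirroring the max_gen_len=20 setting from the quote.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model; swap in a local LLaMA checkpoint if you have one.
MODEL_ID = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

prompt = "The unified memory on Apple Silicon means"

for device in ("cpu", "mps"):
    if device == "mps" and not torch.backends.mps.is_available():
        continue
    m = model.to(device)
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    start = time.perf_counter()
    # Roughly equivalent to the fork's max_gen_len=20 setting.
    m.generate(**inputs, max_new_tokens=20, do_sample=False)
    print(f"{device}: {time.perf_counter() - start:.2f}s for 20 new tokens")
```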

The unified memory ought to be great for running LLaMA on the GPU on these MacBooks (since it can't run on the Neural Engine currently).

The point of llama.cpp is that most people don't have a GPU with enough RAM; Apple's unified memory ought to solve that (see the rough sizing sketch below).

Some people apparently have it working:

https://github.com/remixer-dec/llama-mps
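
To put a rough number on "enough RAM", here is a back-of-the-envelope weight-only estimate (my own figures, not from the thread) for the published LLaMA sizes at fp16 versus the roughly 4-bit quantization llama.cpp uses; it ignores the KV cache and activations.

```python
# Weight-only memory estimates for the LLaMA variants. Only the smallest
# models fit a typical 16-24 GB consumer card at fp16, while a 64 GB
# unified-memory Mac can hold even 65B once quantized to ~4 bits.
PARAMS_BILLIONS = {"7B": 7, "13B": 13, "33B": 33, "65B": 65}

for name, billions in PARAMS_BILLIONS.items():
    fp16_gb = billions * 2    # 2 bytes per parameter
    int4_gb = billions * 0.5  # ~4 bits per parameter
    print(f"LLaMA {name}: ~{fp16_gb:.0f} GB at fp16, ~{int4_gb:.1f} GB at 4-bit")
```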