If I were Apple I'd be thinking about the following issues with that strategy:

1. That RAM isn't empty, it's being used by apps and the OS. Fill up 64GB of RAM with an LLM and there's nothing left for anything else.

2. 64GB probably isn't enough for competitive LLMs anyway.

3. Inferencing is extremely energy intensive, but the MacBook / Apple Silicon brand is partly about long battery life.

4. Weights are expensive to produce and valuable IP, but hard to protect on the client unless you do a lot of work with encrypted memory.

5. Even if a high-end MacBook can do local inferencing, the iPhone won't, and it's the iPhone that matters.

6. You might want to fine-tune models based on your personal data and history, but training is different from inference and best done in the cloud overnight (probably?).

7. Apple already has all that stuff worked out for Siri, which is a cloud service, not a local service, even though it'd be easier to run locally than an LLM.

And there are lots more issues with doing it all locally, fun though it is for developers to play with.

I hope I'm wrong; it'd be cool to have LLMs be fully local, but it's hard to see situations where the local approach beats the cloud approach. One possibility is simply cost: if your device does the work, you've already paid for the hardware; if a cloud does it, you pay for that hardware again via a subscription.

How much does 64GB of RAM cost, anyway? Retail it's around $200, and it's surely cheaper at Apple's component cost. Yet we treat it as an absurd luxury because Apple makes you buy the top-end 16" MacBook and then pay an extra $800 on top of that. Maybe in the future they'll treat RAM as a requirement rather than a luxury good.

And we know that more RAM will be cheaper in the future.

With the RAM, CPU, and GPU integrated on Apple Silicon, however it's done, it yields real performance results. I do think that probably costs more than separately produced RAM. And even apart from that, because they have a unified memory model unlike every other consumer device, they can charge for it. So 64, 96, or 128 GB?

It's not done for performance results; the Xbox doesn't have RAM on package and somehow does 560 GB/s.

The performance result I was referring to was the ability to run an LLM locally (like llama.cpp) that uses a giant amount of RAM on the GPU, like 40 GB. Without the unified memory model you end up paging endlessly, so unified memory is actually much faster for this application in this scenario. Unlike on a PC with a graphics card, you can use essentially your entire RAM for the GPU. This isn't possible on the Xbox because it doesn't have unified memory as far as I know. So having incredible throughput still won't match not having to page.
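To make that concrete, here's a minimal Swift sketch (just the standard Metal APIs, nothing from llama.cpp) that reports how much memory the default GPU is expected to be able to keep resident. On Apple Silicon that figure is backed by the same unified RAM the CPU uses, so it scales with however much RAM the machine has rather than with a fixed pool of VRAM:

    import Foundation
    import Metal

    // Minimal sketch: ask Metal how much memory the default GPU can
    // reasonably keep resident. On Apple Silicon this is unified system RAM,
    // so ~40 GB of weights can sit in GPU-addressable memory without being
    // copied into (or paged in and out of) a separate VRAM pool.
    if let device = MTLCreateSystemDefaultDevice() {
        let gib = Double(device.recommendedMaxWorkingSetSize) / Double(1 << 30)
        print("GPU: \(device.name)")
        print("Unified memory: \(device.hasUnifiedMemory)")
        print("Recommended max working set: \(String(format: "%.1f", gib)) GiB")
    } else {
        print("No Metal device available")
    }

On a machine with a discrete GPU the same call reports something close to the card's own VRAM instead, which is exactly the gap being described here.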

Edit: I found an example from HN user anentropic, who points at https://github.com/remixer-dec/llama-mps : "The goal of this fork is to use GPU acceleration on Apple M1/M2 devices.... After the model is loaded, inference for max_gen_len=20 takes about 3 seconds on a 24-core M1 Max vs 12+ minutes on a CPU (running on a single core)."
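For rough intuition on why the memory setup dominates here (all numbers below are assumptions for illustration, not from the thread): generating each token has to stream essentially the whole set of weights once, so token rate is capped by how fast you can read them.

    // Back-of-envelope sketch, all numbers assumed for illustration:
    // generating one token reads roughly the whole set of weights once,
    // so tokens/sec is capped by (read bandwidth) / (weight size).
    let weightBytes      = 40.0 * 1e9   // ~40 GB of weights, as in the comment above
    let unifiedBandwidth = 400.0 * 1e9  // M1 Max advertised memory bandwidth, bytes/sec
    let pcieBandwidth    = 32.0 * 1e9   // rough PCIe 4.0 x16 rate, if weights must be paged in

    print("In unified memory: ~\(unifiedBandwidth / weightBytes) tokens/sec upper bound")  // ~10
    print("Paged over the bus: ~\(pcieBandwidth / weightBytes) tokens/sec upper bound")    // ~0.8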