Since this is pytorch it should run on cpu anyway. What am I missing?

I guess the simple fact that it didn't before his patch?

Usually you can trivially switch a model between CPU and GPU by calling .cpu() or .cuda() in the right places, so he's wondering why that wasn't the case here.
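A minimal sketch of what that switch usually looks like. The model here is a hypothetical stand-in; the point is that idiomatic PyTorch picks a device once and moves the model and inputs to it, rather than hard-coding .cuda() calls:

```python
import torch

# Pick the device once; everything else is device-agnostic.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(16, 4).to(device)   # toy model standing in for the real one
x = torch.randn(2, 16, device=device)       # inputs must live on the same device

y = model(x)
print(y.shape)  # torch.Size([2, 4])
```

Written this way, the same script runs on a GPU box and a CPU-only laptop without edits.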

That's literally all I did (plus switching the tensor type). I'd imagine people are posting and upvoting this not because the code itself is interesting, but because it runs unexpectedly fast on consumer CPUs, which isn't something they considered feasible before.

How are you getting this to run fast? I'm on a top of the line M1 MBP and getting 1 token every 8 minutes.

Try switching all the .cuda() calls to .to("mps"). I got a 100x speedup on a different language model on a MacBook Air M1.

https://pytorch.org/docs/stable/notes/mps.html
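A short sketch of the pattern from the linked docs, with a CPU fallback for machines without Apple silicon (the model here is a toy placeholder):

```python
import torch

# Use the Metal Performance Shaders backend when available, else fall back to CPU.
if torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

model = torch.nn.Linear(8, 8).to(device)
x = torch.ones(1, 8, device=device)
print(model(x).device.type)  # "mps" on Apple silicon, "cpu" elsewhere
```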