I would care more about the LLaMA architecture once I can get my hands on it. Honestly, this project is more interesting, and lightning fast even on a 2060 laptop: https://github.com/BlinkDL/RWKV-LM

Has anyone tried this on an M1 machine?

fswd

I've run it on an AMD 3950, which I think is half the speed of an M1, and it's plenty fast. Note that I am specifically talking about RWKV.

It would be interesting to see a version of RWKV[1] that adopts some of the improvements in LLaMA (e.g. the SwiGLU activation function and rotary embeddings, although I think they have tried rotary embeddings in some versions of RWKV) and is trained on the same dataset, to see how it does.
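
For anyone who hasn't run into SwiGLU, here is a rough sketch of what that feed-forward block computes: a SiLU-gated projection multiplied elementwise with a second projection, then projected back down. This is pure Python for readability, not code from LLaMA or RWKV; the names `silu`, `matvec`, `swiglu_ffn`, `w_gate`, `w_up` and `w_down` are my own, and I'm assuming the bias-free form.

```python
import math

def silu(z):
    """SiLU (Swish with beta = 1): z * sigmoid(z), written to avoid overflow."""
    if z >= 0:
        return z / (1.0 + math.exp(-z))
    e = math.exp(z)
    return z * e / (1.0 + e)

def matvec(x, w):
    """Multiply row vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Bias-free SwiGLU feed-forward: (silu(x @ w_gate) * (x @ w_up)) @ w_down."""
    gate = [silu(g) for g in matvec(x, w_gate)]   # gating branch
    up = matvec(x, w_up)                          # linear branch
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise product
    return matvec(hidden, w_down)                 # project back to model width

# Toy usage with 2x2 identity weights (illustrative values only).
identity = [[1.0, 0.0], [0.0, 1.0]]
out = swiglu_ffn([0.0, 1.0], identity, identity, identity)
```

The extra gate matrix is the main cost: a SwiGLU block has three weight matrices where a plain two-layer MLP has two, so the hidden width is usually shrunk to keep the parameter count comparable.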

The dataset is interesting. It's not dissimilar to The Pile, which RWKV is already trained on, but it does seem to have quite a lot more preprocessing to improve dataset quality.

[1] https://github.com/BlinkDL/RWKV-LM