Is there any tutorial on how to use HuggingFace LLaMA 2-derived models? They don't have checkpoint files of the original LLaMA and can't be used by Meta's provided inference code; instead they use .bin files. I am only interested in Python code, so no llama.cpp.
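
For reference, the usual pure-Python route for those .bin checkpoints is the transformers library. A minimal sketch, assuming transformers, torch, and accelerate are installed; the model id is illustrative and may require accepting Meta's license on the Hub:

    # Minimal sketch of loading a HuggingFace-format LLaMA 2 model in pure Python.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-hf"  # illustrative model id

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # half precision to fit on a single GPU
        device_map="auto",          # let accelerate place the weights
    )

    inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=32)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))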

I'd reconsider your rejection of llama.cpp if I were you. You can always call out to it from Python, and llama.cpp is by far the most active project in this space; they've gotten the UX to the point where it's extremely simple to use.
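
If shelling out is acceptable, here's a minimal sketch of calling a built llama.cpp binary via subprocess; the binary path and model filename are hypothetical and the flags are the standard llama.cpp ones:

    import subprocess

    # Assumes llama.cpp's `main` binary has been built and a GGML model downloaded
    # (both paths below are hypothetical).
    result = subprocess.run(
        [
            "./main",
            "-m", "models/llama-2-7b.ggmlv3.q4_0.bin",  # quantized GGML weights
            "-p", "Building a website can be done in 10 simple steps:",
            "-n", "128",                                # tokens to generate
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    print(result.stdout)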

This user on HuggingFace has all the models ready to go in GGML format and quantized at various sizes, which saves a lot of bandwidth:

https://huggingface.co/TheBloke
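
Fetching one of those quantized files from Python is a one-liner with huggingface_hub; the repo id and filename below are illustrative, so pick the quantization size you want from the repo's file list:

    from huggingface_hub import hf_hub_download

    # Illustrative repo/filename from TheBloke's uploads.
    path = hf_hub_download(
        repo_id="TheBloke/Llama-2-7B-GGML",
        filename="llama-2-7b.ggmlv3.q4_0.bin",
    )
    print(path)  # local cache path of the downloaded weights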

I understand; I use llama.cpp for my own personal stuff, but I can't override the policy on the project I want to plug this into, which is Python-only.

There was a post yesterday about a 500-line single-file C implementation of llama2 with no dependencies. The llama2 architecture is hard-coded. It shouldn't be too hard to port to Python.

Found the repo, couldn't easily find the HN thread.

https://github.com/karpathy/llama2.c
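
To give a flavor of what such a port looks like, here's one building block (RMSNorm, which the llama2 architecture uses in place of LayerNorm) in numpy; a minimal sketch, not code from the repo:

    import numpy as np

    def rmsnorm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-5) -> np.ndarray:
        """RMS normalization as used by the LLaMA family:
        scale x by its root mean square, then apply a learned per-channel weight."""
        rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
        return (x / rms) * weight

The rest of the forward pass (attention with rotary embeddings, SwiGLU feed-forward) ports just as mechanically, since the architecture is fixed.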