What does HackerNews think of llama?

Inference code for LLaMA models

Language: Python

Is it possible to run a lightweight language model, perhaps this one (https://github.com/facebookresearch/llama), using ExecuTorch on a smartphone in real time for a chatbot app? Please share some guidance.
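For reference, the usual ExecuTorch flow is to export the PyTorch module ahead of time and ship the resulting .pte program to the phone, where the ExecuTorch runtime loads it. A minimal sketch of that export step, using a tiny stand-in module rather than a real LLaMA checkpoint (everything below is illustrative, not the repo's own code):

```python
import torch
import torch.nn as nn
from torch.export import export
from executorch.exir import to_edge

# Tiny stand-in module purely for illustration; a real LLaMA model would be
# loaded from the downloaded checkpoint instead.
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16)).eval()
example_input = (torch.randn(1, 16),)

exported = export(model, example_input)            # capture the graph ahead of time
edge_program = to_edge(exported)                   # lower to ExecuTorch's Edge dialect
executorch_program = edge_program.to_executorch()  # produce the on-device program

# The .pte file is what the ExecuTorch runtime loads on the smartphone.
with open("model.pte", "wb") as f:
    f.write(executorch_program.buffer)
```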
I have a question: last week I downloaded llama-7b-chat directly from Meta's GitHub (https://github.com/facebookresearch/llama) using the URL they sent via e-mail. As a result, I now have the model as consolidated.00.pth.

Your commands assume the model is a .bin file (so I guess there must be a way to convert the PyTorch .pth model to a .bin file). How can I do this, and what is the difference between the two formats?

The Facebook repo provides commands for using the models, but these commands don't work on my Windows machine: "NOTE: Redirects are currently not supported in Windows or MacOs. [W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to ...."

The Facebook repo does not say which OS you are supposed to use, so I assumed it would work on Windows too. But then, if this can work, why would anyone need ggerganov's llama.cpp code? I am new to all of this and easily confused, so any help is appreciated.
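On the .pth vs .bin question: consolidated.00.pth is an ordinary PyTorch checkpoint (a dict mapping parameter names to weight tensors), while the .bin file is the GGML serialization that ggerganov's llama.cpp builds from those same weights with its conversion script so they can run on CPU. A quick way to confirm what is inside the .pth, assuming the path below matches your download:

```python
import torch

# consolidated.00.pth is a plain PyTorch checkpoint: a mapping from parameter
# names to weight tensors. (The path is an assumption; adjust to your download.)
state_dict = torch.load("llama-2-7b-chat/consolidated.00.pth", map_location="cpu")

for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape), tensor.dtype)
```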

I filled out the form about an hour ago and got the download link 15 minutes ago. The download is ongoing.

Direct link to request access form: https://ai.meta.com/resources/models-and-libraries/llama-dow...

Direct link to request access on Hugging Face (use the same email): https://huggingface.co/meta-llama/Llama-2-70b-chat-hf

Direct link to repo: https://github.com/facebookresearch/llama

Once you get the download link by email, make sure to copy it without spaces; one option is to open it in a new tab and then download. If you are using fish or another fancy shell, switch to bash or sh before running download.sh from the repo.

I am not sure exactly how much space is needed, but it is likely north of 500GB given that there are two 70B models (you are given the option to download just the small ones in a prompt).
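As a rough back-of-envelope (my own estimate, not a figure from the repo), fp16 weights take about 2 bytes per parameter:

```python
# Rough fp16 checkpoint sizes at 2 bytes per parameter; the actual download also
# includes both a base and a chat variant of each size plus tokenizer files.
for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(f"{name}: ~{params * 2 / 1e9:.0f} GB per variant")
```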

Edit: The_Bloke on HF already has them available for download in GGML format.

https://huggingface.co/TheBloke/Llama-2-7B-GGML

https://huggingface.co/TheBloke/Llama-2-13B-GGML
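If you go the GGML route, one way to run those files from Python is the llama-cpp-python bindings. A minimal sketch, assuming you have downloaded one of the quantized files (the exact filename below is an assumption; use whichever quantization you actually grabbed):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# The path/filename is an assumption based on the repo's naming convention;
# point it at the GGML file you downloaded.
llm = Llama(model_path="llama-2-7b.ggmlv3.q4_0.bin", n_ctx=2048)

result = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["\n"])
print(result["choices"][0]["text"])
```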

The LLaMA source code in the original repo has been updated for Llama 2: https://github.com/facebookresearch/llama
The model isn't code; to a new model trained on it, it's training data, just like the pirated Books3 dataset (taken from a torrent site) that Facebook used to train LLaMA.

The training code is Apache 2.0 licensed so it can be copied and modified freely, including for commercial purposes. https://github.com/facebookresearch/llama

The link from GP is the CPU-only one, implemented in C++.

The Python + GPU one can be found in the official Facebook repo: https://github.com/facebookresearch/llama (presumably GP thought this was already known to everyone, so they pasted the other link).

The code (including the model) is here:

https://github.com/facebookresearch/llama

I already got the 7B model to generate text using my GPU! The first example prompt generated this:

[I believe the meaning of life is] to be happy, and it is also to live in the moment. I think that is the most important thing. I'm not really a party girl. I'm not a girl's girl. I have a really small group of close girlfriends and that's all I need. I believe in equal rights for everyone. I'm not a rebel. I don't really rebel against anything. I'm a very traditional girl, very loyal. I'm a mum's girl and I'm a dad's girl. People have a right to know what's going on. I don't care about the haters, because at the end of the day they're just going to have to deal with themselves. I've been getting more and more into fashion since I was about 16. I know I'm a little different, but so what? I think that's good. I don't think you should be like everyone else. It's my birthday, and I'll cry if I want to. I've always been a huge fan of fashion, and I've always liked to dress up
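For anyone wanting to reproduce something similar without the official repo's example script, here is a minimal sketch using the Hugging Face transformers port of the weights (the model id below is an assumption and requires approved access to the meta-llama repos):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed HF repo id; requires approved access to the meta-llama weights.
model_id = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Same prompt as the repo's first example.
inputs = tokenizer("I believe the meaning of life is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```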

> so does this mean you got it working on one GPU with an NVLink to a 2nd, or is it really running on all 4 A40s?

It's sharded across all 4 GPUs (as per the readme here: https://github.com/facebookresearch/llama). I'd wait a few weeks to a month for people to settle on a solution for running the model; people are just going to be throwing PyTorch code at the wall and seeing what sticks right now.
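To illustrate what "sharded across all 4 GPUs" means in practice: the larger checkpoints ship as one consolidated.*.pth file per model-parallel rank, and each torchrun worker loads its own shard onto its own GPU. A hypothetical sketch (the directory name, shard count, and layout are assumptions based on the readme):

```python
import os
import torch

# Under `torchrun --nproc_per_node 4 ...` each worker gets a LOCAL_RANK of 0-3
# and loads the matching checkpoint shard onto its own GPU.
rank = int(os.environ.get("LOCAL_RANK", "0"))
shard_path = f"30B/consolidated.{rank:02d}.pth"  # assumed checkpoint layout
shard = torch.load(shard_path, map_location=f"cuda:{rank}")
print(f"rank {rank}: {len(shard)} tensors in this shard")
```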

> To maintain integrity and prevent misuse, we are releasing our model under a noncommercial license focused on research use cases. Access to the model will be granted on a case-by-case basis to academic researchers; those affiliated with organizations in government, civil society, and academia; and industry research laboratories around the world. People interested in applying for access can find the link to the application in our research paper.

The closest you are going to get to the source is here: https://github.com/facebookresearch/llama

It is still unclear whether you are even going to get access to the entire model as open source. Even if you did, you couldn't use it for your commercial product anyway.