For my fellow Windows shills, here's how you actually build it on Windows:
Before you start:
1. (For Nvidia GPU users) Install the CUDA Toolkit: https://developer.nvidia.com/cuda-downloads
2. Download the model somewhere: https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolv...
In Windows Terminal with PowerShell:
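If you'd rather script the download, something like this should work; note the full `resolve/main/...` URL here is my guess based on the usual Hugging Face layout (the link above is truncated), so double-check it against the repo page:
# Hypothetical full URL - verify it on the Hugging Face repo page linked above
$url = "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolve/main/llama-2-13b-chat.ggmlv3.q4_0.bin"
Invoke-WebRequest -Uri $url -OutFile "$HOME\Downloads\llama-2-13b-chat.ggmlv3.q4_0.bin"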
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release
cd bin/Release
mkdir models
mv Folder\Where\You\Downloaded\The\Model .\models
.\main.exe -m .\models\llama-2-13b-chat.ggmlv3.q4_0.bin --color -p "Hello, how are you, llama?" 2> $null
`-DLLAMA_CUBLAS=ON` enables CUDA (cuBLAS) acceleration; omit it for a CPU-only build. `2> $null` redirects the debug messages printed to stderr to a null stream so they don't spam your terminal.
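One gotcha, if I remember the cuBLAS build right: it won't actually offload the model to your GPU unless you ask, so you probably also want `-ngl` (`--n-gpu-layers`) at run time. Something like this; the layer count is a guess (the 13B model has 40 layers, and how many fit depends on your VRAM):
.\main.exe -m .\models\llama-2-13b-chat.ggmlv3.q4_0.bin -ngl 40 --color -p "Hello, how are you, llama?" 2> $null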
Here's a PowerShell function you can put in your $PROFILE so that you can just run prompts with `llama "prompt goes here"`:
function llama {
    # "$args" joins everything typed after `llama` into a single prompt string
    .\main.exe -m .\models\llama-2-13b-chat.ggmlv3.q4_0.bin -p "$args" 2> $null
}
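Since those relative paths only work from the build directory, a variant with absolute paths is more practical in a profile function. The paths below are placeholders for wherever you cloned and built:
function llama {
    # Hypothetical locations - point these at your own build output and model file
    $exe   = "C:\src\llama.cpp\build\bin\Release\main.exe"
    $model = "C:\src\llama.cpp\build\bin\Release\models\llama-2-13b-chat.ggmlv3.q4_0.bin"
    & $exe -m $model -p "$args" 2> $null
}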
Adjust your paths as necessary. It has a tendency to talk to itself.

Your commands assume the model is a .bin file (so I guess there must be a way to convert the PyTorch .pth model to a .bin file). How can I do this, and what is the difference between the two formats?
The Facebook repo provides commands for running the models, but those commands don't work on my Windows machine: "NOTE: Redirects are currently not supported in Windows or MacOs. [W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to ...."
The Facebook repo doesn't say which OS it's meant for, so I assumed it would work on Windows too. But if that can work, why would anyone need ggerganov's llama.cpp? I'm new to all of this and easily confused, so any help is appreciated.