What does HackerNews think of llama-cpp-python?

Python bindings for llama.cpp

Language: Python

https://github.com/abetlen/llama-cpp-python has a web server mode that replicates OpenAI's API, IIRC, and the README shows it already has Docker builds; a sketch of launching the server follows below.
I see you’re using gpt4all; do you have a supported way to change the model being used for local inference?
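A minimal sketch of that server mode, assuming the default localhost:8000 address and a placeholder model path (changing the model used for local inference is just a different --model argument):

```python
# Shell (one-time): install the server extra and point it at local weights;
# swapping models is just a different --model path.
#   pip install "llama-cpp-python[server]"
#   python3 -m llama_cpp.server --model ./models/7B/ggml-model-q4_0.bin

import requests  # any OpenAI-compatible client works; plain HTTP shown here

# The server exposes OpenAI-style endpoints, by default at localhost:8000.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"prompt": "Q: Name the planets in the solar system. A:", "max_tokens": 64},
    timeout=120,
)
print(resp.json()["choices"][0]["text"])
```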

A number of apps that are designed for OpenAI's completion/chat APIs can simply point to the endpoints served by llama-cpp-python [0] and function in largely the same way, while using the various models and quantizations supported by llama.cpp (a sketch follows below the link). That lets folks run larger models on the hardware of their choice (including Apple Silicon with Metal acceleration or NVIDIA GPUs), or use other proxies like openrouter.io. I enjoy openrouter.io myself because it supports Anthropic's 100k-context models.

[0]: https://github.com/abetlen/llama-cpp-python
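To make the "simply point to the endpoints" part concrete, here is a rough sketch using the openai Python package; the base URL, placeholder API key, and model name are assumptions about a default local llama-cpp-python server rather than anything prescribed by the project:

```python
from openai import OpenAI

# Aim an ordinary OpenAI client at the local llama-cpp-python server instead
# of api.openai.com; the server ignores the key, but the client requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-local-placeholder")

chat = client.chat.completions.create(
    model="local-model",  # placeholder; the server answers with whatever --model it loaded
    messages=[{"role": "user", "content": "Summarize llama.cpp in one sentence."}],
)
print(chat.choices[0].message.content)
```

The same base-URL override is how apps written against OpenAI's API can be redirected without touching their completion/chat logic.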

Hi, I was also looking into this and am now using https://github.com/abetlen/llama-cpp-python, which tries to be compatible with the OpenAI API. I managed to run AutoGPT with it, but the context window is too small to be really useful: even with it set to 2048 (the maximum), I had to lower AutoGPT's context limit to 1024 for it to work, probably because of some additional prompt wrapping.
You can see for yourself (assuming you have the model weights): https://github.com/abetlen/llama-cpp-python
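For the direct bindings (no server), a minimal sketch of loading weights with the 2048-token context window mentioned above; the model path and quantization filename are placeholders:

```python
from llama_cpp import Llama

# Load local weights and raise the context window to 2048 tokens
# (the library's default context size was 512 at the time).
llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin", n_ctx=2048)

# Calling the model returns an OpenAI-style completion dict.
out = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["Q:", "\n"])
print(out["choices"][0]["text"])
```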

I get around 140 ms per token running a 13B parameter model on a ThinkPad laptop with a 14-core Intel i7-9750 processor. Because it's CPU inference, the initial prompt processing takes longer than it would on a GPU, so total latency is still higher than I'd like. I'm working on some caching solutions that should make this bearable for things like chat.
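On the caching point, llama-cpp-python exposes a prompt cache that keeps the evaluated state of recent prompts, so a follow-up request sharing a prefix (e.g. a growing chat history) skips most of the prompt processing. A rough sketch, with the cache class and model path treated as assumptions that may vary by version:

```python
from llama_cpp import Llama, LlamaCache

llm = Llama(model_path="./models/13B/ggml-model-q4_0.bin", n_ctx=2048)

# Keep evaluated key/value state in memory and reuse it for prompts that
# share a prefix with something already processed.
llm.set_cache(LlamaCache())

history = "User: Hello!\nAssistant:"
first = llm(history, max_tokens=48, stop=["User:"])
history += first["choices"][0]["text"] + "\nUser: Tell me about llama.cpp.\nAssistant:"

# The second prompt's prefix matches the cached state, so only the new turn
# needs prompt processing, which is what makes interactive chat bearable.
second = llm(history, max_tokens=64, stop=["User:"])
print(second["choices"][0]["text"])
```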