What does HackerNews think of llama-cpp-python?

Python bindings for llama.cpp

Language: Python

https://github.com/abetlen/llama-cpp-python has a web server mode that replicates OpenAI's API, IIRC, and the README shows it already has Docker builds; a sketch of launching the server follows below.
I see you’re using gpt4all; do you have a supported way to change the model being used for local inference?
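A minimal sketch of that server mode, assuming the default localhost:8000 address and a placeholder model path (changing the model used for local inference is just a different --model argument):

```python
# Shell (one-time): install the server extra and point it at local weights;
# swapping models is just a different --model path.
#   pip install "llama-cpp-python[server]"
#   python3 -m llama_cpp.server --model ./models/7B/ggml-model-q4_0.bin

import requests  # any OpenAI-compatible client works; plain HTTP shown here

# The server exposes OpenAI-style endpoints, by default at localhost:8000.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"prompt": "Q: Name the planets in the solar system. A:", "max_tokens": 64},
    timeout=120,
)
print(resp.json()["choices"][0]["text"])
```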

A number of apps that are designed for OpenAI's completion/chat APIs can simply point to the endpoints served by llama-cpp-python [0] and function in largely the same way, while using the various models and quantizations supported by llama.cpp (a sketch follows below the link). That lets folks run larger models on the hardware of their choice (including Apple Silicon with Metal acceleration or NVIDIA GPUs), or use other proxies like openrouter.io. I enjoy openrouter.io myself because it supports Anthropic's 100k-context models.

[0]: https://github.com/abetlen/llama-cpp-python
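To make the "simply point to the endpoints" part concrete, here is a rough sketch using the openai Python package; the base URL, placeholder API key, and model name are assumptions about a default local llama-cpp-python server rather than anything prescribed by the project:

```python
from openai import OpenAI

# Aim an ordinary OpenAI client at the local llama-cpp-python server instead
# of api.openai.com; the server ignores the key, but the client requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-local-placeholder")

chat = client.chat.completions.create(
    model="local-model",  # placeholder; the server answers with whatever --model it loaded
    messages=[{"role": "user", "content": "Summarize llama.cpp in one sentence."}],
)
print(chat.choices[0].message.content)
```

The same base-URL override is how apps written against OpenAI's API can be redirected without touching their completion/chat logic.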

Hi, I was also looking into this and am now using https://github.com/abetlen/llama-cpp-python, which tries to be compatible with the OpenAI API. I managed to run AutoGPT with it, but the context window is too small to be really useful: even with it set to 2048 (the maximum), I had to lower AutoGPT's context limit to 1024 for it to work, probably because of some additional prompt wrapping.
You can see for yourself (assuming you have the model weights): https://github.com/abetlen/llama-cpp-python
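For the direct bindings (no server), a minimal sketch of loading weights with the 2048-token context window mentioned above; the model path and quantization filename are placeholders:

```python
from llama_cpp import Llama

# Load local weights and raise the context window to 2048 tokens
# (the library's default context size was 512 at the time).
llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin", n_ctx=2048)

# Calling the model returns an OpenAI-style completion dict.
out = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["Q:", "\n"])
print(out["choices"][0]["text"])
```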

I get around 140 ms per token running a 13B parameter model on a ThinkPad laptop with a 14-core Intel i7-9750 processor. Because it's CPU inference, the initial prompt processing takes longer than it would on a GPU, so total latency is still higher than I'd like. I'm working on some caching solutions that should make this bearable for things like chat.
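On the caching point, llama-cpp-python exposes a prompt cache that keeps the evaluated state of recent prompts, so a follow-up request sharing a prefix (e.g. a growing chat history) skips most of the prompt processing. A rough sketch, with the cache class and model path treated as assumptions that may vary by version:

```python
from llama_cpp import Llama, LlamaCache

llm = Llama(model_path="./models/13B/ggml-model-q4_0.bin", n_ctx=2048)

# Keep evaluated key/value state in memory and reuse it for prompts that
# share a prefix with something already processed.
llm.set_cache(LlamaCache())

history = "User: Hello!\nAssistant:"
first = llm(history, max_tokens=48, stop=["User:"])
history += first["choices"][0]["text"] + "\nUser: Tell me about llama.cpp.\nAssistant:"

# The second prompt's prefix matches the cached state, so only the new turn
# needs prompt processing, which is what makes interactive chat bearable.
second = llm(history, max_tokens=64, stop=["User:"])
print(second["choices"][0]["text"])
```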