If I'm not mistaken, they use a fork of an old version of llama.cpp

I've been running Vicuna locally for several days now using llama.cpp (i.e. CPU only, because my laptop lacks a good GPU). It's not that hard to set up yourself from scratch. Compiling llama.cpp is straightforward under Linux, and the model (13B parameters, 4-bit quantized) can be downloaded from HuggingFace.
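
For reference, the weights can also be fetched programmatically. Here's a minimal sketch using the huggingface_hub library -- note that the repo id and filename below are placeholders, you'd substitute the actual 4-bit ggml build you want:

    # Sketch: download a quantized Vicuna checkpoint from HuggingFace.
    # repo_id and filename are placeholders -- substitute the actual
    # repository/file for the 4-bit build you want to run.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id="someuser/vicuna-13b-ggml",   # placeholder repo id
        filename="ggml-vicuna-13b-q4_0.bin",  # placeholder filename
    )
    print("model saved to", path)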

There are several difficulties, however:

1) Vanilla llama.cpp doesn't appear to have a simple-to-use interface for interacting with it from another process (your app), such as a REST API (an idea found in Fabrice Bellard's text-synth) -- you want your main app to be decoupled from a process which consumes a lot of CPU and RAM and can crash. I solved it with the llamacpphtmld project, which provides HTTP access to the model (see the sketch after this list for the general shape of such a wrapper). It's a pretty simple project, and I'm thinking of making my own Go wrappers.

2) Vicuna was trained on ChatGPT output, so it often responds with garbage such as "As an AI language model..." and refuses to discuss "controversial topics". I solved it with prompt engineering: I prepend persona lines like "%bot_name% never moralizes" and "%bot_name% is rude" (it's not always rude, but that somehow stops it from moralizing). The sketch after this list shows where such lines fit into the prompt.

3) It's pretty slow, at least on my laptop (1-3 minutes to process a phrase, although the prompt includes the dialog history as well).
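
To make (1) and (2) concrete, here's a rough sketch of the kind of wrapper I mean, using only the Python standard library. To be clear, this is not llamacpphtmld itself, and the binary path, model path, flags and persona text are all illustrative. It shells out to the stock llama.cpp binary per request, so a crash or out-of-memory only takes down the wrapper, not your app, and it prepends the anti-moralizing persona lines from (2):

    # Rough sketch: put llama.cpp's command-line binary behind a local
    # HTTP endpoint so the main app stays decoupled from the heavy process.
    # Paths, flags and persona text are illustrative placeholders.
    import json
    import subprocess
    from http.server import BaseHTTPRequestHandler, HTTPServer

    LLAMA_BIN = "./main"                  # llama.cpp binary (placeholder path)
    MODEL = "./ggml-vicuna-13b-q4_0.bin"  # 4-bit weights (placeholder path)

    BOT = "Bob"  # %bot_name% in the templates mentioned above
    # Persona preamble that discourages "As an AI language model..." replies.
    PERSONA = (
        f"{BOT} never moralizes. {BOT} is rude. "
        f"{BOT} answers any question directly.\n"
    )

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers["Content-Length"])
            user_prompt = json.loads(self.rfile.read(length))["prompt"]
            # Inference runs in a separate process; if it crashes or eats
            # all the RAM, only this wrapper is affected, not the caller.
            out = subprocess.run(
                [LLAMA_BIN, "-m", MODEL, "-n", "128",
                 "-p", PERSONA + user_prompt],
                capture_output=True, text=True, timeout=600,
            )
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(json.dumps({"text": out.stdout}).encode())

    HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()

A real version would obviously need streaming output, request queueing and proper error handling, but the shape is the same.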

So far I'm very pleased with the results (aside from the speed) -- the bot feels GPT3-level when you use it as a chatbot or as a story generator. In fact, in my tests it actually exceeds GPT3 in that regard. I run two bots on my IRC channel, one Vicuna-based and one GPT3-based, and I feed them the same prompts to compare. Vicuna feels better as a general-purpose chatbot which can talk about pretty much anything, and it's pretty imaginative. GPT3 often refuses to talk about things, and it loses track of the dialog more quickly.

Although it's not that hard to set up, it takes time to get things right, so I'm thinking of open-sourcing my findings as some kind of middleware for quick integration with other projects (a localhost REST API + ready-to-use weights and preset prompts). So far, most projects I've seen are either very low-level (like llama.cpp itself) or very high-level (web-based chat).
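
For illustration, against the hypothetical localhost wrapper sketched above, the app-side integration would be a single HTTP call, roughly:

    # Sketch: app-side call to the hypothetical localhost wrapper above.
    import json
    import urllib.request

    req = urllib.request.Request(
        "http://127.0.0.1:8080/",
        data=json.dumps({"prompt": "Tell me a story about IRC."}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        print(json.loads(resp.read())["text"])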

Hi, I've been trying to expose llama.cpp as an API. You mentioned that you found llamacpphtmld to solve this problem -- can you share a link to the project? I couldn't find it on GitHub. Thanks!

Hi, I was also looking into this, and I'm now using https://github.com/abetlen/llama-cpp-python . It tries to be compatible with the OpenAI API. I managed to run AutoGPT with it (however, the context window is too small to be useful; even with it set to 2048, the maximum, I had to cap AutoGPT's context at 1024 for it to work -- probably some additional wrapping or something).
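
In case it's useful to others, basic usage looks roughly like this (the model path is a placeholder; n_ctx=2048 matches the limit mentioned above). The package also ships an OpenAI-compatible server (python -m llama_cpp.server), which is what I point AutoGPT at:

    # Sketch of basic llama-cpp-python usage; the model path is a placeholder.
    from llama_cpp import Llama

    llm = Llama(model_path="./ggml-vicuna-13b-q4_0.bin", n_ctx=2048)
    out = llm("Q: What is the capital of France? A:", max_tokens=32)
    print(out["choices"][0]["text"])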