What does HackerNews think of text-generation-webui?

A gradio web UI for running Large Language Models like LLaMA, llama.cpp, GPT-J, Pythia, OPT, and GALACTICA.

Language: Python

I use https://github.com/oobabooga/text-generation-webui separately, as I code in VS 2022, VS Code, IntelliJ IDEA, and WebStorm.
> Is there some convenience wrapper around this to drop-in replace the OpenAI api with it?

text-generation-webui has an OpenAI API implementation.
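
Concretely, that usually just means pointing your existing OpenAI client at the local server. A minimal sketch with the official openai Python package (the port and path below are the commonly used defaults for the OpenAI-compatible API and may need adjusting for your setup):

    # Sketch only: talk to a local text-generation-webui instance through its
    # OpenAI-compatible API. Base URL/port are assumptions - match your server.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://127.0.0.1:5000/v1",  # assumed default of the local API
        api_key="sk-local",                   # the local server ignores the key
    )

    resp = client.chat.completions.create(
        model="local-model",  # model name is typically ignored locally
        messages=[{"role": "user", "content": "Summarize LLaMA in one sentence."}],
    )
    print(resp.choices[0].message.content)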

> I heard that you could force the model to output JSON even better than ChatGPT with a specific syntax, and that you have to structure the prompts in a certain way to get ok-ish outputs instead of nonsense.

To get the most out of that (particularly the support for grammars), it would probably be better not to use the OpenAI API implementation and instead use the native API in text-generation-webui (or any other runner for the model that supports grammars or the other features you are looking for).

[0] https://github.com/oobabooga/text-generation-webui

I appreciate the offer but I'm a bit underwater with the amount that I have on my plate right now. We're using a custom solution in-house for all of our training and hosting and it can definitely be daunting to get that far.

I'm not sure how experienced you are in the field, but there are roughly two levels of fine-tuning: full fine-tuning (updating all the weights of the model, which usually requires 2-3x the memory needed for inference) and parameter-efficient fine-tuning. Full fine-tuning allows you to change and update the knowledge contained inside the model.

If the model already has a sufficient understanding of the task and you want to change how it responds, such as to a specific output format, "personality", or "flavor" of output, or to have it already know the kind of task it's performing without including those details in the prompt, I would go with parameter-efficient fine-tuning.

If you're looking to do a one-off training run for a model, you might be able to get away with doing it in something like this: https://github.com/oobabooga/text-generation-webui It's a very easy-to-use project, but it really doesn't allow for the kind of metrics, analysis, or professional-grade hosting you'll want.

vLLM can help with the hosting and is really solid once you have the models fine-tuned. We tried that at first, but its core architecture simply wouldn't work for what we were trying to do, which is why we went fully in-house.

Once you get into a lot of fine-tuning, you're probably going to want to do it directly in PyTorch or the equivalent for your framework of choice. A good resource for seeing how people do this is actually the open-source models published on Hugging Face. Look for some LoRA models, or fine-tunes similar to what you'd like. A lot of people publish their training code and datasets on GitHub, which can be very useful references.
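
For a rough idea of what that looks like, here's a minimal parameter-efficient (LoRA) sketch with the Hugging Face transformers and peft libraries; the model name, target modules, and hyperparameters are illustrative placeholders, not a recipe from any particular project:

    # Minimal LoRA sketch with transformers + peft. Everything below
    # (model name, rank, target modules) is an illustrative assumption.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "meta-llama/Llama-2-7b-hf"  # assumes you have access to the weights
    tokenizer = AutoTokenizer.from_pretrained(base)  # needed to tokenize your dataset
    model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

    # Only the small adapter matrices are trained; the base weights stay frozen.
    lora_cfg = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA-style models
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_cfg)
    model.print_trainable_parameters()  # typically well under 1% of total params

    # From here you'd hand `model` to transformers.Trainer (or your own PyTorch
    # loop) along with your tokenized instruction/response dataset.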

Right now I'd recommend Llama 2 as a base model for most general language-model tasks, assuming you don't cross its commercial use threshold (which is very generous).

Hope this helps!

> Not a big fan of how server-centric the LLM landscape is.

That's just not true. You can get ooba[1] running in no time, and it's 100% made for desktop usage. There are also koboldcpp and other solutions made for desktop users. In fact, most LLM communities are dominated by end users who run these LLMs on their desktops to roleplay.

AMD being awful is orthogonal here.

[1] https://github.com/oobabooga/text-generation-webui

I like this one: https://github.com/oobabooga/text-generation-webui

LLMs are a Swiss Army knife for everyone, not just developers.

I've been playing a lot with Llama 2 13B recently, and it's really not bad at all. With oobabooga[1] you get a proper UI for it and even get an OpenAI-compatible API, so you just change the endpoint in your OpenAI library and it all works. I've been using that to test changes to my bots.

As another poster mentioned though, it's nowhere near the level of GPT-4. It's close enough to GPT-3.5 though; you should try it out!

[1]: https://github.com/oobabooga/text-generation-webui

I've tried running some using https://github.com/oobabooga/text-generation-webui on my Arch box with an AMD GPU, and it worked well and was really handy, until I tried yesterday and now I get some Python error no matter what I do, with the same environment as before and with a new one \o/

There doesn't seem to be an obvious easy way to run them with GPU support.

The bash script downloads llama.cpp, a project which allows you to run LLaMA-based language models on your CPU, and then downloads the 13-billion-parameter GGML version of LLaMA 2. The GGML version is what works with llama.cpp and uses the CPU for inference. There are ways to run models using your GPU, but whether it's worth it depends on your setup.
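
If you'd rather drive llama.cpp from Python than through the bash script, here's a minimal sketch using the llama-cpp-python bindings; the model path, quantization, and thread count are placeholders for whatever file and CPU you actually have:

    # CPU inference sketch via llama-cpp-python. The model path and thread
    # count are assumptions - point them at your own quantized 13B file.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-2-13b.q4_0.bin",  # GGML/GGUF file you downloaded
        n_ctx=2048,    # context window
        n_threads=8,   # roughly your physical core count
    )

    out = llm(
        "Q: What is the capital of France?\nA:",
        max_tokens=64,
        stop=["\n"],
    )
    print(out["choices"][0]["text"])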

I would highly recommend looking into the text-generation-webui project (https://github.com/oobabooga/text-generation-webui). It has a one-click installer and very comprehensive guides for getting models running locally and finding models. The project also has an "api" command flag that lets you use it like you might currently use a web-based service.

Yes, it runs totally fine. I ran it in oobabooga's text-generation-webui. The nice thing about it is that it auto-downloads all the necessary GPU binaries on its own and creates an isolated conda env. I asked the same questions on the official 70B demo and got the same answers. I even got better answers with ooba, since the demo cuts text off early.

Oobabooga: https://github.com/oobabooga/text-generation-webui

Model: TheBloke_Llama-2-70B-chat-GPTQ from https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ

ExLlama_HF loader, GPU split 20,22, context size 2048

On the Chat Settings tab, choose the Instruction template tab and pick Llama-v2 from the instruction template dropdown.

Demo: https://huggingface.co/blog/llama2#demo

The correct answer, as always, is the oobabooga text-generation-webui, which supports all of the relevant backends: https://github.com/oobabooga/text-generation-webui
This is blogspam, but in case people come here because the title sounds interesting, check out text-generation-webui (https://github.com/oobabooga/text-generation-webui). It's a better solution for local LLMs than running in a Jupyter Notebook.
Simon's article didn't show local usage.

Use one of the one-click installers linked in the README of

https://github.com/oobabooga/text-generation-webui

and you're set.

Note that if you have the hardware necessary to run the biggest available model, llama2-70b (for example two RTX 3090s with a total of 48GB of VRAM), there is currently a small bug (with a fix) documented at https://github.com/oobabooga/text-generation-webui/issues/32...

The real gold standard is https://github.com/oobabooga/text-generation-webui

Which includes the llama.cpp backend, and a lot more.

Unfortunately, despite claiming to be the "Automatic1111" of text generation, it doesn't support any of the prompt engineering capabilities (i.e. negative prompts, prompt weights, prompt blending, etc.) available in Automatic1111, even though they're not difficult to implement - https://gist.github.com/Hellisotherpeople/45c619ee22aac6865c...

Luckily for Ooga Booga, no one else supports it either. Why is this? I have no explanation, except that the NLP community doesn't know jack about prompt engineering, which is Kafkaesque.

Well, not really using it per se, just running it and others for fun on my home PC; 4-bit 30B models need about 20GB of regular RAM. CPU inference is kinda slow, but it's not too terrible (e.g. 1 min for a full response on my 6-core 2nd-gen Ryzen).

There's a web GUI for llama.cpp that is really straightforward to set up for Hugging Face models: https://github.com/oobabooga/text-generation-webui

Guanaco is indeed very capable and can replace GPT 3.5 in almost all scenarios, based on my tests.

An easy way to self-host it is to use text-generation-webui[1] and the 33B 4-bit quantized GGML model from TheBloke[2].

[1] https://github.com/oobabooga/text-generation-webui

[2] https://huggingface.co/TheBloke/guanaco-33B-GGML

https://github.com/oobabooga/text-generation-webui/

Works on all platforms, but runs much better on Linux.

Running this in Docker on my 2080 Ti, I can barely fit 13B 4-bit models into 11GB of VRAM, but it works fine and produces around 10-15 tokens/second most of the time. It also has an API that you can use with something like LangChain.
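
For the LangChain part, something along these lines works (assuming a LangChain version that ships the TextGen wrapper and a web UI started with its API enabled; the import path and URL vary by version and setup):

    # Hedged sketch: calling text-generation-webui's API through LangChain's
    # TextGen wrapper. Import path and URL depend on your versions/setup.
    from langchain_community.llms import TextGen  # older releases: from langchain.llms import TextGen

    llm = TextGen(model_url="http://localhost:5000")  # assumed API address
    print(llm.invoke("Write a haiku about VRAM."))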

It supports multiple ways to run the models: purely with CUDA (I think AMD support is coming too) or on CPU with llama.cpp (it's also possible to offload part of the model to GPU VRAM, but the performance is still nowhere near CUDA).

Don't expect open-source models to perform as well as ChatGPT though; they're still pretty limited in comparison. A good place to get models is TheBloke's page - https://huggingface.co/TheBloke. Tom converts popular LLM builds into multiple formats that you can use with textgen, and he's a pillar of the local LLM community.

I'm still learning how to fine-tune/train LoRAs; it's pretty finicky but promising. I'd like to be able to feed personal data into the model and have it reliably answer questions.

In my opinion, these developments are way more exciting than whatever OpenAI is doing. There's no way I'm pushing my chat logs into some corp datacenter, but running locally and storing checkpoints safely would achieve my end goal of having it "impersonate" me on the web.

So far my experience with VicUnlocked-30B has been pleasant. https://huggingface.co/TheBloke/VicUnlocked-30B-LoRA-GGML

Although I haven't had much of my time available for this recently. My recommendation would be to start with https://github.com/oobabooga/text-generation-webui

You will find almost everything you need to know there and on 4chan.org/g/catalog - search for LMG.

What's super confusing is that people on r/LocalLLaMA often just refer to it as "ooba". The actual repo is https://github.com/oobabooga/text-generation-webui, and you notice that oobabooga is actually the username; but a lot of non-technical people who don't use git are also playing around with this stuff, and "text-generation-webui" is so generic as to be completely forgettable, while "oobabooga" is confusing and too long to say, so colloquially it's referred to as ooba.
If you're running the text-generation-webui (https://github.com/oobabooga/text-generation-webui) it has the ability to train LoRAs.

It'll require a beefy GPU, but I've seen some fun examples, like someone training a LoRA on Skyrim books.

You might consider supporting ooba's API, which would give you a lot of support for different things really quickly.

https://github.com/oobabooga/text-generation-webui/

Wow, this is sad on a whole new level. The idea of making a chatbot just to "alleviate loneliness" is such a bold lie that it's ironic at best.

Any person with a brain knows the only people who are going to pay $1 per minute are the ones who want NSFW content, and standing behind OpenAI's policy so that she can pretend to keep her virtue is a joke. Had the chatbot been marketed as answering content questions for fans, I'd give her the benefit of the doubt, but calling it an AI girlfriend isn't even hiding it.

Secondly, if you want an AI girlfriend, just use oobabooga (https://github.com/oobabooga/text-generation-webui) and prompt your own AI girlfriend. I feel no remorse for those who pay a dollar per minute for this garbage either.

ARM-based Macs are the easiest way to get an acceptable performance without the headaches right now, if you can afford the price.

Install https://github.com/oobabooga/text-generation-webui, update PyTorch and llama-cpp-python, and you should be able to run pretty much all models out there, in all formats, both on GPU and CPU. CPU on a Mac gives you the fastest speed, but you should pass the correct --threads argument (check how many performance cores you've got). GPU is slower, but more energy efficient. https://github.com/mlc-ai/mlc-llm gives me much better GPU performance than oobabooga, but they only support a couple of models right now; it's worth following their progress, though.

If you're after raw performance, I suggest using GGML models (meant for llama.cpp, but it's bundled in textgen, so you can use it there with the convenience of a web UI). q4_0 is the fastest quantization, while q5_1 is the best quality right now.

If the GGML is not available, you can generate it quite easily from the safetensors yourself (note that you need enough RAM to load the model in PyTorch, though).

With 16GB of RAM you can run any 13B model, as long as it's quantized to 4/5 bits. 32GB of RAM allows you to run 30/33B models, and 64GB of RAM, 65B models. The 30B and 65B models are way more useful for real-world tasks, but they are more expensive to train, so there aren't as many to choose from compared to 7B/13B. 7B and anything less is a toy in my opinion, while 13B is good enough for experimentation and prototyping.

The quality does not yet quite measure up to ChatGPT (even 3.5), but yes, it is possible.

Probably the fastest way to get started is to look into [0] - this only requires a beta Chromium browser with WebGPU. For a more integrated setup, I am under the impression that [1] is the main tool used.

If you want to take a look at the quality possible before getting started, [2] is an online service by Hugging Face that hosts one of the best of the current generation of open models (OpenAssistant with 30B LLaMA).

[0]: https://mlc.ai/web-llm/

[1]: https://github.com/oobabooga/text-generation-webui

[2]: https://huggingface.co/chat

(I'm not an expert)

> How can someone get into using these models

You can use Gradio (online) or download the weights at https://huggingface.co/lmsys/vicuna-13b-delta-v1.1/tree/main (git will not download them, they're too big, so do it manually) and then load the model in PyTorch and try inference (text generation). But you'll need either a lot of RAM (16GB, 32GB+) or VRAM (a decent card).
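
As a rough sketch of "load the model in PyTorch and try inference" with Hugging Face transformers (note that the delta weights linked above first need to be merged with the base LLaMA weights; the model name below is just a placeholder for whatever ready-to-run checkpoint you end up with):

    # Minimal generation sketch with transformers. The model name is a
    # placeholder - substitute whatever merged checkpoint you actually have.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "lmsys/vicuna-13b-v1.5"  # placeholder; any HF causal LM works
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(
        name,
        torch_dtype=torch.float16,  # halves memory vs fp32
        device_map="auto",          # spreads layers across available GPU(s)/CPU
    )

    prompt = "Summarize the following article in two sentences:\n..."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=100)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))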

> How might I go about using these models for doing things like, say, summarizing news articles or video transcriptions?

Again, you might try them online, or set up a python/bash/powershell script to load the model for you so you can use it. If you can pay, I would recommend RunPod for the shared GPUs.

> When someone tunes a model for a task, what exactly are they doing and how does this 'change' the model?

From my view... not much. "Fine-tuning" means training (tuning) on a specific dataset (fine, as in fine-grained). As I believe (I'm not sure), they just run more epochs on the model with the new data you have provided until they reach a good loss (the model works); that's why quality data is important.

You might try https://github.com/oobabooga/text-generation-webui - they have a pretty easy setup config. Again, you'll need a lot of RAM and a good CPU for inference on CPU, or a GPU.

https://huggingface.co/lmsys/vicuna-13b-delta-v1.1/tree/main

Local LLMs already have a UI intentionally similar to AUTOMATIC1111, including LoRAs, training with checkpoints, and various extensions, including multimodal and experimental long-term memory, etc.

https://github.com/oobabooga/text-generation-webui

This really ought to mention https://github.com/oobabooga/text-generation-webui, which was the first popular UI for LLaMA, and remains one for anyone who runs it on GPU. It is also where GPTQ 4-bit quantization was first enabled in a LLaMA-based chatbot; llama.cpp picked it up later.
I don't think it's possible to point Copilot to other models, and I don't think Microsoft would benefit much from that feature. You could use existing tools [0] to host your own model, which in theory could be used by an extension in your IDE, but I'm not sure an extension like that exists.

[0] https://github.com/oobabooga/text-generation-webui

Been using textgen and downloading tons of models; the models are all over the place. Accuracy and short-term memory are major problems that people are trying to implement workarounds for.

Check out textgen: it has voice in/out, graphics in/out, a memory plugin, an API, plugins, etc., all running locally.

https://github.com/oobabooga/text-generation-webui

The default loader doesn't seem to let you load quantized models, but if you use something like https://github.com/oobabooga/text-generation-webui you can 1) use the model with `--load-in-8bit`, which halves the memory (it then runs on my 24GB consumer card without an issue and would probably fit on a 16GB card), or 2) use a 4-bit quantized model; you can probably run `anon8231489123/vicuna-13b-GPTQ-4bit-128g --model_type LLaMA --wbits 4 --groupsize 128`, although there have been reports that bitsandbytes has problems with 4-bit performance on some cards: https://github.com/TimDettmers/bitsandbytes/issues/181
It's "all" over the news now ;) https://arstechnica.com/information-technology/2023/03/you-c...

Here's results of running on Android: https://github.com/ggerganov/llama.cpp/issues/124

This is about running llama on a Raspberry Pi: https://github.com/ggerganov/llama.cpp/issues/58

...and this is where people have been posting their results running on all sorts of hardware, though I don't see anything Android related: https://github.com/facebookresearch/llama/issues/79

Obviously the larger models won't run on such limited hardware (yet), but one of the next big projects (that I can see) being worked on is converting the models to 3-bit (currently 8-bit and 4-bit are popular), which cuts down the required resources drastically with minimal noticeable loss in quality.

I think starting with FlexGen barely four weeks ago, there have been some pretty crazy LLM projects/forks popping up on GitHub almost daily. With FlexGen I felt like I was still able to stay up to date, but I'm getting close to giving up trying, as things are moving exponentially faster... you know it's crazy when a ton of noobs who have never heard of conda are getting this stuff running (sometimes coming into the FlexGen Discord or posting GitHub issues to get help, though even those are becoming rarer as one-click installers become a thing for some popular ML tools, such as oobabooga's amazing webui tool, which has managed to integrate almost all the hottest new feature forks fairly quickly): https://github.com/oobabooga/text-generation-webui

I recently helped someone get oobabooga running; it has a --listen option to open the web UI to your network, and now he's running LLaMA on his tablet (via his PC).

Another great option is https://github.com/oobabooga/text-generation-webui

The 7B model will run without changes on a 3080. The 13B 4-bit model also runs on a 3080.

This Reddit post has the instructions I followed: https://old.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_...

Might I suggest looking at the story between the 2nd and 10th of March? I've noticed Hacker News hasn't been following certain areas of the effort. A lot of great work has happened, and continues to happen, in close conjunction with text-generation-webui (seriously, most of the cutting edge with 4-bit GPTQ etc. has been closely tied to the project).

>https://github.com/oobabooga/text-generation-webui/

https://github.com/oobabooga/text-generation-webui/ has been running 4-bit LLaMA on Radeon hardware with faster speeds and lower memory requirements than llama.cpp for the past week!
Cards I have seen LLaMA run on in 8-bit and 4-bit include: GTX 1660, RTX 2060, AMD 5700 XT, RTX 3050, RTX 3060, AMD 6900 XT, RTX 2060 12GB, RTX 3060 12GB, RTX 3080, P5000, RTX A2000, RTX 3080 20GB, RTX A4500, RTX A5000, RTX 3090, RTX 4090, RTX 6000, Tesla V100, A100 40GB, A40, RTX A6000, RTX 8000, Titan Ada

Mostly using https://github.com/oobabooga/text-generation-webui/, the AUTOMATIC1111 of textgen.

Massively diverse community working on the AUTOMATIC1111 of textgen at https://github.com/oobabooga/text-generation-webui/

Ooba's textgen webui runs LLaMA 4bit on 8GB VRAM consumer video cards.

KoboldAI (https://github.com/henk717/KoboldAI/) by the creator of the AI Horde of Stable Diffusion fame also supports LLaMA and has forked the Stable Horde for textgen. Currently the horde is running half a dozen LLaMA models, accessible at http://lite.koboldai.net/

People have been running LLaMA in 4bit quickly on cheap hardware with a simple GUI for over a week using https://github.com/oobabooga/text-generation-webui

Just not on Macs (that repo does not support Apple Silicon).

I've gotten great Q&A / conversational results from LLaMA 7B and 13B using https://github.com/oobabooga/text-generation-webui.

    ],
    [
      "How does the Schrodinger's cat thought experiment relate to the principles of quantum mechanics?",
      "In this famous paradox, a cat inside a box is simultaneously both alive and dead due to the nature of subatomic particles. This demonstrates that at the smallest scales of reality, we cannot predict whether something will happen until we observe it happening."
    ],
    [
      "What are some of the potential consequences of climate change on global food security?",
      "According to researchers at Stanford University, rising temperatures could lead to crop failures and reduced harvests for staple crops such as rice, wheat, corn, and soybeans by midcentury."
    ],
    [
      "Can you provide a brief overview of the history of political philosophy, from Ancient Greece to the present day?",
      "Absolutely! Political philosophy is a branch of philosophy concerning the role of government in society. In ancient times, Plato and Aristotle were among the first thinkers to consider the subject in depth. Since then, other influential philosophers such as Thomas Hobbes and John Locke have expanded upon these ideas. Today, contemporary philosophers such as John Rawls continue to debate the merits of different political systems."
    ]
Pardon my ignorance here. I started playing around with text generation today and came across plenty of resources, but it's hard to make any sense of them. I got https://github.com/oobabooga/text-generation-webui working, and instead of being able to answer questions, it revolves around the concept of generating text.

In your case and with ChatGPT-3, does it provide output based on the data you feed it? If that is the case, is there anything related to training the model to use your data?

I am trying to gauge a sense of what is going on.