What does HackerNews think of text-generation-webui?
A gradio web UI for running Large Language Models like LLaMA, llama.cpp, GPT-J, Pythia, OPT, and GALACTICA.
text-generation-webui has an OpenAI API implementation.
> I heard that you could force the model to output JSON even better than ChatGPT with a specific syntax, and that you have to structure the prompts in a certain way to get ok-ish outputs instead of nonsense.
To get the most out of that (particularly the support for grammars), it would probably be better not to use the OpenAI API implementation and instead use the native API in text-generation-webui (or any other runner for the model that supports grammars or the other features you are looking for).
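As a rough sketch of what grammar-constrained JSON output can look like: the grammar below is llama.cpp-style GBNF, and the endpoint path and the grammar_string parameter name are assumptions that have changed between versions of the project, so check the current API docs/examples before copying this.

```python
# Rough sketch, not the project's documented API: request JSON-constrained
# output from a locally running text-generation-webui server. The endpoint
# path and the "grammar_string" field are assumptions -- verify them against
# your version's API examples.
import requests

# Minimal llama.cpp-style GBNF grammar that only admits {"name": ..., "age": ...}
JSON_GRAMMAR = r'''
root   ::= "{" ws "\"name\":" ws string "," ws "\"age\":" ws number ws "}"
string ::= "\"" [a-zA-Z0-9 ]* "\""
number ::= [0-9]+
ws     ::= [ \t\n]*
'''

payload = {
    "prompt": "Extract the person as JSON: Alice is 31 years old.\n",
    "max_new_tokens": 100,
    "temperature": 0.2,
    "grammar_string": JSON_GRAMMAR,  # assumed parameter name
}
resp = requests.post("http://127.0.0.1:5000/api/v1/generate", json=payload, timeout=120)
print(resp.json())
```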
I'm not sure how experienced you are in the field, but there are roughly two levels of fine-tuning: full fine-tuning (updating all the weights of the model, which usually requires 2-3x the memory needed for inference) and parameter-efficient fine-tuning (e.g. LoRA). Full fine-tuning allows you to change and update the knowledge contained inside the model.
If the model already has a sufficient understanding of the content of the task and you just want to change how it responds (a specific output format, "personality" or "flavor" of output, or having it already know the kind of task it's performing without including those details in the prompt), I would go with parameter-efficient fine-tuning.
If you're looking to do a one-off train for a model, you might be able to get away with doing it in something like this: https://github.com/oobabooga/text-generation-webui It's a very easy-to-use project, but it really doesn't provide the kind of metrics, analysis, or professional-grade hosting you'll want.
vLLM can help with the hosting and is really solid once you have the models fine-tuned. We tried it at first, but its core architecture simply wouldn't work for what we were trying to do, which is why we went fully in-house.
Once you get into a lot of fine-tuning, you're probably going to want to do it directly in PyTorch or the equivalent for your language of choice. A good resource for seeing how people do this is actually the open-source models published on Hugging Face. Look for some LoRA models, or fine-tunes similar to what you'd like. A lot of people publish their training code and datasets on GitHub, which can be very useful references.
Right now I'd recommend Llama 2 as a base model for most general language model tasks, as long as you don't cross their commercial use threshold (which is very, very generous).
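For a sense of what a parameter-efficient fine-tune looks like in code, here is a minimal sketch using Hugging Face transformers + peft. The base model, dataset file, and hyperparameters are placeholders for illustration, not a recipe:

```python
# Minimal LoRA fine-tuning sketch with transformers + peft. Base model,
# dataset file, and hyperparameters are placeholders for illustration only.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Llama-2-7b-hf"  # assumed base model (requires license acceptance)
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# Attach low-rank adapters to a few attention projections instead of
# updating all of the weights.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

# train.jsonl is assumed to contain {"text": "..."} rows.
ds = load_dataset("json", data_files="train.jsonl")["train"]
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=512),
            batched=True, remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments("lora-out", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=3,
                           learning_rate=2e-4, logging_steps=10),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()

model.save_pretrained("lora-out")  # saves only the small adapter weights
```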
Hope this helps!
That's just not true. You can get ooba[1] running in no time, which is 100% made for desktop usage. There's also koboldcpp and other solutions also made for desktop users. In fact, most LLM communities are dominated by end users who run these LLMs on their desktops to roleplay.
AMD being awful is orthogonal here.
LLMs are a Swiss Army knife for everyone, not just developers.
As another poster mentioned though, it's nowhere near the level of GPT-4. It's close enough to GPT-3.5 though; you should try it out!
There doesn't seem to be an obvious easy way to run them with GPU support.
I would highly recommend looking into the text-generation-webui project (https://github.com/oobabooga/text-generation-webui). It has a one-click installer and very comprehensive guides for getting models running locally and where to find models. The project also has an "api" command flag to let you use it like you might use a web-based service currently.
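As a hedged illustration of that API flag: recent versions of the project expose an OpenAI-compatible endpoint when the API is enabled, so you can point the standard openai client at it. The port, path, and model name below are assumptions about a typical local setup:

```python
# Sketch: talking to a local text-generation-webui instance through its
# OpenAI-compatible API. Port/path and the model name are placeholders that
# depend on your version and launch flags.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:5000/v1", api_key="not-needed")
reply = client.chat.completions.create(
    model="local-model",  # the server answers with whatever model is loaded
    messages=[{"role": "user", "content": "Summarize llama.cpp in one sentence."}],
)
print(reply.choices[0].message.content)
```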
Oobabooga: https://github.com/oobabooga/text-generation-webui
Model: TheBloke_Llama-2-70B-chat-GPTQ from https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ
ExLlama_HF loader, GPU split 20,22, context size 2048.
On the Chat settings tab, choose the Instruction template tab and pick Llama-v2 from the instruction template dropdown.
Use one of the one-click installers linked in the README of
https://github.com/oobabooga/text-generation-webui
and you're set.
Note that in case you have the hardware necessary to run the biggest available model, Llama 2 70B (for example, two RTX 3090s with a total of 48GB of VRAM), there is currently a small bug (with a fix) documented at https://github.com/oobabooga/text-generation-webui/issues/32...
Which includes the llama.cpp backend, and a lot more.
Unfortunately, despite claiming to be the "Automatic1111" of text generation, it doesn't support any of the prompt engineering capabilities (i.e. negative prompts, prompt weights, prompt blending, etc.) available in Automatic1111, even though they aren't difficult to implement - https://gist.github.com/Hellisotherpeople/45c619ee22aac6865c...
Luckily for Ooga Booga, no one else supports it either. Why this is, I have no explanation, except that the NLP community doesn't know jack about prompt engineering, which is Kafkaesque.
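For readers unfamiliar with what a "negative prompt" would even mean for text: here is a conceptual sketch of classifier-free-guidance-style decoding, where logits conditioned on the prompt are pushed away from logits conditioned on a negative prompt at each step. This is only an illustration of the idea (gpt2 and greedy decoding are stand-ins), not code from any of the tools mentioned:

```python
# Conceptual sketch of "negative prompting" for text generation via a
# classifier-free-guidance-style blend of logits. gpt2 and greedy decoding
# are placeholders so the sketch stays small and runnable.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def guided_generate(prompt, negative_prompt, steps=30, scale=1.5):
    ids = tok(prompt, return_tensors="pt").input_ids           # positive context
    neg = tok(negative_prompt, return_tensors="pt").input_ids  # negative context
    for _ in range(steps):
        with torch.no_grad():
            pos_logits = model(ids).logits[:, -1, :]
            neg_logits = model(neg).logits[:, -1, :]
        # Push the distribution toward the prompt and away from the negative prompt.
        logits = neg_logits + scale * (pos_logits - neg_logits)
        nxt = logits.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, nxt], dim=-1)
        neg = torch.cat([neg, nxt], dim=-1)
    return tok.decode(ids[0], skip_special_tokens=True)

print(guided_generate("The food at the restaurant was", "The food was terrible and bland"))
```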
There's a web gui for llama.cpp that is really straightforward to set up for huggingface models: https://github.com/oobabooga/text-generation-webui
Easy way to self-host it is to use text-generation-webui[1] and 33B 4-bit quantized GGML model from TheBloke[2].
Works on all platforms, but runs much better on Linux.
I'm running this in Docker on my 2080 Ti; it can barely fit 13B 4-bit models into 11GB of VRAM, but it works fine and produces around 10-15 tokens/second most of the time. It also has an API that you can use with something like LangChain.
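A rough sketch of the LangChain angle, assuming the server was started with its API enabled; the TextGen integration class and the URL/port below are my assumptions, so double-check them against your versions:

```python
# Sketch: using a local text-generation-webui API as an LLM in LangChain.
# The integration class and the URL/port are assumptions to verify locally.
from langchain_community.llms import TextGen

llm = TextGen(model_url="http://127.0.0.1:5000")
print(llm.invoke("Write a one-line summary of what a KV cache does."))
```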
It supports multiple ways to run the models: purely with CUDA (I think AMD support is coming too) or on CPU with llama.cpp (it's also possible to offload part of the model to GPU VRAM, but the performance is still nowhere near CUDA).
Don't expect open-source models to perform as well as ChatGPT though; they're still pretty limited in comparison. A good place to get models is TheBloke's page - https://huggingface.co/TheBloke. Tom converts popular LLM builds into multiple formats that you can use with textgen, and he's a pillar of the local LLM community.
I'm still learning how to fine-tune/train LoRAs, it's pretty finicky, but promising, I'd like to be able to feed personal data into the model and have it reliably answer questions.
In my opinion, these developments are way more exciting than whatever OpenAI is doing. No way I'm pushing my chatlogs into some corp datacenter, but running locally and storing checkpoints safely would achieve my end-goal of having it "impersonate" myself on the web.
Although I haven't had much time available for this recently. My recommendation would be to start with https://github.com/oobabooga/text-generation-webui
You will find almost everything you need to know there and on 4chan.org/g/catalog - search for LMG.
It'll require a beefy GPU but I've seen some fun examples like someone training a LoRA on Skyrim books.
Anyone with a brain knows the only people who are going to pay $1 per minute are the ones who want NSFW content, and standing behind OpenAI's policy so that she can pretend to keep her virtue is a joke. Had the chatbot been marketed as answering content questions for fans, I'd give her the benefit of the doubt, but calling it an AI girlfriend isn't even hiding it.
Secondly, if you want an AI girlfriend, just use oobabooga (https://github.com/oobabooga/text-generation-webui) and prompt your own. I feel no remorse for those who pay a dollar per minute for this garbage either.
Install https://github.com/oobabooga/text-generation-webui, update PyTorch and llama-cpp-python, and you should be able to run pretty much all models out there, in all formats, both on GPU and CPU. CPU on a Mac gives you the fastest speed, but you should pass the correct --threads argument (check how many performance cores you've got). GPU is slower, but more energy efficient. https://github.com/mlc-ai/mlc-llm gives me much better GPU performance than oobabooga, but they only support a couple of models right now; it's worth following their progress though.
If you're after raw performance, I suggest using GGML models (meant for llama.cpp, but it's bundled in textgen, so you can use it there with the convenience of a web UI). q4_0 is the fastest quantization, while q5_1 is the best quality right now.
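For reference, here is a minimal sketch of loading a quantized GGML model with llama-cpp-python directly (the same backend textgen bundles); the model path, thread count, and context size are placeholders for your own setup:

```python
# Sketch: running a 4-bit quantized model with llama-cpp-python. The file
# path, thread count, and context size are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-13b.ggmlv3.q4_0.bin",  # a quantized file from TheBloke
    n_threads=8,      # set to your number of performance cores
    n_gpu_layers=0,   # raise to offload some layers to GPU VRAM
    n_ctx=2048,
)
out = llm("Q: What does 4-bit quantization trade away? A:", max_tokens=64, stop=["\n"])
print(out["choices"][0]["text"])
```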
If the GGML is not available, you can generate it quite easily from the safetensors yourself (note that you need enough RAM to load the model in PyTorch, though).
With 16GB RAM you can run any 13B model, as long as it's quantized to 4/5 bits. 32GB RAM allows you to run 30/33B models, and 64GB RAM, 65B models. The 30B and 65B models are way more useful for real-world tasks, but they are more expensive to train, so there aren't as many to choose from compared to 7B/13B. 7B and anything smaller is a toy in my opinion, while 13B is good enough for experimentation and prototyping.
Probably the fastest way to get started is to look into [0] - this only requires a beta chromium browser with WebGPU. For a more integrated setup, I am under the impression [1] is the main tool used.
If you want to take a look at the quality possible before getting started, [2] is an online service by Hugging Face that hosts one of the best of the current generation of open models (OpenAssistant w/ 30B LLaMa)
[0]: https://mlc.ai/web-llm/ [1]: https://github.com/oobabooga/text-generation-webui [2]: https://huggingface.co/chat
> How can someone get into using these models
You can use Gradio (online) or download the weights manually (git will not download them, they're too big) from https://huggingface.co/lmsys/vicuna-13b-delta-v1.1/tree/main, then load the model in PyTorch and try inference (text generation). But you'll need either a lot of RAM (16GB, 32GB+) or VRAM (a capable card).
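A minimal sketch of that "load in PyTorch and try inference" step with transformers, assuming you've already turned the delta into merged weights in a local directory (the path and generation settings are placeholders):

```python
# Sketch: loading locally downloaded weights with transformers and generating
# text. "./vicuna-13b" is a placeholder directory with merged weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "./vicuna-13b"
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.float16, device_map="auto")  # fp16 to fit in less memory

prompt = "Summarize the following article in three bullet points:\n..."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```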
> How might I go about using these models for doing things like, say, summarizing news articles or video transcriptions?
Again, you might try online, or set up a Python/bash/PowerShell script to load the model for you so you can use it. If you can pay, I would recommend RunPod for the shared GPUs.
> When someone tunes a model for a task, what exactly are they doing and how does this 'change' the model?
From my view ... not much. "Fine-tuning" means training (tuning) on a specific dataset (fine, as in fine-grained). As I believe (I'm not sure), they just run more epochs on the model with the new data you have provided until they reach a good loss (the model works); that's why quality data is important.
You might try https://github.com/oobabooga/text-generation-webui - they have a pretty easy setup. Again, you'll need a lot of RAM and a good CPU for inference on CPU, or a GPU.
https://huggingface.co/lmsys/vicuna-13b-delta-v1.1/tree/main
Check out textgen; it has voice in/out, graphics in/out, a memory plugin, an API, plugins, etc., all running locally.
Here's results of running on Android: https://github.com/ggerganov/llama.cpp/issues/124
This is about running llama on a Raspberry Pi: https://github.com/ggerganov/llama.cpp/issues/58
...and this is where people have been posting their results running on all sorts of hardware, though I don't see anything Android related: https://github.com/facebookresearch/llama/issues/79
Obviously the larger models won't run on such limited hardware (yet), but one of the next big projects being worked on (that I can see) is converting the models to 3-bit (currently 8-bit and 4-bit are popular), which cuts down the required resources drastically with minimal noticeable loss in quality.
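To see why the bit-width matters so much, here's a back-of-the-envelope calculation of weight memory alone (ignoring the KV cache and other overheads), with the model size chosen purely as an illustrative example:

```python
# Rough weight-only memory estimate for a quantized model; real usage is
# higher because of the KV cache, activations, and runtime overhead.
def weight_gb(params_billion: float, bits: int) -> float:
    return params_billion * 1e9 * bits / 8 / 1e9  # parameters * bits -> gigabytes

for bits in (16, 8, 4, 3):
    print(f"13B model at {bits}-bit: ~{weight_gb(13, bits):.1f} GB of weights")
# Roughly: 16-bit ~26 GB, 8-bit ~13 GB, 4-bit ~6.5 GB, 3-bit ~4.9 GB
```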
I think starting with FlexGen, barely four weeks ago, there have been some pretty crazy LLM projects/forks popping up on GitHub almost daily. With FlexGen I felt like I was still able to stay up to date, but I'm getting close to giving up trying, as things are moving exponentially faster. You know it's crazy when a ton of noobs who have never heard of conda are getting this stuff running (sometimes coming into the FlexGen Discord or posting GitHub issues to get help), though even those are becoming rarer as one-click installers become a thing for some popular ML tools, such as oobabooga's amazing webui, which has managed to integrate almost all the hottest new feature forks fairly quickly: https://github.com/oobabooga/text-generation-webui
I recently helped someone get oobabooga running; it has a --listen option that opens the webui to your network, and now he's running LLaMA on his tablet (via his PC).
The 7B model will run without changes on a 3080. The 13B 4-bit model also runs on a 3080.
This Reddit post has the instructions I followed: https://old.reddit.com/r/LocalLLaMA/comments/11o6o3f/how_to_...
Mostly using https://github.com/oobabooga/text-generation-webui/, the AUTOMATIC1111 of textgen.
Ooba's textgen webui runs LLaMA 4bit on 8GB VRAM consumer video cards.
KoboldAI (https://github.com/henk717/KoboldAI/) by the creator of the AI Horde of Stable Diffusion fame also supports LLaMA and has forked the Stable Horde for textgen. Currently the horde is running half a dozen LLaMA models, accessible at http://lite.koboldai.net/
Just not on Macs. (that repo does not support Apple Silicon)
  ],
  [
    "How does the Schrodinger's cat thought experiment relate to the principles of quantum mechanics?",
    "In this famous paradox, a cat inside a box is simultaneously both alive and dead due to the nature of subatomic particles. This demonstrates that at the smallest scales of reality, we cannot predict whether something will happen until we observe it happening."
  ],
  [
    "What are some of the potential consequences of climate change on global food security?",
    "According to researchers at Stanford University, rising temperatures could lead to crop failures and reduced harvests for staple crops such as rice, wheat, corn, and soybeans by midcentury."
  ],
  [
    "Can you provide a brief overview of the history of political philosophy, from Ancient Greece to the present day?",
    "Absolutely! Political philosophy is a branch of philosophy concerning the role of government in society. In ancient times, Plato and Aristotle were among the first thinkers to consider the subject in depth. Since then, other influential philosophers such as Thomas Hobbes and John Locke have expanded upon these ideas. Today, contemporary philosophers such as John Rawls continue to debate the merits of different political systems."
  ]
In your case, and with ChatGPT-3, does it provide output based on the data you feed it? If that is the case, is there anything related to training the model to use your data?
I am trying to gauge a sense of what is going on.