I think it's poor form that they are taking the GPT-4 name for an unrelated project. After all, the underlying Vicuna is merely a fine-tuned LLaMA. Plus they use the smaller 13B version.
The results look interesting, however.
Here's hoping that they'll add GPTQ 4-bit quantization so the 65B version of the model can be run on 2x 3090.
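For a rough sense of why 2x 3090 should be enough, here's a back-of-the-envelope estimate (my own numbers, not anything from the project; real usage needs extra headroom for activations and the KV cache):

```python
# Rough VRAM estimate for a 4-bit (GPTQ-style) quantized 65B model.
params = 65e9
bytes_per_param = 0.5                        # 4 bits per weight
weights_gib = params * bytes_per_param / 1024**3
print(f"~{weights_gib:.0f} GiB of weights")  # ~30 GiB, which splits across 2x 24GB cards
```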
Someone needs to write a buyer's guide for GPUs and LLMs. For example, what's the best course of action if you don't need to train anything but do want to eventually run whatever model becomes the first local-capable equivalent to ChatGPT? Do you go with Nvidia for the CUDA cores or with AMD for more VRAM? Do you do neither and wait another generation?
There's a subreddit r/LocalLLaMA that seems like the most active community focused on self-hosting LLMs. Here's a recent discussion on hardware: https://www.reddit.com/r/LocalLLaMA/comments/12lynw8/is_anyo...
If you're looking just for local inference, your best bet is probably to buy a consumer GPU w/ 24GB of VRAM (3090 is fine, 4090 has more performance potential), which can fit a 30B parameter 4-bit quantized model that can probably be fine-tuned to ChatGPT (3.5) level quality. If that turns out not to be enough, you can probably add a second card later on.
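If you go that route, loading a 4-bit model on a single 24GB card looks roughly like the sketch below. Note this uses the Hugging Face transformers + bitsandbytes path (on-the-fly 4-bit loading, not GPTQ), it assumes a recent enough version of both libraries, and the model id is just a placeholder:

```python
# Sketch: load a ~30B model in 4-bit on one 24GB GPU via transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "some-org/llama-30b-variant"   # placeholder, not a real repo name
bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=bnb_config,
                                             device_map="auto")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```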
Alternatively, if you have an Apple Silicon Mac, llama.cpp performs surprisingly well and is easy to try for free: https://github.com/ggerganov/llama.cpp
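The usual way is the project's own CLI (see its README), but if you prefer Python, a minimal sketch with the llama-cpp-python bindings looks like this; the quantized model path is a placeholder you'd produce by following the llama.cpp conversion steps:

```python
# Minimal sketch using the llama-cpp-python bindings (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin")  # placeholder path
out = llm("Q: Name the planets in the solar system. A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```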
Current AMD consumer cards have terrible software support and IMO aren't really an option. On Windows you might be able to use SHARK or DirectML ports, but nothing will run out of the box. ROCm still has no RDNA3 support (supposedly coming w/ 5.5, but no release date has been announced), and it's unclear how well it'll work. Basically, unless you'd rather be fighting w/ hardware than playing around w/ ML, it's probably best to avoid AMD for now. (The older RDNA cards also lack tensor cores, so performance would be hobbled even if you could get things running, and lots of software has been written w/ only CUDA in mind.)