I would like to use stuff like this as a side project: buy an Nvidia GeForce GPU, stick it into my 24/7 server, and play around with it in my free time to see what can be done.
The issue with all these AI models is that there's no information on which GPU is enough for which task. I'm absolutely clueless whether a single RTX 4000 SFF with its 20GB of VRAM and only 70W max power draw would be a waste of money, or really something great to experiment on: some ASR with Whisper, images with Stable Diffusion, loading an LLM onto it, or this project here from Facebook.
Renting a GPU in the cloud doesn't seem to be a solution for this use case, where you just want to let something run for a couple of days and see if it's useful for something.
This repo lists very specific VRAM usage for various LLaMA models (with group size, and accounting for the context window, which is often missing) - these are all 4-bit GPTQ quantized models: https://github.com/turboderp/exllama
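If you just want a rough sanity check before looking up the measured numbers in that repo, the arithmetic is simple: 4-bit weights are about 0.5 bytes per parameter, plus an fp16 KV cache that grows with context length. The constants below are my own ballpark (real usage also depends on group size and activation buffers), not figures from exllama:

```python
# Back-of-envelope VRAM estimate for a 4-bit quantized LLaMA model.
# All constants here are rough approximations, not measured values.

def estimate_vram_gb(n_params_b, n_layers, hidden_size, context_len):
    weights = n_params_b * 1e9 * 0.5                          # ~0.5 bytes/param at 4-bit
    kv_cache = 2 * n_layers * hidden_size * context_len * 2   # K and V, fp16
    overhead = 1.0e9                                          # scales/zeros, buffers (guess)
    return (weights + kv_cache + overhead) / 1e9

# LLaMA-13B: ~40 layers, hidden size 5120, 2048-token context
print(f"{estimate_vram_gb(13, 40, 5120, 2048):.1f} GB")       # roughly 9 GB
```

Which already tells you a 20GB card is comfortable for 13B at 4-bit, tight-to-impossible for 33B at full context.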
Note that the latest versions of llama.cpp now have decent GPU support, including a memory tester, and let you load partial models (the first n layers) onto your GPU. In my testing on an RTX 4090 it inferences about 2X slower than exllama, but still about 6X faster than my CPU (Ryzen 5950X).
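A minimal sketch of the partial-offload idea, assuming you use the llama-cpp-python bindings rather than the bare CLI (where the equivalent flag is --n-gpu-layers); the model path is just a placeholder:

```python
# Offload as many layers as your VRAM allows; the rest stays on the CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-13b.ggmlv3.q4_0.bin",  # placeholder path
    n_gpu_layers=40,   # e.g. all 40 layers of a 13B model, if it fits
    n_ctx=2048,
)

out = llm("Q: What is the capital of France? A:", max_tokens=16, stop=["\n"])
print(out["choices"][0]["text"])
```

If the full set of layers doesn't fit, just lower n_gpu_layers until it loads; speed degrades gracefully as more layers fall back to the CPU.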
Again, this is all inference. For training, keep an eye on 4-bit bitsandbytes, coming soon: https://twitter.com/Tim_Dettmers/status/1657010039679512576
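To get a feel for how bitsandbytes already plugs into transformers today, the existing 8-bit load path looks roughly like this; the 4-bit support from the tweet isn't released yet, so treat this only as an analogy for the kind of API to expect (the checkpoint name is just an example):

```python
# Loading a model with 8-bit bitsandbytes quantization via transformers.
# Requires the bitsandbytes and accelerate packages to be installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-7b"  # example checkpoint, pick your own
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # weights quantized on load via bitsandbytes
    device_map="auto",   # place layers on available GPU(s)/CPU automatically
)

inputs = tokenizer("The RTX 4000 SFF has", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```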