PyTorch (+GPU) dependency management and the sheer diversity of Python environment/container types are particularly bad. Programmers may not perceive this, since they're already managing their Python environments and keeping the OS, libraries, containers, and applications in the alignment required for things to work, but it's quite complex. I couldn't do it.

In comparison, I could just type `git clone https://github.com/ggerganov/llama.cpp` and `make`. And it worked. Since then I've gotten llama.cpp's clBLAS partial GPU acceleration working with my AMD RX 580 8GB. Plus, with llama.cpp's CPU mmap support, I can run multiple LLM IRC bot processes against the same model, all sharing one in-RAM copy of the weights for free. Are there even ways to run 2- or 3-bit models in PyTorch implementations, like llama.cpp can? It's pretty rad that I could run a 65B LLaMA in 27 GB of RAM on my 32 GB system (and still get better perplexity than a 30B 8-bit model).
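
For anyone wondering why that sharing is free: when every process mmaps the same weights file read-only, the kernel backs all of the mappings with the same page-cache pages, so N bot processes cost roughly one copy of the model in physical RAM. Here's a minimal C sketch of the mechanism (not llama.cpp's actual loader; the `model.bin` path is made up):

```c
/* Sketch: map a model file read-only so its pages are shared across
 * processes. Any number of processes doing this against the same file
 * reuse the same physical pages via the kernel's page cache. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("model.bin", O_RDONLY);  /* hypothetical model path */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* PROT_READ + MAP_SHARED: pages come straight from the page cache,
       so a second process mapping the same file reuses them. */
    const unsigned char *weights =
        mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (weights == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);  /* the mapping stays valid after close */

    printf("mapped %lld bytes; first byte: 0x%02x\n",
           (long long)st.st_size, weights[0]);

    /* ... inference would read tensors directly out of `weights` ... */
    munmap((void *)weights, st.st_size);
    return 0;
}
```

Start several of these against the same file and the weights only occupy physical RAM once, which is what makes the multi-bot setup cheap.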

> In comparison, I could just type `git clone https://github.com/ggerganov/llama.cpp` and `make`. And it worked.

You're comparing a single, well-managed project that has put effort into user onboarding against every project in a different language's ecosystem, and proclaiming that the entire language/ecosystem is crap.

The only real takeaway is that many projects, independent of language, put way too little effort into onboarding users.