Could someone with experience explain: what's the theoretical minimum hardware requirement for LLaMA 7B, 13B, etc. that still produces output on the order of <1 sec/token?

It seems like there are tricks we can pull, like running in FP16 or applying some form of quantization, to bring the requirements down.

At the end of the day, how much overhead is left to squeeze out? What can I expect to have running on 16 GB of RAM with a 3080 and a midrange AMD processor?
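For rough sizing, the weight memory is essentially parameter count times bytes per parameter, plus some headroom for activations and the KV cache. A minimal sketch of that arithmetic (the parameter counts are nominal and the ~20% overhead factor is an assumption, not a measurement):

```python
# Rough VRAM estimate: parameter count * bytes per parameter, plus a fudge
# factor for activations, KV cache, and loader overhead. All figures are
# approximations, not measurements.
PARAMS = {"7B": 7e9, "13B": 13e9, "30B": 30e9, "65B": 65e9}  # nominal sizes
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}
OVERHEAD = 1.2  # assumed ~20% extra; varies with context length and framework

for name, n in PARAMS.items():
    row = ", ".join(
        f"{fmt}: {n * b * OVERHEAD / 2**30:5.1f} GiB"
        for fmt, b in BYTES_PER_PARAM.items()
    )
    print(f"{name:>3} -> {row}")
```

By that estimate, 13B at 4-bit lands around 7-8 GiB, which is why it can fit on a 10-12 GB card while the fp16 version (roughly 26 GB of weights alone) cannot.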

The 4-bit GPTQ LLaMA models are the current top performers. This repo has done a lot of the heavy lifting: https://github.com/qwopqwop200/GPTQ-for-LLaMa
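As for what "4-bit" buys you: each weight is stored as a 4-bit integer plus a shared scale per group, which is what shrinks a 13B fp16 checkpoint from ~26 GB to roughly 7 GB. The sketch below is only naive round-to-nearest group quantization for illustration; GPTQ's actual algorithm compensates the rounding error using approximate second-order information from calibration data, which is why it holds up much better at 4 bits.

```python
import numpy as np

def quantize_4bit(w, group_size=128):
    """Naive round-to-nearest 4-bit quantization, one float scale per group.
    GPTQ itself quantizes weights block by block and compensates the rounding
    error using second-order information; this only shows the storage idea."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # int4 range -8..7
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q.astype(np.float32) * scale).ravel()

w = np.random.randn(4096 * 128).astype(np.float32)
q, scale = quantize_4bit(w)
print(f"mean abs error: {np.abs(dequantize(q, scale) - w).mean():.4f}")
```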

With the 30B model at 4-bit on an RTX 4090, I'm seeing numbers like:

Output generated in 4.17 seconds (4.03 tokens/s, 21 tokens)

Output generated in 4.38 seconds (4.25 tokens/s, 23 tokens)

Output generated in 4.57 seconds (4.25 tokens/s, 24 tokens)

Output generated in 3.86 seconds (3.40 tokens/s, 17 tokens)

The lower sizes (7B, 13B) are even faster and use less memory. A 3080 with 16 GB of system RAM should be able to run the 13B model at 4-bit just fine at a reasonable speed (>1 token/s).
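Those numbers also line up with the usual back-of-the-envelope limit: single-stream generation is memory-bandwidth bound, because every token has to stream all the weights through GPU memory at least once. So the floor on per-token latency is roughly the quantized weight size divided by memory bandwidth. A rough sketch (the bandwidth figures are approximate spec values, not measurements):

```python
# Lower bound on decode latency: each generated token reads all weights once,
# so seconds/token >= weight_bytes / memory_bandwidth.
# KV-cache traffic and kernel overhead are ignored here.
def min_seconds_per_token(n_params, bits, bandwidth_bytes_per_s):
    return (n_params * bits / 8) / bandwidth_bytes_per_s

for name, params in [("7B", 7e9), ("13B", 13e9), ("30B", 30e9)]:
    t_3080 = min_seconds_per_token(params, 4, 760e9)   # ~760 GB/s (3080 10GB)
    t_4090 = min_seconds_per_token(params, 4, 1008e9)  # ~1008 GB/s (4090)
    print(f"{name}: >= {t_3080 * 1e3:.1f} ms/token on a 3080, "
          f">= {t_4090 * 1e3:.1f} ms/token on a 4090")
```

That floor is far below the <1 s/token target, and well below the ~250 ms/token observed in the 30B logs above, which suggests most of the remaining cost is in the 4-bit dequantization kernels and the generation loop rather than the hardware itself.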