I've used both the 7B and 13B instruction-tuned LLaMA weights (quantized using the llama.cpp scripts). Either I am doing something wrong, or these two models are nowhere near the level of ChatGPT. Many times they return something totally irrelevant to my question, stop responding, switch to a different language, or otherwise give the wrong answer. ChatGPT does none of this (other than the occasional wrong answer due to hallucination).
Reading through the README and issues on the llama.cpp project, there is some speculation that there is a bug in the quantization, or possibly in the inference code (less likely, I think).
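For anyone curious what the quantization step actually does (and why a bug there could quietly wreck output quality), here is a minimal sketch of block-wise 4-bit quantization in the spirit of the q4_0 format. This is not llama.cpp's actual code, just an illustration of the scale-and-round scheme and the rounding error it introduces:

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified sketch: weights are grouped into blocks, and each block
 * stores one float scale plus one 4-bit integer per value.
 * The block size of 32 matches q4_0, but the details here are illustrative. */
#define BLOCK 32

void quantize_block(const float *x, float *scale, int8_t *q) {
    /* Find the largest-magnitude value in the block. */
    float amax = 0.0f;
    for (int i = 0; i < BLOCK; i++) {
        float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    /* Map the range [-amax, amax] onto the signed 4-bit range [-8, 7]. */
    *scale = amax / 7.0f;
    float inv = (*scale != 0.0f) ? 1.0f / *scale : 0.0f;
    for (int i = 0; i < BLOCK; i++) {
        int v = (int)roundf(x[i] * inv);
        if (v < -8) v = -8;
        if (v >  7) v =  7;
        q[i] = (int8_t)v;   /* on disk these would be packed two per byte */
    }
}

void dequantize_block(float scale, const int8_t *q, float *y) {
    for (int i = 0; i < BLOCK; i++) y[i] = q[i] * scale;
}

int main(void) {
    float x[BLOCK], y[BLOCK], scale;
    int8_t q[BLOCK];
    for (int i = 0; i < BLOCK; i++) x[i] = sinf(0.3f * i);  /* dummy weights */
    quantize_block(x, &scale, q);
    dequantize_block(scale, q, y);
    for (int i = 0; i < 4; i++)
        printf("x=% .4f  y=% .4f  err=% .4f\n", x[i], y[i], x[i] - y[i]);
    return 0;
}
```

Even a correct implementation of this loses precision on every weight; a sign error, wrong block size, or bad scale in a real implementation would be far more destructive, which is the kind of bug people are speculating about.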
I hope this is true, and that once it's fixed the models can perform at or past the ChatGPT level. If it's not true and these models are performing correctly, then either the metrics used to compare them to GPT are garbage and don't capture real-world use, or the instruction tuning done by the Stanford team is not up to par.