The most interesting snippet in the paper, I think, is this:
> For instance, LLaMA-13B outperforms GPT-3 on most benchmarks, despite being 10× smaller. We believe that this model will help democratize the access and study of LLMs, since it can be run on a single GPU.
How about alignment, i.e. the ability to answer prompt queries, and chain-of-thought reasoning capabilities?
Without the RLHF fine-tuning phase that made InstructGPT what it is, I'm assuming it won't be as good as ChatGPT. Is that right?
How hard would it be to fine-tune the 65B model on commodity hardware?
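For a rough sense of scale, here is a back-of-envelope sketch of the memory involved. The numbers are assumptions, not figures from the paper: nominal parameter counts (13e9 and 65e9), 2 bytes per parameter for fp16 weights, and the common rule of thumb of about 16 bytes per parameter for full fine-tuning with Adam in mixed precision. Activations and framework overhead are ignored, and the helper names are made up for illustration.

```python
# Back-of-envelope memory estimate. All byte counts are rule-of-thumb assumptions.
GB = 1e9  # decimal gigabytes


def weights_gb(n_params, bytes_per_param=2):
    """Memory for the weights alone (fp16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param / GB


def full_finetune_gb(n_params, bytes_per_param=16):
    """Mixed-precision Adam: fp16 weights (2) + fp16 gradients (2)
    + fp32 master weights, momentum, and variance (4 + 4 + 4) = 16 bytes/param.
    Activation memory is not counted."""
    return n_params * bytes_per_param / GB


# Nominal parameter counts taken from the model names.
for name, n in [("LLaMA-13B", 13e9), ("LLaMA-65B", 65e9)]:
    print(f"{name}: weights ~{weights_gb(n):.0f} GB, "
          f"full fine-tune ~{full_finetune_gb(n):.0f} GB")
```

Under these assumptions the 65B weights alone come to roughly 130 GB in fp16, and full fine-tuning lands around a terabyte before activations, so it would have to be sharded across many devices or done with a cheaper method than full fine-tuning.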
Found answer here:
> Out-of-scope use cases: LLaMA is a base, or foundational, model. As such, it should not be used on downstream applications without further risk evaluation and mitigation. In particular, our model has not been trained with human feedback, and can thus generate toxic or offensive content, incorrect information or generally unhelpful answers.
https://github.com/facebookresearch/llama/blob/main/MODEL_CA...
Yes, but fine-tuning with RL is not expected to be hard. You're essentially limited by how much human feedback is available, so it's very different from training the foundational model on random bulk data.
Of course, but even so, the scale, clean data, and human supervision needed may be significant. OpenAI reportedly used an army of humans to write question-answer prompts and rate the model output.
They kept the details closely guarded and only hinted at how they did RLHF and transitioned the architecture to self-supervised learning.