What does Hacker News think of WizardLM?

A family of instruction-following LLMs powered by Evol-Instruct: WizardLM, WizardCoder, and WizardMath

Language: Python

This post is misleading, in a way that is hard to do accidentally.

  - They compare the performance of this model to the weakest 7B Code Llama model.  The base Code Llama 7B Python model scores 38.4% on HumanEval, versus the non-Python model, which scores only 33%.
  - They compare their instruction-tuned model to non-instruction-tuned models.  Instruction tuning can add 20% or more to HumanEval performance.  For example, WizardLM 7B scores 55% on HumanEval [1], and I've trained a 7B model that scores 62% [2].
  - For another example of instruction tuning, the instruction-tuned StableCode benchmarks at 26%, not the 20% they cite for the base model [3]
  - StarCoder, when prompted properly, scores 40% on HumanEval [4]
  - They do not report their base model performance (as far as I can tell)

This is interesting work, and a good contribution, but it's important to compare similar models.
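For context on the percentages quoted above: HumanEval results are usually reported as pass@k, computed with the unbiased estimator from the original Codex paper. A minimal sketch (the function name is mine; `n` is samples generated per problem, `c` of which pass the tests):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn (without replacement) from n generations is correct,
    given that c of the n generations pass the unit tests."""
    if n - c < k:
        # Fewer incorrect samples than k: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Per-problem values are then averaged over the 164 HumanEval tasks, so small differences in sampling setup (temperature, number of samples, prompt format) can easily move a model a few points, which is part of why cross-paper comparisons are fraught.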

[1] https://github.com/nlpxucan/WizardLM

[2] https://huggingface.co/vikp/llama_coder

[3] https://stability.ai/blog/stablecode-llm-generative-ai-codin...

[4] https://github.com/huggingface/blog/blob/main/starcoder.md

This appears to be a web frontend with authentication for Azure's OpenAI API, which is a great choice if you can't use ChatGPT or its API at work.

If you're looking to try the "open" models like Llama 2 (or its uncensored variant, Llama 2 Uncensored), check out https://github.com/jmorganca/ollama or some of the lower-level runners like llama.cpp (which powers the aforementioned project I'm working on) or Candle, the new project by Hugging Face.

What are folks' takes on this vs. Llama 2, which was recently released by Facebook Research? While I haven't tested it extensively, the 70B model is supposed to rival ChatGPT 3.5 in most areas, and there are now some new fine-tuned versions that excel at specific tasks, like coding (the 'codeup' model) or the new WizardMath (https://github.com/nlpxucan/WizardLM), which claims to outperform ChatGPT 3.5 on grade-school math problems.

It's not quite so trivial to implement this solution. Supervised (SL) instruction tuning actually needs a lot of examples, and only recently have approaches emerged to automate generating them, like WizardLM's Evol-Instruct: https://github.com/nlpxucan/WizardLM
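The core Evol-Instruct idea is to have an LLM repeatedly rewrite seed instructions into harder or more varied ones, then collect responses to build an SFT dataset. A rough sketch of that loop, with my own template wording and a `complete` callable standing in for whatever chat-completion API you use (none of these names come from the WizardLM codebase):

```python
import random

# Hypothetical evolution prompts in the spirit of the WizardLM paper;
# the real system uses several carefully engineered templates.
EVOLVE_TEMPLATES = [
    "Rewrite this instruction so it requires deeper reasoning:\n{instr}",
    "Add one realistic constraint to this instruction:\n{instr}",
    "Make this instruction more specific and concrete:\n{instr}",
]

def evolve_dataset(seeds, complete, rounds=2):
    """Grow a pool of instructions by `rounds` of LLM-driven rewriting,
    then pair every instruction with a model response for SFT."""
    pool = list(seeds)
    for _ in range(rounds):
        evolved = []
        for instr in pool:
            prompt = random.choice(EVOLVE_TEMPLATES).format(instr=instr)
            evolved.append(complete(prompt))
        pool.extend(evolved)  # pool doubles each round
    # (instruction, response) pairs become the supervised training data
    return [(instr, complete(instr)) for instr in pool]
```

The real pipeline also filters out failed or degenerate evolutions before training, which this sketch omits.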

To apply my solution, these approaches would have to be adapted to more complex training examples involving quoted text with prompt injection attempts.

A similar point holds for RL. I actually think it is much cleaner to solve this during instruction tuning, but perhaps we also need some RL. That normally requires training a reward model with large amounts of human feedback. Alternative approaches like Constitutional AI would first have to be adapted to cover quotes containing prompt injection attacks.
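For reference, the reward model mentioned here is typically trained on pairwise human preferences with a Bradley-Terry-style objective. A minimal, framework-free sketch of that loss on a single preference pair (real training would use a deep model's scalar outputs and batched gradient descent):

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss for reward-model training:
    -log sigmoid(r_chosen - r_rejected). Minimizing it pushes the
    reward of the human-preferred response above the rejected one."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Adapting this to the prompt-injection case would mean collecting preference pairs where the preferred response ignores instructions embedded in quoted text, which is exactly the labeling effort the comment is pointing at.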

Probably doable, but it takes some time and effort, and in the meantime prompt injection doesn't seem to be a big practical issue.