It costs $0.001 per 1K tokens, which is slightly cheaper than GPT-3.5-turbo. I just tested it and it shows significantly worse results on the tasks in my pipelines. Not a game changer, unfortunately.

This seems to be the general pattern so far. A particular benchmark shows better or equal performance for a non-OpenAI model, but then someone else tries a different task and it's not even close.

I think that's really significant. It's arguably a case for fine-tuning models for specific tasks, which is great for teams with ML experience. But product engineering teams without ML engineers can just use a foundation model and get strong performance at low cost.

Something to realize is that different models require different prompting styles. You can't prompt non-GPT-4 models with stylistic tics tuned for GPT-4 and expect similar results.

I've gotten great performance from Llama 2 derivatives. Out-of-the-box performance is not near GPT-4, but it is still very strong in its own right. And if you can break down your problem so that precise logit control, coupled with guidance from forward or backward chaining over knowledge graphs, is applicable, you can easily exceed GPT-4's reasoning ability for your domain. No fine-tuning necessary either.

I've been getting useful things out of LLMs since the days of RoBERTa and raw T5, when "large" meant hundreds of millions of parameters. I am flabbergasted when people say a 7B-parameter model is no good for them.

> precise logit control coupled with guidance from forward or backwards chaining on knowledge graphs

What do you mean by this?

I'm not 100% sure they're talking about this specifically, but logit control/manipulation is often used to force output to conform to a specific schema.

https://github.com/guidance-ai/guidance

I'm going to butcher this explanation: after the model has produced logits for the next token, but before you sample from them, you check which tokens conform to your schema. If you want the only two options to be "true" or "false", you take the logits of every invalid token and lower them manually (typically to negative infinity), so only the valid tokens can be sampled.
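Here's a minimal sketch of that idea with Hugging Face transformers (gpt2 is just a stand-in model; this is not how guidance implements it internally): every token except the allowed ones gets its logit pushed to negative infinity before we pick the next token.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the idea applies to any causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Is the sky blue? Answer:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # logits for the next token

# Only allow the first token of " true" and " false".
allowed_ids = [tokenizer.encode(" true")[0], tokenizer.encode(" false")[0]]

mask = torch.full_like(logits, float("-inf"))
mask[allowed_ids] = 0.0
constrained = logits + mask  # every disallowed token now has -inf logit

next_id = torch.argmax(constrained).item()
print(tokenizer.decode([next_id]))  # " true" or " false", nothing else
```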

Another example: structures like JSON can be validated incrementally, so when your output so far is `{"name": "Carl"` you lower the probability of `{`, since that would invalidate the JSON. In fact, the only valid continuations you'd likely have left are `,`, whitespace, and `}`.
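A toy illustration of the JSON case (purely illustrative and not how guidance does it; a real constrained decoder tracks parser/grammar state rather than guessing completions): filter candidate continuations by whether the partial JSON could still be completed into something that parses.

```python
import json

partial = '{"name": "Carl"'
candidates = ['{', ',', ' ', '}', ']']

def could_still_be_valid(text: str) -> bool:
    # Crude check: try a handful of plausible completions and see if any parses.
    for suffix in ("", "}", ' "x": 1}'):
        try:
            json.loads(text + suffix)
            return True
        except json.JSONDecodeError:
            continue
    return False

allowed = [c for c in candidates if could_still_be_valid(partial + c)]
print(allowed)  # [',', ' ', '}']
```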