What does HackerNews think of evals?

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

Language: Python

OpenAI open-sourced their evals framework. You can use it to evaluate not only different models but also your entire prompt chain setup. https://github.com/openai/evals

They also have a registry of evals built in.
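
To make that concrete, here is a minimal sketch of the JSONL sample format the registry's basic match-style evals use: each line holds a chat-formatted "input" and an "ideal" expected answer. The file name and sample contents below are illustrative only, and other eval types use richer schemas.

    import json

    # Each line of a samples file pairs a chat-formatted prompt ("input")
    # with the expected answer ("ideal").
    samples = [
        {
            "input": [
                {"role": "system", "content": "Answer with a single word."},
                {"role": "user", "content": "What is the capital of France?"},
            ],
            "ideal": "Paris",
        },
    ]

    # Write one JSON object per line; the path is just an example.
    with open("samples.jsonl", "w") as f:
        for sample in samples:
            f.write(json.dumps(sample) + "\n")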

"What if" is all these "existential risk" conversations ever are.

Where is your evidence that we're approaching human-level AGI, let alone superintelligence? Because ChatGPT can (sometimes) approximate sophisticated conversation and deep knowledge?

How about some evidence that ChatGPT isn't even close? Just clone and run OpenAI's own evals repo https://github.com/openai/evals on the GPT-4 API.

It performs terribly on novel logic puzzles and exercises that a clever child could learn to do in an afternoon (there are some good chess evals, and I submitted one asking it to simulate a Forth machine).
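
For anyone who wants to reproduce that kind of check, a rough sketch of "clone and run" follows, assuming you have installed the repo (pip install -e .), set OPENAI_API_KEY, and want to start from "test-match", the small example eval in the repo README:

    import os
    import subprocess

    # Assumes the evals repo is cloned and installed (pip install -e .) and
    # that an OpenAI API key with GPT-4 access is set in the environment.
    assert "OPENAI_API_KEY" in os.environ, "set OPENAI_API_KEY first"

    # Run one registered eval against GPT-4; swap in any other eval name
    # from the registry (e.g. a chess or logic-puzzle eval).
    subprocess.run(["oaieval", "gpt-4", "test-match"], check=True)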

You can get GPT-4 access by submitting an eval if it gets merged (https://github.com/openai/evals). Here's the one that got me access[1]

Although from the blog post it looks like they're planning to open up access to everyone soon, so that may happen before you get through the evals backlog.

1: https://github.com/openai/evals/pull/778

Quality isn’t unquantifiable. Search engines have been quantifying the quality of their results for a long time. OpenAI recently released a package for quantifying an LLM’s quality: https://github.com/openai/evals

I don’t know what to say about your CTO. People are definitely overestimating how useful GitHub Copilot is, at least (if they say 2x, they’re wrong). But there is no doubt that these products are changing the world and making us more productive.

>Does ChatGPT have a large test suite consisting of a large number of input questions and expected responses that have to match?

They are trying to crowdsource this with OpenAI evals. https://github.com/openai/evals

I'm sure they have a lot of internal benchmarks too, but of course they won't share them.

>If not, there can be no "bug" in ChatGPT.

I don't understand the objection. Are you claiming that bugs only exist if you have test cases for them?
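
For context on the quoted question, the "expected responses that have to match" pattern is roughly what the registry's basic match evals encode. A toy, illustrative version of that check (not the repo's actual implementation) looks like this:

    # Toy match-style check: the sampled answer must start with the expected
    # "ideal" string. The real framework also offers includes, fuzzy-match,
    # and model-graded variants.
    def matches(sampled: str, ideal: str) -> bool:
        return sampled.strip().startswith(ideal.strip())

    assert matches("Paris, the capital of France.", "Paris")
    assert not matches("I think it is Lyon.", "Paris")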

OpenAI has been collecting a ton of evals here https://github.com/openai/evals, many of which include comments about how well GPT-4 does vs GPT-3.5.

You could clone that repo, adapt the oaieval script to run against different APIs, then run the evals against both and compare the results.
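
A rough sketch of that comparison, staying outside the evals internals: load a samples file in the JSONL format shown earlier, ask each backend the same questions, and compare exact-match accuracy. Here ask_openai assumes the pre-1.0 openai Python client that the repo targeted at the time, and ask_other_api is a hypothetical placeholder for whatever second API you want to compare against.

    import json
    import openai  # pre-1.0 client (openai.ChatCompletion)

    def ask_openai(model: str, messages: list) -> str:
        resp = openai.ChatCompletion.create(model=model, messages=messages)
        return resp["choices"][0]["message"]["content"]

    def ask_other_api(messages: list) -> str:
        # Hypothetical placeholder: call whatever non-OpenAI API you want to compare.
        raise NotImplementedError

    def accuracy(answer_fn, samples) -> float:
        hits = sum(
            1 for s in samples
            if answer_fn(s["input"]).strip().startswith(s["ideal"])
        )
        return hits / len(samples)

    samples = [json.loads(line) for line in open("samples.jsonl")]
    print("gpt-4:", accuracy(lambda m: ask_openai("gpt-4", m), samples))
    print("gpt-3.5-turbo:", accuracy(lambda m: ask_openai("gpt-3.5-turbo", m), samples))
    # print("other:", accuracy(ask_other_api, samples))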

There is also a lot of work on benchmarking for AI. This is where things like ResNet come from.

But the point of using these tests for AI is precisely the same reason we give them to humans -- we think we know what they measure. AI is not intended to be a computation engine or a number-crunching machine. It is intended to do things that historically required "human intelligence".

If there are better tests of human intelligence, I think that the AI community would be very interested in learning about them.

See: https://github.com/openai/evals

summary:

1. GPT-4 is multimodal (text + image inputs => text outputs). This is being released piecemeal - text input is available first to ChatGPT Plus subscribers https://beta.openai.com/docs/api-reference/generations/creat..., and via the API https://beta.openai.com/docs/api-reference/introduction with a waitlist (https://openai.com/waitlist/gpt-4-api). Image capability is being released via https://www.bemyeyes.com/.

2. GPT-4 exhibits human-level performance on various benchmarks (for example, it passes a simulated bar exam with a score around the top 10% of test takers; in contrast, GPT-3.5’s score was around the bottom 10%; see the visual at https://twitter.com/swyx/status/1635689844189036544)

3. GPT-4 training used the same Azure supercomputer as GPT-3.5, but was a lot more stable: "becoming our first large model whose training performance we were able to accurately predict ahead of time."

4. Also open-sourcing OpenAI Evals https://github.com/openai/evals, a framework for automated evaluation of AI model performance, to allow anyone to report shortcomings in OpenAI models to help guide further improvements.

Paper: https://cdn.openai.com/papers/gpt-4.pdf