I think it's interesting that they've benchmarked it against an array of standardized tests. LLMs seem particularly well suited to this kind of test, since it's a simple prompt:response format, but I have to say... those results are terrifying, especially considering the rate of improvement. Bottom 10% to top 10% on the LSAT in less than one generation? +100 points on SAT reading, writing, and math? Top 1% in GRE reading?

What are the implications for society when general thinking, reading, and writing become like chess? Even the best humans in the world can only hope to be 98% accurate in their moves (and the idea of 'accuracy' only exists because we have engines that know, unequivocally, the best move), and only when playing against other humans; there is no hope of defeating even less advanced models.
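
One way to make that notion of accuracy concrete: engines score each move by its centipawn loss, i.e. how much worse the engine judges the played move to be than its own best move. Sites then map the average loss onto a 0-100% "accuracy" scale with formulas that vary and aren't always public, so the sketch below stops at the raw average; the function names and numbers are illustrative only.

```python
def centipawn_loss(best_eval_cp, played_eval_cp):
    """Loss for one move: how many centipawns worse the position is
    after the played move than after the engine's best move, floored
    at zero (you can't do better than the best move)."""
    return max(0, best_eval_cp - played_eval_cp)

def average_centipawn_loss(moves):
    """moves: (best_eval_cp, played_eval_cp) pairs, both evaluated
    from the mover's point of view. 0 means engine-perfect play."""
    return sum(centipawn_loss(b, p) for b, p in moves) / len(moves)

# Three moves; the second gives up half a pawn relative to the best move.
game = [(30, 30), (15, -35), (80, 80)]
print(average_centipawn_loss(game))  # ~16.7 centipawns per move
```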

What happens when ALL of our decisions can be assigned an accuracy score?

We benchmark humans with these tests -- why would we not do that for AIs?

The implications for society? We'd better up our game.

> We benchmark humans with these tests – why would we not do that for AIs?

Because the correlation between the thing of interest and what the tests measure may be radically different for systems whose architecture is very unlike a human's than it is for humans.

There's an entire field devoted to this in testing for humans (psychometrics), and approximately zero equivalent work for AIs. Human tests are proxy measures of harder-to-directly-assess figures of merit, and they require significant calibration on humans to be valid even for humans. Blindly applying them to anything else without appropriate recalibration is good for generating headlines, but not for measuring anything that matters. (Except, I guess, the impact of humans using these models to cheat on the human tests, which is not insignificant, but not generally what people trumpeting these scores focus on.)

There is also a lot of work on benchmarking within AI itself; that's where things like ResNet came from (it was developed for, and validated against, the ImageNet benchmark).

But the point of using these tests for AI is precisely the reason we give them to humans: we think we know what they measure. AI is not intended to be a computation engine or a number-crunching machine. It is intended to do things that historically required "human intelligence".

If there are better tests of human intelligence, I think that the AI community would be very interested in learning about them.

See: https://github.com/openai/evals
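
For a flavor of what's in there: most of these evals reduce to sending a prompt to the model and comparing its completion against a reference answer, with exact match as the simplest grading rule. A minimal sketch of that pattern (the sample fields and stub model below are illustrative, not the actual schema or API of the evals repo):

```python
def exact_match_accuracy(samples, ask_model):
    """Fraction of samples where the model's answer exactly matches
    the reference ('ideal') answer after trimming whitespace."""
    correct = 0
    for sample in samples:
        answer = ask_model(sample["question"]).strip()
        correct += answer == sample["ideal"].strip()
    return correct / len(samples)

# Stub model for demonstration; a real harness would call an LLM API here.
def stub_model(question):
    return "4" if "2 + 2" in question else "Paris"

samples = [
    {"question": "2 + 2 = ?", "ideal": "4"},
    {"question": "What is the capital of France?", "ideal": "Paris"},
]
print(exact_match_accuracy(samples, stub_model))  # prints 1.0
```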