Looking at the ARC problems that the model didn't solve correctly, I honestly have no idea in some of them how the model went wrong or what the solution should have been given the training examples.
How well do humans do on this challenge?
It's quite hard. You can download the dataset here [1] and it comes with a little webpage so that you can try it yourself.
It's worth noting that you are allowed to make three guesses.