Looking at the ARC problems that the model didn't solve correctly, I honestly have no idea in some of them how the model went wrong or what the solution should have been given the training examples.

How well does a human being do on this challenge?

It's quite hard. You can download the dataset here [1] and it comes with a little webpage so that you can try it yourself.

It's worth noting that you are allowed to make three guesses.
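
To make the three-guess rule concrete, here is a rough sketch of how a test output could be scored. It assumes a grid is a list of rows of integer colour codes and that a guess only counts if it matches the expected grid exactly; the function name and the example grids are made up for illustration and are not taken from the dataset or its tooling.

```python
from typing import List

# Assumption: a grid is a list of rows, each row a list of integer colour codes.
Grid = List[List[int]]


def solved_within_guesses(guesses: List[Grid], target: Grid, max_guesses: int = 3) -> bool:
    """Return True if any of the first `max_guesses` grids equals the target exactly."""
    # Guesses beyond the allowed number are ignored.
    return any(guess == target for guess in guesses[:max_guesses])


# Hypothetical example: the second guess is correct, so the task counts as solved.
target = [[0, 1], [1, 0]]
guesses = [
    [[1, 1], [1, 0]],  # wrong: top-left cell differs
    [[0, 1], [1, 0]],  # matches the target
]
print(solved_within_guesses(guesses, target))
```

Under this reading, a partially correct grid earns nothing, which is part of why the tasks feel hard even with three attempts.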

[1]: https://github.com/fchollet/ARC