It's worth noting that you are allowed to make three guesses.
EDIT: the GitHub page has an actual example. You see three input/output pairs to learn the rule, which you then apply to a fourth input. So the task is slightly different on each test case. https://github.com/fchollet/ARC
The ARC 1 dataset is here. https://github.com/fchollet/ARC
The ARC 2 dataset is crowdsourced. If you can come up with a challenging task, please contribute it.
It's not impossible, but I'm concerned that neural networks will only be used as an excuse to not have to understand anything anymore (not just in biology, but in all of the sciences). That would be a terrible outcome: a dumbing down of science and an eventual loss of our ability to understand the world. As you say, time will be the judge of that.
>> Show me that applied to: https://github.com/fchollet/ARC
I had a look at ARC back when François Chollet's paper came out that presented it. I was interested because a) it's the kind of problem that ILP (Inductive Logic Programming) eats for breakfast and b) Chollet's name is well-known and it would be a chance to get our work noticed by people who normally wouldn't notice it.
However.
ARC is proposed as a benchmark that is hard for current big-data approaches and that would need elements of intelligent reasoning to solve (if I understand correctly, that's why you brought it up yourself?). Such benchmarks have been proposed before, in particular the Bongard problems [1], in machine vision, and the Winograd schemas [2], in NLP. As with ARC, the "defense" of such datasets against dumb, no-reasoning, big-data approaches is that there are few examples of each problem task. And yet, on both Bongard problems and Winograd schemas, neural nets have now achieved high accuracy. How did they do it? They did it by cheating: instead of training on the original (and few) Bongard problems, for example, people created generators for similar problems that could produce many thousands of examples [3]. That way, the neural-net folks had their big data and could train their big networks. Same for Winograd schemas.
... And the same thing has already happened, at a preliminary stage, with ARC:
https://arxiv.org/abs/2011.09860
In the paper I link above, the authors use a data augmentation technique that consists of rotations and colour transformations of problem tasks. It should be obvious that this adds no useful information that can help any attempt at solving ARC problems with reasoning, and that it only serves to help a neural net better overfit to the training tasks. And yet, the system in the paper achieves good results in a small selection of ARC tasks. Admittedly, that is a very small selection (only 10x10 grids and not on any of the held-out test set that only François Chollet has access to) but the important point is that it is possible to make progress in solving ARC without actually showing any reasoning ability, despite the claims of the Kolev et al. paper above; and despite Chollet's intent for ARC to avoid exactly that.
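(For concreteness, here's a minimal sketch of what that kind of augmentation could look like on ARC-style grids: the same rotation and colour permutation applied to an input/output pair. This is my own illustration in Python with numpy, using a plain integer-grid representation; it's not the code from the paper.)

    import numpy as np

    def augment_pair(inp, out, rng):
        """Apply one random rotation and one random colour permutation,
        consistently, to an input/output grid pair. The transformation
        preserves whatever rule maps input to output, so it adds no new
        information about the rule itself."""
        inp, out = np.asarray(inp), np.asarray(out)
        k = rng.integers(0, 4)          # rotate by k * 90 degrees
        inp, out = np.rot90(inp, k), np.rot90(out, k)
        perm = rng.permutation(10)      # relabel the 10 ARC colours
        return perm[inp], perm[out]

    rng = np.random.default_rng(0)
    aug_in, aug_out = augment_pair([[0, 1], [2, 3]], [[3, 2], [1, 0]], rng)

The point being that every augmented pair is just a relabelling of an existing one, which is exactly why it helps a network fit the training tasks without giving a reasoning-based solver anything new to work with.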
This is pretty standard in deep learning and the reason why I gripe about benchmarks in my earlier comment. Deep learning systems consistently beat benchmarks designed to test abilities that deep learning systems don't have. They do this by cheating, enabled by weaknesses in said benchmarks. They cheat by overfitting to surface statistical regularities of the data (see earlier cited papers), thus learning "shortcuts" around the intended difficulties of the benchmarks. The trained systems are useless against any other dataset and their performance degrades precipitously once exposed to real-world data, but that doesn't matter, because a team that trains such a system to a new SOTA result will get to publish a paper and claim its academic brownie points, and it's very difficult to argue with these "results" without beating the benchmarks yourself, because of the publishing climate described in Hinton's quote, above.
Of course, if you want to beat the benchmarks yourself, you have two options. One is to also cheat and throw a bunch of data at the problem without attempting to understand it or really solve it. The other is to do it the "right" way: to try to design a system that really demonstrates the abilities the benchmark is supposed to be measuring. That can sometimes be done, but it takes a lot of time and effort. In the case of ARC, that means learning core priors.
If you've read the Chollet paper that introduced ARC, there's a section on "core priors" that the author says a system must possess before it can solve the ARC tasks. Those core priors must be learned (or worse, coded in by hand). So anyone who decides to solve ARC the way its creator intended will need to spend a great deal of time teaching their system core priors, even before they can start on the actual problems. Meanwhile, some big, 30-person team at a large tech corp with a few million dollars' budget will be throwing a gigaflop of compute and a few terabytes of data at the problem and "solving" it without having to learn any "core priors", by finding some way around the dearth of training data, just as was done with Bongard problems and Winograd schemas. Then anyone who has published any earlier results obtained "the hard way" will be left looking like a loser. Why would I want to subject myself to such a humiliation and drag the reputation of my field through the mud?
So, no, I'm not "showing" you anything. Sorry. If you're interested in the results of ILP and MIL systems, or my own work, check out the ILP and MIL literature. Ask me if you want and I can give you some pointers.
__________________
[1] https://en.wikipedia.org/wiki/Bongard_problem
[2] https://en.wikipedia.org/wiki/Winograd_schema_challenge
[3] https://link.springer.com/chapter/10.1007%2F978-3-319-44781-...
See Section 2.1 "The Dataset": "For each class of each problem we generate 20000 training images. We also generate an additional 10000 images per class per problem as a testing set."
P.S. This paper compares GPT-2 and GPT-3 to Louise (the system I created for my PhD) and also to humans and Magic Haskeller (an Inductive Functional Programming system that learns programs in Haskell) on some programming tasks in P3 (a Turing-complete language with only 7 instructions, kind of a precursor of Brainfuck). I'm not affiliated with the authors:
https://proceedings.neurips.cc/paper/2021/hash/0cd6a652ed1f7...
Louise does alright, outperforming all three other systems and humans on the "very high complexity" tasks, but GPT-3 outperforms it on simpler tasks. In truth, Louise can do better than that, especially with a newer version that was not available to the authors and has improved one-shot learning capabilities. On the other hand, the comparison is unequal: I was able to find the programming tasks and their solutions in a GitHub repository predating the publication of GPT-3, so GPT-3 most likely ingested and memorised them, while it probably also benefited from the many examples of similar programming tasks in Brainfuck that can be found online. Louise, as usual, learned in true one-shot fashion, without having seen examples of the tasks or their solutions before.