"Numbers talk, bullshit walks." (L. Torvalds)

Let's see the performance of ART on some benchmark tasks, then we can talk.

I appreciate that you're saying something that you think is only common sense, but I suspect that is because 99% of the machine learning papers you've seen do nothing but claim a new SOTA result on some set of benchmarks. Yet just because that's what (almost) everyone is doing doesn't mean it's good research. Quite the contrary: that's the reason why machine learning research has become a quagmire and progress has stalled.

Here's Geoff Hinton on benchmarks (again; I cite that all the time):

GH: One big challenge the community faces is that if you want to get a paper published in machine learning now it's got to have a table in it, with all these different data sets across the top, and all these different methods along the side, and your method has to look like the best one. If it doesn’t look like that, it’s hard to get published. I don't think that's encouraging people to think about radically new ideas.

Now if you send in a paper that has a radically new idea, there's no chance in hell it will get accepted, because it's going to get some junior reviewer who doesn't understand it. Or it’s going to get a senior reviewer who's trying to review too many papers and doesn't understand it first time round and assumes it must be nonsense. Anything that makes the brain hurt is not going to get accepted. And I think that's really bad.

What we should be going for, particularly in the basic science conferences, is radically new ideas. Because we know a radically new idea in the long run is going to be much more influential than a tiny improvement. That's I think the main downside of the fact that we've got this inversion now, where you've got a few senior guys and a gazillion young guys.

https://www.wired.com/story/googles-ai-guru-computers-think-...

And here's a recent paper noting yet another way machine learning benchmarks are borked:

https://arxiv.org/abs/2003.08907

P.S. And, please leave that aggressive attitude out of research discussions ("bullshit walks"?). Science is not the constant comparison of dick measurements. That's perhaps the point of sports, but anything that stifles debate and inhibits the development of new ideas is anti-science.

How has the progress stalled?

Just last year there was a breakthrough on the decades-old problem of protein folding. The year before that, we found out that large language models have astonishing few-shot learning abilities. CLIP brought strong zero-shot capabilities to image recognition. Progress in generating images from text descriptions has been nothing short of astonishing with Dall-E and GLIDE from OpenAI.

Not to mention breakthroughs in game-theoretic problems like Counterfactual Regret Minimization, and the more recent application of deep learning to make CFR practical in actual games. Or the continued progress in the AlphaGo lineage: AlphaZero, MuZero, and, last year, EfficientZero and Player of Games.

With the exception of counterfactual regret minimisation, which has nothing to do with deep learning, all of those are applications of existing approaches that have become possible because of the increased expenditure of resources: processing power and data. There have been no theoretical advances, no algorithmic advances to speak of, no new knowledge learned in the last 20 years of research in deep learning. All that has been achieved is new SOTA on old benchmarks [1].

To make an analogy, it's as if we could sort ever larger lists because we keep sorting them with ever bigger computers... but the only sorting algorithm we know is bubblesort.
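
A toy sketch of the analogy (Python; the exact timings are machine-dependent and beside the point): doubling the input size roughly quadruples bubblesort's running time, while an O(n log n) sort barely more than doubles. A bigger computer only moves the constant; a better algorithm changes the growth rate.

    # Illustrative only: scaling hardware speeds up bubblesort by a constant
    # factor; a better algorithm changes the growth rate itself.
    import random
    import timeit

    def bubblesort(xs):
        xs = list(xs)
        n = len(xs)
        for i in range(n):
            for j in range(n - 1 - i):
                if xs[j] > xs[j + 1]:
                    xs[j], xs[j + 1] = xs[j + 1], xs[j]
        return xs

    def mergesort(xs):
        if len(xs) <= 1:
            return list(xs)
        mid = len(xs) // 2
        left, right = mergesort(xs[:mid]), mergesort(xs[mid:])
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        return merged + left[i:] + right[j:]

    for n in (1_000, 2_000, 4_000):
        data = [random.random() for _ in range(n)]
        t_bubble = timeit.timeit(lambda: bubblesort(data), number=1)
        t_merge = timeit.timeit(lambda: mergesort(data), number=1)
        # Doubling n roughly quadruples bubblesort's time (O(n^2)) but only
        # slightly more than doubles mergesort's (O(n log n)).
        print(f"n={n}: bubblesort {t_bubble:.3f}s, mergesort {t_merge:.3f}s")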

You bring up large language models. Language models are about as old as computational linguistics [6]. Large language models are... large. There's nothing new about them. "Attention" is an anthropomorphically misnamed engineering tweak, of the kind that was rightly lambasted by Drew McDermott forty years ago [7]. Self-play and reinforcement learning are now more than 60 years old [8].

As to "few shot" learning, that is said for a language model previously trained on a copy of the entire web! In what sense is that "few shot"? It is truly shocking to hear researchers make such obviously false claims with a straight face. And that they are rewarded with publication of their work is nothing less than a scandal.

Deep learning research is going nowhere. It is stuck in a rut. It is dead. It has given up the spirit. It is an ex-research field. This research field is a goner. It is dead, Jim. And I'm sure it will be decades before people start to say it out loud (the current generation of researchers will leave the dirty job to their students), but that doesn't change a thing.

______________

[1] Except in identifying new weaknesses of deep learning systems, as for example Overinterpretation in the paper I cite above, adversarial examples [2], shortcut learning [3], "the elephant in the room" [4], being "right for the wrong reasons" [5] etc etc etc etc.

[2] https://arxiv.org/abs/1312.6199

[3] https://www.nature.com/articles/s42256-020-00257-z

[4] https://arxiv.org/abs/1808.03305

[5] https://aclanthology.org/P19-1334.pdf

Which of course are very welcome. The only way to address the limitations of current approaches is to become aware of them and try to understand them. Unfortunately, it is only a tiny, tiny minority in the field who does that, while the majority is happy to spout nonsense like "[a billion plus] few shot" learning etc.

[6] https://www.semanticscholar.org/paper/Speech-and-language-pr...

[7] https://www.semanticscholar.org/paper/Artificial-intelligenc...

[8] Arthur Samuel's checkers player beat a human champion in 1961. Donald Michie built MENACE, a reinforcement learning algorithm implemented in matchboxes, in 1960:

https://rodneybrooks.com/forai-machine-learning-explained/

Ctrl+F for "Machine learning started with games" to find the relevant section.

So much of what you write is blatantly false, or stretched to the point of being as good as false.

But you seem well read, so I don't think there is a point in trying to convince you.

It's also interesting coming from someone researching Inductive Logic Programming that provides no results at all.

From my perspective, I don't care for theoretical advances. I care about results and if scale provides results, I'm happy.

For people who are just reading this exchange take a look here:

https://pbs.twimg.com/media/FHHPjU0VIAUrSMA?format=jpg&name=...

It's 16 computer-generated images, each from the description below it. Having a capability like that is just very useful and fun.

I also wish you luck in finding a "cat playing checkers in the style of Salvador Dali" or a "psychedelic painting of a hamster dragon" on the internet.

There is a lot of stuff on the web, but not everything.

>> From my perspective, I don't care for theoretical advances. I care about results and if scale provides results, I'm happy.

Like I say, you can sort ever larger lists with ever bigger computers and bubblesort. But you can go a lot further if you use your brain and come up with a better sorting algorithm.

But I think you would be really surprised to find out how much humans have achieved by using their brains rather than ever bigger computers. Here's a small example that happened to be on my mind recently:

https://en.wikipedia.org/wiki/Copernican_Revolution

That is an example of the difference between expensive toys like Dall-E and world-changing scientific work. Though I appreciate that the difference can sometimes appear somewhat muddled, and a machine that can generate a "hamster dragon" may seem like an important scientific breakthrough. Depends on what you're more interested in: hamster dragons, or understanding how the world works. Oh well.

As to this:

>> It's also interesting coming from someone researching Inductive Logic Programming that provides no results at all.

That's an obvious (and odious) attempt at trolling me, but unlike yours, my HN profile is linked to my real-world identity, so I am forced to respond with politeness and to represent my field with dignity. Therefore, I will point out a famous achievement of ILP with which you are sadly unfamiliar, and which I expect you to immediately discount on the grounds that, since you didn't know about it, it can't have been all that important. I'm linking to an article in the popular press, which will be more accessible; it doesn't go into depth on the machine learning approach used, but it is ILP. I link to the scholarly article immediately after:

Robot Makes Scientific Discovery All by Itself

For the first time, a robotic system has made a novel scientific discovery with virtually no human intellectual input. Scientists designed “Adam” to carry out the entire scientific process on its own: formulating hypotheses, designing and running experiments, analyzing data, and deciding which experiments to run next.

https://www.wired.com/2009/04/robotscientist/

And the scholarly article:

Functional genomic hypothesis generation and experimentation by a robot scientist

https://www.nature.com/articles/nature02236

>> That is an example of the difference between expensive toys like Dall-E and world-changing scientific work.

There isn't really that much difference. DeepMind took some three years to max out the CASP protein-folding benchmark, running since 1994, on a problem known since ~1960. Protein folding is exactly the type of world-changing scientific work. But it is just the tip of the iceberg: DL is being successfully applied to PDE solving, electron density prediction, and more.

The psychedelic hamster dragon is an example of the fact that DL is capable of zero-shot generalization. That is a significant scientific observation in itself.

And with regard to your example: we are discussing whether deep learning has stalled. I'm giving you mostly examples from last year, like EfficientZero (30 Oct), a ~500x improvement in sample efficiency over the 2013 DQN, and some literally not older than a month: GLIDE (20 Dec), Player of Games (6 Dec).

You are giving me an example from 2004 (2009?).

I'm not trying to troll you. I just think you are biased in your assessment of significant results and what it means to be stalled.

>> Protein folding is exactly the type of world-changing scientific work.

Nope. Not at all. That's a misunderstanding of the goal of science, which is to explain how the world works. AlphaFold, like all neural network models, is a predictive model with no explanatory power. It can predict the structure of proteins from sequences, but it cannot explain why, or how, proteins fold. Scientists still cannot explain how proteins fold, and certainly not by interacting with AlphaFold.

Why is that a problem? You said you care about results. I linked to the Wikipedia article about the Copernican Revolution; I recommend you also read the Wikipedia article on epicycles [1]. To summarise: for hundreds of years, from Hipparchus of Nicaea, through Claudius Ptolemy, and all the way up to Copernicus, astronomers used systems of epicycles (circles-upon-circles) to account for the apparent movements of the planets. Epicycle-based models matched observations very well and predicted the movements of the planets very accurately, because it is always possible to fit a smooth curve with arbitrary precision given a sufficient number of epicycles (see the sketch below). However, even Copernicus' model, which had the sun at the centre of the universe, unlike many earlier models, could not explain why the planets moved the way they did. A true explanation only became available when Isaac Newton developed his law of universal gravitation.
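
To unpack that last claim: an epicycle model is just a sum of uniformly rotating circles, which is a complex Fourier series, so adding circles can drive the fit error as low as you like while saying nothing about why the planet moves as it does. A toy numpy sketch, with a made-up curve standing in for a planet's apparent path:

    import numpy as np

    N = 512
    t = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)
    # A made-up smooth closed curve in the plane, standing in for an apparent
    # planetary path (points encoded as complex numbers x + iy).
    path = np.exp(0.3 * np.cos(3.0 * t)) * np.exp(1j * t)

    coeffs = np.fft.fft(path) / N           # one complex coefficient per circle
    freqs = np.fft.fftfreq(N, d=1.0 / N)    # the (signed, integer) rotation speeds

    def epicycle_fit(num_circles):
        """Rebuild the path from only the num_circles largest rotating circles."""
        keep = np.argsort(-np.abs(coeffs))[:num_circles]
        approx = np.zeros(N, dtype=complex)
        for k in keep:
            approx += coeffs[k] * np.exp(1j * freqs[k] * t)
        return approx

    for m in (1, 2, 4, 8, 16):
        err = np.max(np.abs(path - epicycle_fit(m)))
        print(f"{m:2d} circles: max fit error {err:.6f}")  # shrinks towards zero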

Newton's theory (itself later superseded by Einstein's general relativity) provided a true explanation of observable phenomena, and with it a way to calculate better models and obtain more accurate predictions than were ever possible before. That is the power of explaining and understanding the world; simply modelling an incomprehensible set of observations, which is all that neural networks can do, doesn't even come close.

>> I'm not trying to troll you. I just think you are biased in your assessment of significant results and what it means to be stalled.

ILP is a tiny field, with maybe a few dozen people who consistently publish on the subject. By comparison, many tens of thousands of papers are published on deep learning every year, the deep learning conferences receive many thousands of submissions, and research is funded with many millions of dollars by governments, militaries, and the largest of large technology corporations, the Googles and Facebooks etc.; and training state-of-the-art models requires many terabytes of data and many petaflops of computing power, to the extent that training such models is now only feasible for the aforementioned large corporations. Normally, I'd complain that you are comparing apples to oranges: how can a few people with meagre resources be expected to outdo DeepMind and OpenAI?

Yet the algorithms that I study run on a student laptop, are trained in a weakly supervised manner, take seconds to train, and generalise robustly from single examples without any sort of pre-training whatsoever (for instance, the approach I study, Meta-Interpretive Learning, comfortably and routinely learns recursive programs with arbitrary structure from a single example, without an example of the "base case"). These are capabilities unheard of in neural networks, which must be trained on hundreds of millions of examples [2] of actual programs (rather than examples of inputs and outputs, as in ILP) in order to generate programs, and which cannot solve programming tasks, or generate solutions, that are unlike the example tasks and solutions in their training sets; whereas in ILP every programming task solved is, by definition, unseen (hence "weakly supervised": the examples are of the programming tasks, without examples of their solutions).
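
For a concrete picture of that task format, here is a toy sketch: one input/output pair in, a recursive program out. To be clear, this is naive generate-and-test over a made-up three-slot DSL, not Meta-Interpretive Learning (MIL learns Prolog programs using second-order metarules); it only illustrates what a single training example looks like and what kind of program is induced from it.

    from itertools import product

    # A candidate program is three choices (base_test, base, step) plugged into
    # a fixed recursion scheme:  f(xs) = base(xs) if base_test(xs) else f(step(xs))
    BASE_TESTS = {"singleton": lambda xs: len(xs) == 1}
    BASES = {"head": lambda xs: xs[0],
             "itself": lambda xs: xs}
    STEPS = {"tail": lambda xs: xs[1:],
             "init": lambda xs: xs[:-1]}

    def make_program(base_test, base, step):
        def f(xs):
            return base(xs) if base_test(xs) else f(step(xs))
        return f

    def induce(example_in, example_out):
        """Return the first candidate program consistent with the single example."""
        for (tn, t), (bn, b), (sn, s) in product(
                BASE_TESTS.items(), BASES.items(), STEPS.items()):
            candidate = make_program(t, b, s)
            try:
                if candidate(example_in) == example_out:
                    return (tn, bn, sn), candidate
            except (IndexError, RecursionError):  # guard against bad candidates
                continue
        return None

    # The entire training set: ONE input/output pair, and no programs in it.
    names, last = induce(["a", "b", "c", "d"], "d")
    print(names)                  # ('singleton', 'head', 'tail'): a recursive "last"
    print(last([1, 2, 3, 4, 5]))  # generalises to unseen inputs: prints 5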

That is what it means for a field to be stalled: when a single graduate student with a seven-year-old laptop can solve problems that the dominant approach cannot even attempt (in this case, learning from a single example, without pre-training and with weak supervision).

____________

[1] https://en.wikipedia.org/wiki/Deferent_and_epicycle

[2] https://arxiv.org/abs/1703.07469v1

See Section 4.4, Hyperparameters and Training: "Each minibatch was re-sampled so the model saw 256 million random programs and 1024 million random I/O examples during training."

I'm not a biochemist. I don't know how important knowing how proteins fold is, as opposed to knowing the final structure. My understanding is that the current, expensive method of X-ray crystallography, in the range of $100k per protein, also does not explain how proteins fold; however, the structure itself appears to be useful.

Many complex problems do not appear to be explainable in the way planetary motions are explainable. The fact that many aspects of the universe can be explained by simple laws is rather extraordinary in itself. In reality, there may just be a lot of irreducible complexity. Maybe the success of deep learning teaches us that certain things do not fit into the framework of science that you assume to be the only correct one. Reality does not care about what we would want to be true.

Maybe the invention of AlphaFold will lead to a better understanding of protein folding, the same way the invention of the steam engine led to the laws of thermodynamics. Historically, science often lags behind the progress of technology. Time will judge.

I also thought that the lack of resources would be your answer. In 2004-2009, neural networks were in quite a similar position to ILP. However, NNs showed significant potential, as judged by the broad community at the time, and so the resources allocated to them started to grow exponentially. ILP, and symbolic methods more broadly, do not show similarly convincing potential to the broad community. I'm sure OpenAI, DeepMind, Google, Microsoft etc. would not hesitate to pour in the kind of resources they pour into many other projects, if they saw the potential. DeepMind, Microsoft and Google especially hire people with a very broad range of expertise.

What may be true is that the world is starting to reach the limits of the resources that can be dedicated to deep learning. Exponential growth in resource use cannot go on forever, unless deep learning starts feeding back into the available resources, and we are certainly not there yet.

> Yet the algorithms that I study run on a student laptop, are trained in a weakly supervised manner, take seconds to train, and generalise robustly from single examples without any sort of pre-training whatsoever

Show me that applied to: https://github.com/fchollet/ARC

>> Maybe the invention of AlphaFold will lead to a better understanding of protein folding, the same way the invention of the steam engine led to the laws of thermodynamics. Historically, science often lags behind the progress of technology. Time will judge.

It's not impossible, but I'm concerned that neural networks will only be used as an excuse not to have to understand anything anymore (not just in biology, but in all of the sciences). That would be a terrible outcome: a dumbing-down of science and an eventual loss of our ability to understand the world. As you say, time will be the judge of that.

>> Show me that applied to: https://github.com/fchollet/ARC

I had a look at ARC back when François Chollet's paper introducing it came out. I was interested because a) it's the kind of problem that ILP eats for breakfast and b) Chollet's name is well known, so it would be a chance to get our work noticed by people who normally wouldn't notice it.

However.

ARC is proposed as a benchmark that is hard for current big-data approaches and that needs elements of intelligent reasoning to solve (if I understand correctly, that's why you brought it up yourself?). Such benchmarks have been proposed before, in particular the Bongard problems [1] in machine vision and the Winograd schemas [2] in NLP. As with ARC, the "defence" of such datasets against dumb, no-reasoning, big-data approaches is that there are few examples of each problem task. And yet, on both Bongard problems and Winograd schemas, neural nets have now achieved high accuracy. How did they do it? They did it by cheating: instead of training on the original, and few, Bongard problems, for example, people created generators for similar problems that could produce many thousands of them [3]. That way, the neural-nets folks had their big data and could train their big networks. Same for the Winograd schemas.

... And the same thing has already happened, at a preliminary stage, with ARC:

https://arxiv.org/abs/2011.09860

In the paper I link above, the authors use a data augmentation technique that consists of rotations and colour transformations of the problem tasks. It should be obvious that this adds no useful information that could help any attempt at solving ARC problems by reasoning, and that it only serves to help a neural net better overfit to the training tasks. And yet the system in the paper achieves good results on a small selection of ARC tasks. Admittedly, that is a very small selection (only 10x10 grids, and not any of the held-out test set that only François Chollet has access to), but the important point is that it is possible to make progress on ARC without actually showing any reasoning ability, despite the claims of the Kolev et al. paper above, and despite Chollet's intent for ARC to avoid exactly that.
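
For concreteness, here is roughly what that kind of augmentation amounts to; a sketch under my own assumptions about the transformations (the paper's pipeline is more involved), using numpy and ARC's ten-colour palette:

    import random
    import numpy as np

    NUM_COLOURS = 10  # ARC grids use a palette of 10 colours (0-9)

    def augmentations(grid):
        """Yield rotated and colour-permuted copies of a single ARC grid."""
        grid = np.asarray(grid)
        for k in range(4):                    # rotations by 0, 90, 180, 270 degrees
            rotated = np.rot90(grid, k)
            perm = list(range(NUM_COLOURS))
            random.shuffle(perm)              # a random relabelling of the colours
            lut = np.array(perm)
            yield lut[rotated]                # recolour every cell through the permutation

    # Each "new" training grid is a trivial symmetry of the original:
    # nothing reasoning-relevant is added, the dataset just gets bigger.
    task_grid = [[0, 1, 0],
                 [1, 2, 1],
                 [0, 1, 0]]

    for g in augmentations(task_grid):
        print(g, end="\n\n")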

This is pretty standard in deep learning, and it is the reason I gripe about benchmarks in my earlier comment. Deep learning systems consistently beat benchmarks designed to test abilities that deep learning systems don't have. They do this by cheating, enabled by weaknesses in said benchmarks: they overfit to surface statistical regularities of the data (see the papers cited earlier), thus learning "shortcuts" around the intended difficulties of the benchmarks. The trained systems are useless on any other dataset and their performance degrades precipitously once they are exposed to real-world data, but that doesn't matter, because a team that trains such a system to a new SOTA result gets to publish a paper and claim its academic brownie points, and it's very difficult to argue with its "results" without beating the benchmarks yourself, because of the publishing climate described in Hinton's quote above.

Of course, if you want to beat the benchmarks yourself, you have two options. One is to also cheat: throw a bunch of data at the problem without attempting to understand it or really solve it. The other is to do it the "right" way and try to design a system that really demonstrates the abilities the benchmark is supposed to measure. That can sometimes be done, but it takes a lot of time and effort. In the case of ARC, it means learning the core priors.

If you've read the Chollet paper that introduced ARC, there's a section on "core priors" that the author says a system must possess before it can solve the ARC tasks. Those core priors must be learned (or, worse, coded in by hand). So anyone who decides to solve ARC the way its creator intended will need to spend a great deal of time teaching their system core priors before they can even start on the actual problems. Meanwhile, some big, 30-person team at a large tech corp with a few million dollars' budget will be throwing a gigaflop of compute and a few terabytes of data at the problem and "solving" it without having to learn any "core priors", by finding some way around the dearth of training data, just as was done with the Bongard problems and the Winograd schemas. Then anyone who has published earlier results obtained "the hard way" will be left looking like a loser. Why would I want to subject myself to such a humiliation and drag the reputation of my field through the mud?

So, no, I'm not "showing" you anything. Sorry. If you're interested in the results of ILP and MIL systems, or my own work, check out the ILP and MIL literature. Ask me if you want and I can give you some pointers.

__________________

[1] https://en.wikipedia.org/wiki/Bongard_problem

[2] https://en.wikipedia.org/wiki/Winograd_schema_challenge

[3] https://link.springer.com/chapter/10.1007%2F978-3-319-44781-...

See Section 2.1 "The Dataset": "For each class of each problem we generate 20000 training images. We also generate an additional 10000 images per class per problem as a testing set."

P.S. This paper compares GPT-2 and GPT-3 to Louise (the system I created for my PhD), and also to humans and to Magic Haskeller (an Inductive Functional Programming system that learns programs in Haskell), on some programming tasks in P3 (a Turing-complete language with only 7 instructions, kind of a precursor of Brainfuck). I'm not affiliated with the authors:

https://proceedings.neurips.cc/paper/2021/hash/0cd6a652ed1f7...

Louise does alright, outperforming all three other systems, and humans, on the "very high complexity" tasks, but GPT-3 outperforms it on the simpler tasks. In truth, Louise can do better than that, especially with a newer version, not available to the authors, that has improved one-shot learning capabilities. On the other hand, the comparison is unequal: I was able to find the programming tasks and their solutions in a GitHub repository predating the publication of GPT-3, so GPT-3 most likely ingested and memorised them, and it probably also benefited from the many examples of similar programming tasks in Brainfuck that can be found online. Louise, as usual, learned in true one-shot fashion, without having seen examples of the tasks or their solutions before.