Everyone pointing out how LLMs fail at some relatively simple tasks is fundamentally misunderstanding the utility of LLMs.

Don't think of an LLM as a full "computer" or "brain". Think of it like a CPU. Your CPU can't run whole programs, it runs single instructions. The rest of the computer built around the CPU gives it the ability to run programs.

Think of the LLM like a neural CPU whose instructions are relatively simple English commands. Wrap the LLM in a script that executes commands in a recursive fashion.

Yes, you can get the LLM to do complicated things in a single pass; that's a testament to the sheer size and massive training set of GPT3 and its ilk. But even with GPT3 you will have more success with wrapper programs structured like:

    premise = gpt3("write an award-winning movie premise")
    for _ in range(5):
        critique = gpt3("write a critique of the premise", premise)
        premise = gpt3("rewrite the premise taking into account the critique", premise, critique)
    print(premise)

This program breaks down the task of writing a good premise into a cycle of writing/critique/rewriting. You will get better premises this way than if you just expect the model to output one on the first go.
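
To make the sketch above concrete, here is one way the gpt3() helper itself could look. This is a minimal sketch assuming the pre-1.0 openai Python client and the old Completions endpoint; the prompt layout and parameters are arbitrary choices, not anything canonical:

    # Minimal sketch of the hypothetical gpt3() helper used above, assuming
    # the pre-1.0 openai Python client and the Completions endpoint.
    import os
    import openai

    openai.api_key = os.environ["OPENAI_API_KEY"]

    def gpt3(instruction, *context):
        # Put any earlier outputs first, then the instruction, so the command
        # sits right before the completion.
        prompt = "\n\n".join(context + (instruction,))
        response = openai.Completion.create(
            model="text-davinci-003",
            prompt=prompt,
            max_tokens=512,
            temperature=0.7,
        )
        return response.choices[0].text.strip()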

You can somewhat emulate a few layers of this without wrapper code by giving it a sequence of commands, like "Write a movie premise, then write a critique of the movie premise, then rewrite the premise taking into account the critique".

The model is just trained to take in some text and predict the next word (token, really, but same idea). Its training data is a large swath of the internet. Humans, when they write, have the advantage of thinking recursively offline before putting words down, and they often edit and rewrite before posting. GPT's training process can't see any of that out-of-text process.

This is why it's not great at logical reasoning problems without careful prompting. Humans tend to write text in the format "conclusion first, then supporting arguments". So GPT, being trained on human writing, is trained to emit a conclusion first. But humans don't think this way, they just write this way, and GPT doesn't have the advantage of offline thinking. So it often states bullshit conclusions first, then conjures up supporting arguments for them.

GPT's output is like asking a human to start writing without the ability to press the backspace key. Due to its architecture and training, it doesn't even have a concept that such a process exists.

To extract the best results, you have to bolt on this "recursive thinking process" manually. For simple problems, you can do this without a wrapper script, with just careful prompting. E.g. for math/logic problems, tell it to solve the problem and show its work along the way. It will do better, since this forces it to "think through" the problem rather than just stating a conclusion first.
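
As an illustration, reusing the hypothetical gpt3() helper sketched above, the same question can be asked both ways; only the second prompt forces the reasoning to appear in the text before the conclusion, which is all the model can condition on:

    question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 "
                "more than the ball. How much does the ball cost?")

    # Conclusion-first: the model tends to blurt out an answer, then justify it.
    answer_direct = gpt3(question)

    # Reasoning-first: make it show its work before committing to an answer.
    answer_stepwise = gpt3(question + " Work through the problem step by step, "
                           "showing your reasoning, and only state the final "
                           "answer at the end.")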

This makes me wonder if GPT could be any good at defining its own control flow. E.g. asking it to write a Python script that uses control structures along with calls to GPT to synthesize coherent content. Maybe it could give itself a kind of working memory.
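
Purely as a thought experiment, and again using the hypothetical gpt3() helper from above, that might look something like this (executing model-generated code blindly is unsafe, so this is only to make the idea concrete):

    # Ask the model to write its own wrapper program. The prompt and the
    # gpt3() helper are hypothetical; inspect the output, never exec() it blindly.
    controller = gpt3(
        "Write a Python script that drafts a short story using a function "
        "gpt3(instruction, *context): first outline the story, then write each "
        "section with the outline as context, then revise each section for "
        "consistency. Output only the code."
    )
    print(controller)  # review by hand before running anything it produces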

Libraries such as https://github.com/hwchase17/langchain allow for easy programmatic pipelines of GPT "programs". So you could imagine taking a few hundred of these programs written by humans for various tasks, as are sure to come into existence in the next year or two, adding them to the training data, and training a new GPT that knows how to write programs that call GPT itself.
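
For example, the write/critique/rewrite loop above might look roughly like this in langchain. This is a sketch assuming an early version of the library (LLMChain, PromptTemplate, and its OpenAI completion wrapper); exact imports and class names may differ between releases:

    # Rough sketch of the critique/rewrite loop using an early langchain API
    # (LLMChain + PromptTemplate); exact names and imports may differ by version.
    from langchain.llms import OpenAI
    from langchain.prompts import PromptTemplate
    from langchain.chains import LLMChain

    llm = OpenAI(temperature=0.7)

    critique_chain = LLMChain(
        llm=llm,
        prompt=PromptTemplate(
            input_variables=["premise"],
            template="Write a critique of this movie premise:\n\n{premise}",
        ),
    )
    rewrite_chain = LLMChain(
        llm=llm,
        prompt=PromptTemplate(
            input_variables=["premise", "critique"],
            template=("Premise:\n{premise}\n\nCritique:\n{critique}\n\n"
                      "Rewrite the premise taking the critique into account."),
        ),
    )

    premise = llm("Write an award-winning movie premise.")
    for _ in range(5):
        critique = critique_chain.run(premise=premise)
        premise = rewrite_chain.run(premise=premise, critique=critique)
    print(premise)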