Related:

Turing Machines Are Recurrent Neural Networks (1996) - https://news.ycombinator.com/item?id=10930559 - Jan 2016 (12 comments)

As a side note, it's fascinating to read the comments on that thread about RNNs and Deep Learning. So much has changed in the last six years, and it feels strange to read dismissive comments about the capabilities of these systems given what people are getting out of ChatGPT.

ChatGPT hasn't overcome any of the fundamental issues; it's just a huge improvement on the things the original GPTs were already good at. Being able to stay coherent over a trained-in context length that grows with larger models is different from the length-unlimited coherence that human beings can manage, spanning lifetimes of thought and multiple lifetimes of discourse.

In 2016 Transformers didn't exist, and the state of the art for neural-network-based NLP was LSTMs, which had an effective context limit of maybe 100 words.

With newer implementations like xformers [1] and FlashAttention [2], it is unclear where the practical length limit is for modern transformer models.

FlashAttention can currently scale up to 64,000 tokens on an A100.
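For intuition, the trick behind those longer contexts is computing attention in tiles with a streaming ("online") softmax, so the full seq_len x seq_len attention matrix is never materialized. Below is a minimal single-head PyTorch sketch of that idea; the block size and shapes are illustrative assumptions, and this is not the actual fused CUDA kernel from either library (which also tiles over queries and keeps everything on-chip):

    import torch

    def tiled_attention(q, k, v, block_size=1024):
        """Compute softmax(q @ k.T / sqrt(d)) @ v one key/value block at a time.

        q, k, v: [seq_len, head_dim] tensors for a single attention head.
        Keys/values are processed in blocks, so the full [seq_len, seq_len]
        score matrix is never materialized.
        """
        seq_len, head_dim = q.shape
        scale = head_dim ** -0.5

        out = torch.zeros_like(q)                           # running weighted sum of values
        row_max = torch.full((seq_len, 1), float("-inf"))   # running max logit per query
        row_sum = torch.zeros(seq_len, 1)                   # running softmax denominator

        for start in range(0, seq_len, block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]

            scores = (q @ k_blk.T) * scale                  # [seq_len, block] logits

            blk_max = scores.max(dim=-1, keepdim=True).values
            new_max = torch.maximum(row_max, blk_max)

            # Rescale what has been accumulated so far to the new max, then
            # fold in this block's contribution (the "online softmax" trick).
            correction = torch.exp(row_max - new_max)
            p = torch.exp(scores - new_max)

            row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
            out = out * correction + p @ v_blk
            row_max = new_max

        return out / row_sum

    if __name__ == "__main__":
        torch.manual_seed(0)
        q, k, v = (torch.randn(4096, 64) for _ in range(3))
        ref = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
        assert torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4)

The point is that peak memory grows with seq_len * block_size rather than seq_len squared, which is why sequence lengths like 64k become feasible once the kernel is also fused and IO-aware.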

[1] https://github.com/facebookresearch/xformers/blob/main/HOWTO...

[2] https://github.com/HazyResearch/flash-attention