What's missing from these models is everything related to visual or spatial information (that is not encoded in text). I assume there will eventually be something like ChatGPT/InstructGPT where part of the input data is images and/or videos, with and without captions. So it would have a way of connecting the language to the spatial (and temporal).

It seems like they may need a more efficient approach, though, to handle the massive amount of video data. Maybe the 'MrsFormer' multi-resolution idea could help.
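To give a rough feel for what "multi-resolution" buys you, here's a minimal toy sketch (not MrsFormer itself, just the generic idea, with made-up strides): attend against average-pooled keys/values at coarser resolutions, so a long run of video tokens gets cheaper the coarser the level.

```python
import torch
import torch.nn.functional as F

def multires_attention(q, k, v, strides=(1, 4, 16)):
    """Toy multi-resolution attention: attend against pooled (coarser)
    keys/values and average the results. Illustrative only."""
    outputs = []
    for s in strides:
        if s == 1:
            ks, vs = k, v
        else:
            # Pool along the sequence dimension: (batch, seq, dim) -> (batch, seq // s, dim)
            ks = F.avg_pool1d(k.transpose(1, 2), s).transpose(1, 2)
            vs = F.avg_pool1d(v.transpose(1, 2), s).transpose(1, 2)
        scores = q @ ks.transpose(1, 2) / q.shape[-1] ** 0.5  # (batch, seq_q, seq_k // s)
        outputs.append(torch.softmax(scores, dim=-1) @ vs)
    return torch.stack(outputs).mean(dim=0)

# Example: 1,024 "video tokens"; the coarser levels only see 256 and 64 keys.
q = k = v = torch.randn(2, 1024, 64)
out = multires_attention(q, k, v)  # shape (2, 1024, 64)
```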

Another thing that could be very useful for coding, without requiring visual information, would be a whole other subsystem where the model could actually compile/run the code iteratively and see the output.
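Something like this minimal loop, where `generate_code` is a stand-in for whatever model produces the code (everything here is hypothetical, just to show the shape of the feedback cycle):

```python
import subprocess
import tempfile

def run_and_report(code: str, timeout: int = 10) -> str:
    """Run a candidate Python snippet and return stdout+stderr as feedback."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(["python", path], capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout + result.stderr

def iterative_coding(prompt: str, generate_code, max_rounds: int = 3) -> str:
    """Generate code, execute it, and feed the output back until it stops crashing."""
    code, feedback = "", ""
    for _ in range(max_rounds):
        code = generate_code(prompt, previous_code=code, feedback=feedback)
        feedback = run_and_report(code)
        if "Traceback" not in feedback:
            break
    return code
```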

I don't think transformers are the last invention in AI, but they certainly seem capable of getting to general-purpose AI for many contexts. They and related techniques are not going to create something like a digital autonomous person though, which I think is a good thing.

> What's missing from these models is everything related to visual or spatial information (that is not encoded in text). I assume there will eventually be something like ChatGPT/InstructGPT where part of the input data is images and/or videos, with and without captions. So it would have a way of connecting the language to the spatial (and temporal).

Is that missing, though? Dall-E is the next tab over. You can go image-to-CLIP, and image-to-image is not just visual; it involves that language-to-spatial-and-visual step:

"CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3."

https://github.com/openai/CLIP
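For a concrete sense of that language-to-image connection, the zero-shot usage from that repo goes roughly like this (the image path and candidate labels are just placeholders):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Encode one image and a few candidate text labels into the same space.
image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

# The highest probability should land on the caption that best matches the image.
print("Label probs:", probs)
```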

Folks who've been playing with Dall-E and getting coherent images seem to be adept at prompting GPT-3 and getting coherent answers.