Offtopic sort of, but does anyone know if folks are working on combining vision and natural language in one model? I think that could wield some interesting results.
https://github.com/openai/CLIP
The results are quite interesting:
https://www.reddit.com/r/Art/comments/p866wv/deep_dive_meai_...