What does HackerNews think of CLIP?
Contrastive Language-Image Pre-Training
We already know they have a SOTA model that can turn images into latent-space vectors without being some insane resource hog; in fact, they give it away to competitors like Stability. [0]
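For context, getting an image into that latent space takes only a few lines with the code they released. A minimal sketch using the openai/CLIP package (the "ViT-B/32" checkpoint and the filename are illustrative choices, not anything the commenter specified):

```python
# Minimal sketch: embed an image into CLIP's latent space.
# Assumes: pip install git+https://github.com/openai/CLIP.git
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # checkpoint choice is illustrative

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)  # placeholder file
with torch.no_grad():
    embedding = model.encode_image(image)  # shape (1, 512) for ViT-B/32
embedding = embedding / embedding.norm(dim=-1, keepdim=True)  # unit-normalize for cosine similarity
```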
My guess is that a limited set of people are using the GPT-4 + CLIP hybrid, and those use cases are mostly trying to decipher pictures of text (which it would be very bad at), so they're working on that (or on other use-case problems).
Is that missing, though? DALL-E is the next tab over. You can go image to CLIP, and image-to-image is not just visual; it involves that language-to-spatial-and-visual step:
"CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3."
https://github.com/openai/CLIP
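The repo's README demonstrates exactly that zero-shot use: score an image against candidate text snippets and pick the best match. A condensed sketch along those lines (the candidate labels and filename are illustrative):

```python
# Zero-shot prediction: score an image against candidate captions, per the openai/CLIP README.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)  # candidate snippets

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # highest probability marks the most relevant snippet
```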
Folks who've been playing with Dall-E and getting coherent images seem to be adept at prompting GPT-3 and getting coherent answers.
You need CLIP to have CLIP-guided diffusion. So the current situation seems to trace back to OpenAI and the MIT-licensed code they released the day DALL-E was announced. I would love to be corrected if I've misunderstood the situation.
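To make "CLIP-guided" concrete: the guidance signal is the gradient of CLIP's image-text similarity with respect to the image, which a sampler adds at each denoising step. A stripped-down sketch of that single step, assuming the openai/CLIP package (the prompt and the random tensor are placeholders standing in for a real diffusion loop):

```python
# Sketch of the core CLIP-guidance step: gradient of image-text similarity w.r.t. the image.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

text = clip.tokenize(["a watercolor painting of a fox"]).to(device)  # prompt is illustrative
with torch.no_grad():
    text_feat = model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# x stands in for the sampler's current image estimate; a real loop would also
# apply CLIP's input normalization before encoding.
x = torch.randn(1, 3, 224, 224, device=device, requires_grad=True)

img_feat = model.encode_image(x)
img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
similarity = (img_feat * text_feat).sum()

grad = torch.autograd.grad(similarity, x)[0]
# A CLIP-guided sampler nudges x along `grad` at every denoising step,
# which is why CLIP's weights are a hard prerequisite.
```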
Also, occlusion inference would be fantastic, so that we can select between the visible parts of an object and its whole shape (behind trees, etc.).
Exciting decade.
Disclosure: I have built a vector search engine to prove this idea. [2]
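For what it's worth, the core of a CLIP-backed vector search engine is small: embed the corpus and the query with CLIP, unit-normalize, and rank by cosine similarity. A toy NumPy sketch under those assumptions (the file names and query are made up; a production engine like [2] would use a proper ANN index):

```python
# Toy CLIP-powered vector search: a text query against pre-computed image embeddings.
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

paths = ["cat.jpg", "beach.jpg", "diagram.png"]  # illustrative corpus
with torch.no_grad():
    imgs = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
    index = model.encode_image(imgs).float().cpu().numpy()
index /= np.linalg.norm(index, axis=1, keepdims=True)  # unit vectors: dot product = cosine

def search(query: str, k: int = 2):
    with torch.no_grad():
        q = model.encode_text(clip.tokenize([query]).to(device)).float().cpu().numpy()[0]
    q /= np.linalg.norm(q)
    scores = index @ q  # cosine similarity against every image
    top = np.argsort(-scores)[:k]
    return [(paths[i], float(scores[i])) for i in top]

print(search("a photo of a cat"))
```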