What does HackerNews think of CLIP?

Contrastive Language-Image Pretraining

Language: Jupyter Notebook

> I guess the image input is just too expensive to run or it's actually not as great as they hyped it.

We already know they have a SOTA model that can turn images into latent-space vectors without being some insane resource hog - in fact, they give it away to competitors like Stability. [0] (A sketch of that image-to-vector step follows the link below.)

My guess is that a limited set of people are using the GPT-4-with-CLIP hybrid, but those use cases mostly involve deciphering pictures of text (which it would be very bad at), so they're working on that (or on other use-case problems).

[0] https://github.com/openai/CLIP
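
A minimal sketch of that image-to-latent-vector step, assuming the clip package from the repo above is installed; the checkpoint name and "photo.jpg" are placeholders:

    import torch
    import clip
    from PIL import Image

    # Assumes: pip install git+https://github.com/openai/CLIP.git
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # "photo.jpg" is a placeholder path
    image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

    with torch.no_grad():
        image_features = model.encode_image(image)  # shape (1, 512) for ViT-B/32

    # Normalize once so vectors can be compared with plain dot products / cosine similarity
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)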

> What's missing from these models is everything related to visual or spatial information (that is not encoded in text). I assume that there will eventually be something like ChatGPT/InstructGPT where part of the input data is images and/or videos, with and without captions. So it would have a way of connecting the language to the spatial (and temporal).

Is that missing, though? DALL-E is the next tab over. You can go from an image to a CLIP embedding, and image-to-image is not just visual; it involves that language-to-spatial (and visual) step:

"CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3."

https://github.com/openai/CLIP
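
That zero-shot behavior is concrete in the repo's usage example; a lightly adapted sketch (the image path and candidate captions are placeholders):

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
    text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

    with torch.no_grad():
        # Scaled cosine similarities between the image and each candidate caption
        logits_per_image, logits_per_text = model(image, text)
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()

    # The highest-probability caption is the "most relevant text snippet" for the image
    print("Caption probs:", probs)

No task-specific training happens here; swapping in different captions is all it takes to re-target the "classifier".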

Folks who've been playing with DALL-E and getting coherent images seem to be adept at prompting GPT-3 and getting coherent answers.

OpenAI announced and released CLIP on GitHub on January 5, 2021: https://github.com/openai/CLIP

You need CLIP to have CLIP-guided diffusion. So the current situation seems to trace back to OpenAI and the MIT-licensed code they released the day DALL-E was announced. I would love to be corrected if I've misunderstood the situation.
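
For context on what the guidance actually is, in broad strokes: the sampler nudges each intermediate image toward a text prompt using the gradient of CLIP's image-text similarity. A stripped-down sketch of just that signal (no diffusion model here; the prompt and step size are made up):

    import torch
    import clip

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)
    model = model.float()  # keep everything in fp32 so the gradient below is straightforward
    for p in model.parameters():
        p.requires_grad_(False)  # we only want the gradient w.r.t. the image pixels

    # Stand-in for a partially denoised sample: a 224x224 RGB tensor we can differentiate through
    image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
    text = clip.tokenize(["a watercolor painting of a fox"]).to(device)

    similarity = torch.cosine_similarity(model.encode_image(image), model.encode_text(text)).sum()
    similarity.backward()

    # image.grad points toward "looks more like the prompt"; a real sampler folds a step
    # like this into every denoising iteration
    guided = (image + 0.1 * image.grad).detach()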

Thanks. The next step would be combining it with text-image foundation models such as CLIP (https://github.com/openai/CLIP), so that the model no longer depends on a limited set of predefined labels (COCO…), right?
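
As a small illustration of the "no predefined label set" point: any strings can act as the label set, since they just get embedded by CLIP's text encoder. The labels and "crop.jpg" below are made-up placeholders (e.g. a detector's region crop):

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # Any strings work as "classes" - no fixed COCO-style vocabulary required
    labels = ["a red fire hydrant", "a park bench", "an excavator", "a traffic cone"]
    text = clip.tokenize([f"a photo of {label}" for label in labels]).to(device)
    image = preprocess(Image.open("crop.jpg")).unsqueeze(0).to(device)

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)
        scores = (image_features @ text_features.T).squeeze(0)  # cosine similarity per label

    best = scores.argmax().item()
    print(labels[best], scores[best].item())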

Also occlusion inference would be fantastic, so that we can select between the visible parts of the object and the whole shape (behind trees etc).

Exciting decade.

I agree that having a good vector to start with is important. However, this is not very hard to make work: you only need to fine-tune some of the CLIP models [1] to get it running well.

Disclosure: I have built a vector search engine as a proof of this idea [2] (a minimal sketch of the embedding-search step follows the links below).

[1]: https://github.com/openai/CLIP

[2]: https://unisearch.cc/search?q=how+to+reset+mac+ssd
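
A minimal sketch of that embedding-search idea: index a handful of images with CLIP, embed the text query, and rank by cosine similarity. The file names and query are placeholders, and a real engine would typically fine-tune the encoders (as the parent suggests) and use an approximate-nearest-neighbor index rather than brute force:

    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # Placeholder corpus of images to index
    paths = ["cat.jpg", "beach.jpg", "laptop.jpg"]
    images = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)

    with torch.no_grad():
        index = model.encode_image(images)
        index = index / index.norm(dim=-1, keepdim=True)  # normalize once at indexing time

        query = clip.tokenize(["a photo of a laptop on a desk"]).to(device)
        q = model.encode_text(query)
        q = q / q.norm(dim=-1, keepdim=True)

    scores = (q @ index.T).squeeze(0)  # cosine similarity of the query against every indexed image
    for i in scores.argsort(descending=True).tolist():
        print(paths[i], round(scores[i].item(), 3))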