What does HackerNews think of open_clip?

An open source implementation of CLIP.

Language: Jupyter Notebook

#33 in Deep learning
I'm running https://replicate.com/pharmapsychotic/clip-interrogator with cfg.apply_low_vram_defaults() and interrogate_fast().

I tried lighter models like ViT-B/32 trained on LAION-400M, among others, but they are all very slow to load and run (model list: https://github.com/mlfoundations/open_clip).

I'm desperately looking for something more modest and light.
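For reference, a minimal sketch of the setup described above, using the clip-interrogator package's Config/Interrogator API (the model name and image path here are illustrative):

```python
from PIL import Image
from clip_interrogator import Config, Interrogator

cfg = Config(clip_model_name="ViT-L-14/openai")  # illustrative model choice
cfg.apply_low_vram_defaults()  # trades quality for a smaller memory footprint

ci = Interrogator(cfg)
image = Image.open("photo.jpg").convert("RGB")  # illustrative input
print(ci.interrogate_fast(image))  # fast mode: skips the slower prompt search
```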

Have you looked at LAION-400M? The OpenCLIP [1] team has replicated CLIP's performance by training on LAION-400M.

[1] https://github.com/mlfoundations/open_clip
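For a sense of what using those replicated weights looks like, here is a short sketch with the open_clip_torch package ("laion400m_e32" is one of the published pretrained tags for ViT-B-32; the image path and labels are illustrative):

```python
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("cat.jpg")).unsqueeze(0)  # illustrative input
text = tokenizer(["a photo of a cat", "a photo of a dog"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so the dot product is cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
```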

> Writing a training loop for CLIP manually wound up with me banging against all sorts of strange roadblocks and missing bits of documentation, and I still don't have it working.

There is working training code for OpenCLIP: https://github.com/mlfoundations/open_clip
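For anyone hitting the same roadblocks, the core of the objective is a symmetric contrastive loss over a batch of paired image/text embeddings. A minimal sketch (illustrative, not the actual open_clip training loop):

```python
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, logit_scale):
    # Normalize embeddings so similarity is cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity logits; matching pairs sit on the diagonal.
    logits = logit_scale * image_features @ text_features.t()
    labels = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```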

But training multi-modal text-to-image models is still a _very_ new thing in the software world. Given that, my experience has been that it has never been easier to get started on this stuff from the software point of view. The hardware is the tricky bit (along with avoiding bandwidth bottlenecks on distributed systems).

That isn't to say there isn't code out there for training. It's just that you're going to run into issues, and learning how to solve them as you encounter them is going to be a highly valuable skill soon.

edit:

I'm seeing in a sibling comment that you're hoping to train your own model from scratch on a single GPU. Currently, at least, scaling laws for transformers [0] mean that the only models that perform well at all are ones with a lot of parameters. The bigger, the better, as far as we can tell.

Very simply: researchers start by making a model big enough to fill a single GPU. Then they replicate the model across hundreds or thousands of GPUs, but feed each one a different shard of the data. Model updates are then synchronized, ideally with some pipelining to avoid bottlenecks. This is referred to as data-parallel training; a schematic version follows below.
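A schematic sketch of that pattern using PyTorch's DistributedDataParallel, launched with one process per GPU (the model's forward returning a loss is a stand-in; a single node is assumed, so the rank doubles as the GPU index):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def train(model, dataset, epochs=1):
    dist.init_process_group("nccl")  # one process per GPU
    rank = dist.get_rank()
    model = DDP(model.cuda(rank), device_ids=[rank])

    # Each rank draws a disjoint shard of the dataset.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for images, texts in loader:
            loss = model(images.cuda(rank), texts.cuda(rank))  # stand-in forward
            optimizer.zero_grad()
            loss.backward()  # DDP all-reduces gradients across ranks here
            optimizer.step()
```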

[0] https://www.lesswrong.com/tag/scaling-laws

What sort of compute do you have access to? There's a lot of cool stuff you could do, depending on whether you have decent GPUs and how much time you're allowed to experiment on them. Experimentation is fairly fundamental in practice.

There are a lot of cool pretraining tasks in vision/multimodal. These are largely techniques introduced or refined by OpenAI, re-implemented as open-source PyTorch codebases with varying degrees of success:

- Finetune your own CLIP https://github.com/mlfoundations/open_clip (a sketch follows after this list)

- Train a (much smaller) DALLE https://github.com/lucidrains/DALLE-pytorch

- Train your own guided diffusion https://colab.research.google.com/drive/1javQRTkALBWLFWnx1K4... (pretty tough, may only be feasible on domain-specific data)

- Train a variational autoencoder (VAE), for example:

  - "VQGAN" from Heidelberg https://github.com/CompVis/taming-transformers

  - "Discrete VAE", used as the backbone for OpenAI's DALL-E, reimplemented here (and other places) https://github.com/lucidrains/DALLE-pytorch

  - "VQVAE2" https://github.com/tgisaturday/dalle-lightning