If it's using a RAIL license, isn't it not open source?

Yeah, that's a fair critique. I think the short answer is: it depends who you ask.

See this FAQ here: https://www.licenses.ai/faq-2

Specifically:

Q: "Are OpenRAILs considered open source licenses according to the Open Source Definition?"

A: "NO. THESE ARE NOT OPEN SOURCE LICENSES, based on the definition used by Open Source Initiative, because it has some restrictions on the use of the licensed AI artifact.

That said, we consider OpenRAIL licenses to be “open”. OpenRAIL enables reuse, distribution, commercialization, and adaptation as long as the artifact is not being applied for use-cases that have been restricted.

Our main aim is not to evangelize what is open and what is not but rather to focus on the intersection between open and responsible licensing."

FWIW, there's a lot of active discussion in this space, and it could be the case that e.g. communities settle on releasing code under OSI-approved licenses and models/artifacts under lowercase "open" but use-restricted licenses.

Fair enough. "Source available" would be better than "open source" in this case, to avoid misleading people. (You do want them to read the terms.)

I'm not familiar with machine learning.

But, I'm familiar with poking around in source code repos!

I found this https://huggingface.co/openjourney/openjourney/blob/main/tex... . It's a giant binary file, a big binary blob.

(The format of the blob is Python's "pickle" format: a binary serialization of an in-memory object, used to save that object and later load it back, perhaps on a different machine.)
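To make the save/load round trip concrete, here's a minimal sketch using Python's standard-library `pickle` (the dict of fake "weights" is just an illustration, not the actual checkpoint format):

```python
import pickle

# A stand-in for model state: any in-memory Python object can be pickled.
weights = {"layer1": [0.1, -0.3], "layer2": [0.7]}

blob = pickle.dumps(weights)      # serialize to bytes (what the repo file contains)
restored = pickle.loads(blob)     # reconstruct the object, possibly on another machine

assert restored == weights
```

One caveat worth knowing: unpickling can execute arbitrary code embedded in the blob, which is why you should only load pickle files from sources you trust.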

But, I did not find any source code for generating that file. Am I missing something?

Shouldn't there at least be a list of input images, etc and some script that uses them to train the model?

Hahahahaha you sweet summer child. Training code? For an art generator?!

Yeah, no. Nobody in the AI community actually provides training code. If you want to train from scratch you'll need to understand what their model architecture is, collect your own dataset, and write your own training loop.

The closest I've come across is code for training an unconditional U-Net; those just take an image and denoise/draw it. CLIP also has its own training code - though everyone just seems to use OpenAI CLIP[0]. You'll need to figure out how to write a Diffusers pipeline that lets you combine CLIP and a U-Net together, and then alter the U-Net training code to feed CLIP vectors into the model, etc. Stable Diffusion also uses a Variational Autoencoder in front of the U-Net to get higher resolution and training performance, which I've yet to figure out how to train.
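To show the basic shape of that "add noise, predict, take a gradient step" loop without any of the U-Net/CLIP machinery, here's a deliberately tiny toy (my own illustration, not Stable Diffusion code): a single learnable scalar `w` trained to recover a clean value from a noisy observation. Real denoiser training follows the same skeleton, just with images, a neural network, and a noise schedule.

```python
import random

random.seed(0)
w = 0.0      # the entire "model": one learnable parameter
lr = 0.05

for step in range(2000):
    x = random.uniform(-1, 1)          # "clean image" (here just a scalar)
    noisy = x + random.gauss(0, 0.1)   # forward process: corrupt it with noise
    pred = w * noisy                   # "denoiser": rescale the noisy input
    err = pred - x                     # squared-error loss is err**2
    w -= lr * err * noisy              # SGD step (the factor of 2 folded into lr)

print(w)
```

Under these assumptions the optimal linear denoiser is Var(x) / (Var(x) + Var(noise)) ≈ (1/3) / (1/3 + 0.01) ≈ 0.97, and `w` settles near that value. Swapping the scalar for pixels and `w` for a U-Net conditioned on CLIP vectors is, conceptually, the part nobody ships code for.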

The blob you are looking at is the actual model weights. For you see, AI is proprietary software's final form. Software so proprietary that not even the creators are allowed to see the source code. Because there is no source code. Just piles and piles of linear algebra, nonlinear activation functions, and calculus.

For the record, I am trying to train an image generator from scratch using public domain data sources[1]. It is not going well: after adding more images it seems to have gotten significantly dumber, with or without a from-scratch trained CLIP.

[0] I think Google Imagen is using BERT actually

[1] Specifically, the PD-Art-old-100 category on Wikimedia Commons.

Have you looked at LAION-400M? The OpenCLIP[1] people have replicated CLIP's performance by training on it.

[1] https://github.com/mlfoundations/open_clip