On a technical level, they're doing something really simple -- take BLIP-2's ViT-L + Q-Former, connect it to Vicuna-13B with a linear layer, and train just that tiny layer on some datasets of image-text pairs.
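
To make the shape of that concrete, here's a rough PyTorch sketch of the idea -- the module names, dimensions, and dummy stand-ins below are mine, not the repo's actual code; the only point it's making is that everything stays frozen except one nn.Linear:

    import torch
    import torch.nn as nn

    # Dummy stand-ins for the real pretrained pieces (BLIP-2's ViT + Q-Former
    # and Vicuna-13B). Dimensions are illustrative, not the real config.
    QFORMER_DIM = 768       # width of the Q-Former's output tokens
    LLM_DIM = 5120          # hidden size of a 13B LLaMA-family model
    NUM_QUERY_TOKENS = 32   # BLIP-2 uses 32 learned query tokens

    class FrozenVisionStack(nn.Module):
        """Pretend ViT + Q-Former: image -> a handful of visual tokens."""
        def forward(self, images):
            return torch.randn(images.shape[0], NUM_QUERY_TOKENS, QFORMER_DIM)

    class FrozenLLM(nn.Module):
        """Pretend Vicuna: consumes embeddings, returns a scalar LM loss."""
        def forward(self, inputs_embeds):
            return inputs_embeds.pow(2).mean()  # stand-in for the real LM loss

    vision_stack = FrozenVisionStack().eval()
    llm = FrozenLLM().eval()

    # The one trainable piece: a linear projection from Q-Former space into
    # the LLM's input-embedding space.
    proj = nn.Linear(QFORMER_DIM, LLM_DIM)

    # Freeze everything that isn't the projection.
    for frozen in (vision_stack, llm):
        for p in frozen.parameters():
            p.requires_grad = False
    optimizer = torch.optim.AdamW(proj.parameters(), lr=1e-4)

    # One illustrative training step on a fake batch of image-text pairs.
    images = torch.randn(4, 3, 224, 224)
    with torch.no_grad():
        visual_tokens = vision_stack(images)   # (B, 32, 768)
    embeds = proj(visual_tokens)               # (B, 32, 5120)
    # In the real model these embeddings get spliced into the prompt's token
    # embeddings before being fed to the LLM; the stand-in just returns a loss.
    loss = llm(embeds)
    loss.backward()
    optimizer.step()

The whole trick is that the expensive components stay frozen; the only gradients computed are for that one matrix.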

But the results are pretty amazing. It completely knocks OpenFlamingo and even the original BLIP-2 models out of the park. And best of all, it arrived before OpenAI shipped GPT-4's image modality. A real win for open-source AI.

The repo's default inference code is kind of bad -- Vicuna-13B is loaded in fp16, which at roughly 26 GB of weights won't fit on any consumer GPU. I created a PR on the repo to load it in int8 instead, so hopefully by tomorrow it'll be runnable by 3090/4090 users.
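
For anyone unfamiliar, the general shape of 8-bit loading with huggingface transformers + bitsandbytes (not necessarily the exact diff in my PR) is roughly:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # "path/to/vicuna-13b" is a placeholder; point it at wherever your merged
    # Vicuna weights live. Requires `pip install bitsandbytes accelerate`.
    model_name = "path/to/vicuna-13b"

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        load_in_8bit=True,   # quantize the linear layers to int8 at load time
        device_map="auto",   # let accelerate place layers across GPU/CPU
    )

13B parameters at one byte each is ~13-14 GB of weights, which leaves headroom for activations on a 24 GB card; at fp16 it's ~26 GB before you've loaded anything else.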

I also developed a toy Discord bot (https://github.com/152334H/MiniGPT-4-discord-bot) to show the model to some people, but inference is very slow, so I doubt I'll be hosting it publicly.

> they're doing something really simple -- take BLIP-2's ViT-L + Q-Former, connect it to Vicuna-13B with a linear layer, and train just that tiny layer on some datasets of image-text pairs

Oh yes. Simple! Jesus, this ML stuff makes a humble web dev like myself feel like a dog trying to read Tolstoy.

In practice, it's a lot more like web dev than you might imagine.

The above means the approach is web-dev-style gluing -- almost literally just:

    # two pretrained models from two different libraries
    from existingliba import someop
    from existinglibb import anotherop
    # plus whatever framework provides the glue between them
    from someaifw import glue

    a = someop(X)       # run the first pretrained model
    b = glue(a)         # adapt its output to what the second one expects
    Y = anotherop(b)    # run the second pretrained model

And just like web dev, each of those was built on a different platform and requires arcane incantations and 5 hours of doc perusing to make it work on your system.

You can just ask GPT how to do it. Much like a lot of web dev!

At some point someone will make a service where you can let the AI take over your computer directly. Easier that way! Curling straight to shell, taken to the next level.

So...AutoGPT? Now with command-line access! Have fun :)

https://github.com/Significant-Gravitas/Auto-GPT/