On a technical level, they're doing something really simple -- take BLIP2's ViT-L+Q-former, connect it to Vicuna-13B with a linear layer, and train just the tiny layer on some datasets of image-text pairs.
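To make the "linear layer" part concrete: the glue is roughly a single projection that maps the Q-former's output tokens into Vicuna's embedding space. A minimal PyTorch sketch -- the dimensions and names here are my own illustrative guesses, not the repo's actual code:

    import torch
    import torch.nn as nn

    # Illustrative sizes, not the repo's actual values.
    Q_FORMER_DIM = 768   # width of the Q-former's output query tokens
    LLM_DIM = 5120       # Vicuna-13B's hidden size

    # The trainable part is (roughly) this single layer: it maps each visual
    # query token into the LLM's embedding space so Vicuna can consume them
    # as if they were word embeddings.
    visual_to_llm = nn.Linear(Q_FORMER_DIM, LLM_DIM)

    # Pretend output of the frozen ViT + Q-former: (batch, num_queries, dim)
    image_queries = torch.randn(1, 32, Q_FORMER_DIM)
    llm_inputs = visual_to_llm(image_queries)   # -> (1, 32, LLM_DIM)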
But the results are pretty amazing. It completely knocks OpenFlamingo and even the original BLIP-2 models out of the park. And best of all, it arrived before OpenAI's GPT-4 image modality did. A real win for open-source AI.
The repo's default inference code is kind of bad -- Vicuna is loaded in fp16, so it can't fit on any consumer hardware. I created a PR on the repo to load it in int8, so hopefully by tomorrow it'll be runnable by 3090/4090 users.
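For anyone curious, the change is basically the standard bitsandbytes 8-bit loading path in transformers; something roughly like this (the model path and exact call site are placeholders, not the actual PR):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_PATH = "path/to/vicuna-13b"  # placeholder for wherever your weights live

    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

    # load_in_8bit quantizes the linear layers via bitsandbytes at load time,
    # roughly halving memory vs fp16 so the 13B LLM fits in ~24GB of VRAM.
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        load_in_8bit=True,
        device_map="auto",
    )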
I also developed a toy Discord bot (https://github.com/152334H/MiniGPT-4-discord-bot) to show the model to some people, but inference is very slow, so I doubt I'll be hosting it publicly.
> they're doing something really simple -- take BLIP2's ViT-L+Q-former, connect it to Vicuna-13B with a linear layer, and train just the tiny layer on some datasets of image-text pairs
Oh yes. Simple! Jesus, this ML stuff makes a humble web dev like myself feel like a dog trying to read Tolstoy.
In practice, it's a lot more like web dev than you might imagine.
What the above means is that the approach is web-dev-style gluing -- almost literally just:
    from existingliba import someop      # e.g. the frozen ViT + Q-former
    from existinglibb import anotherop   # e.g. the frozen Vicuna LLM
    from someaifw import glue            # the one layer you actually train

    a = someop(X)        # encode the image
    b = glue(a)          # project it into the LLM's input space
    Y = anotherop(b)     # generate text from the projected features
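The "train just the tiny layer" part is the same flavor of glue: freeze the imported stuff and hand only the glue's parameters to the optimizer. A hedged sketch with stand-in modules (none of these names are from the actual repo):

    import torch
    import torch.nn as nn

    # Stand-ins: in MiniGPT-4 these would be the frozen ViT + Q-former, the
    # frozen Vicuna-13B, and the single trainable projection layer.
    vision_encoder = nn.Linear(1024, 768)
    llm = nn.Linear(5120, 5120)
    glue = nn.Linear(768, 5120)   # the only part that gets trained

    for frozen in (vision_encoder, llm):
        for p in frozen.parameters():
            p.requires_grad = False

    # Only the glue's parameters go to the optimizer.
    optimizer = torch.optim.AdamW(glue.parameters(), lr=1e-4)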
And just like web dev, each of those was built on a different platform and requires arcane incantations and five hours of doc perusing to make it work on your system.
You can just ask GPT how to do it. Much like a lot of web dev!
At some point someone will make a service where you can let the AI take over your computer directly. Easier that way! Curling straight to a shell, taken to the next level.