I have been toying around with Stable Diffusion for a while now and have become comfortable with the enormous community ecosystem of textual inversions, LoRAs, hypernetworks, and checkpoints. You can get things with names like “chill blend”: a model fine-tuned on top of base SD with the author’s personal style.

There is something called AUTOMATIC1111, a pretty comprehensive web UI for managing all these moving parts. It is filled to the brim with extensions to handle AI upscaling, inpainting, outpainting, and so on.

One of these is ControlNet, which lets you generate new images conditioned on pose information extracted from an existing image, or edited by hand in the integrated web-based 3D editor. Not just poses, but depth maps and more. All with a few clicks.
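The web UI wraps all of this in point-and-click, but the underlying mechanism fits in a few lines of code. Here is a minimal sketch of pose-conditioned generation using the diffusers library rather than A1111's own implementation; the model IDs are the commonly published ones, and the pose image path is a placeholder:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# OpenPose-conditioned ControlNet paired with a base SD 1.5 checkpoint
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# A pose skeleton extracted from a photo (or drawn by hand in an editor)
pose = load_image("pose_skeleton.png")  # placeholder path

image = pipe(
    "an astronaut doing a cartwheel, photorealistic",
    image=pose,
    num_inference_steps=30,
).images[0]
image.save("controlnet_out.png")
```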

The level of detail and sheer amount of stuff is ridiculous, and it all has meaning and a substantial impact on the end result. I have not even mentioned the prompting. You can do things like [cow:dog:.25], where the generator starts with a cow and then switches over to a dog at 25% of the denoising process. You can use parens like ((sunglasses)) to put extra attention weight on that concept.
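To make the [cow:dog:.25] idea concrete: the sampler runs the same denoising loop throughout, only the text conditioning it feeds the model gets swapped partway through. A toy sketch of the concept, not A1111's actual code; encode and denoise_step are hypothetical stand-ins for the text encoder and a single sampler step:

```python
def generate(encode, denoise_step, latents, steps=40):
    """Toy illustration of A1111-style prompt editing, [cow:dog:.25]."""
    cond_cow = encode("a photo of a cow")
    cond_dog = encode("a photo of a dog")
    switch_at = int(steps * 0.25)  # the ".25": switch a quarter of the way in

    for step in range(steps):
        # Early steps lay out the composition as a cow;
        # later steps refine it toward a dog.
        cond = cond_cow if step < switch_at else cond_dog
        latents = denoise_step(latents, cond, step)
    return latents
```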

There are LoRAs trained on specific styles and/or characters. These are usually only 5-100 MB and work unreasonably well.
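Part of why LoRAs spread so fast is that applying one is essentially a one-liner on top of any base checkpoint. A sketch with diffusers; the .safetensors file name is a placeholder for any community LoRA:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# A small (~5-100 MB) adapter file layered onto the frozen base weights
pipe.load_lora_weights("chill_style_lora.safetensors")  # placeholder file

image = pipe("a portrait, in the style the LoRA was trained on").images[0]
image.save("lora_out.png")
```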

You can easily switch back to the base model for comparison, and next to the community fine-tunes the original SD results look like an '80s arcade game next to GTA V. This stuff has been around for barely a year. This is ridiculous.

LLMs are enormously “undertooled”. Give it a year or so.

My point, by the way, is that any quality issues in the open-source models will be fixed, and then some.

Local LLMs already have a UI intentionally modeled on AUTOMATIC1111, with LoRAs, checkpoint training, and various extensions, including multimodal input and experimental long-term memory:

https://github.com/oobabooga/text-generation-webui
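The LoRA mechanics carry over directly to the LLM side: a small adapter applied on top of a frozen base model. A minimal sketch with transformers and peft, the libraries text-generation-webui builds on; the model name and adapter path are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("some-base-llm")     # placeholder
tokenizer = AutoTokenizer.from_pretrained("some-base-llm")

# Layer the small LoRA adapter onto the frozen base weights
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")  # placeholder

inputs = tokenizer("The quick brown fox", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```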