My prediction: either these models will have to expose some editable intermediate step, or they will have to become as smart as a human.

From the user's point of view, a sentence is turned into an image. But there is an underlying structure to the pixels that we aren't allowed to tinker with.

Many choices are made along the way: placement, color palette, level of detail, emotion evoked, or sub-image descriptions (what should this bush look like, what kind of cloud should this be?).

These image generators are so useful precisely because they fill in all the details you leave out, but you have no way to tinker with the intermediate choices they make, and you can't just hand them a paragraph of specifications either.

There are models available that give you more control - in some senses, at least.

For example, you can use Stable Diffusion with ControlNet [1], where you can feed in an OpenPose skeleton to fix the pose of the people in the scene.
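
For a rough sense of what that looks like in code, here's a minimal sketch using the Hugging Face diffusers and controlnet_aux libraries rather than the webui extension linked in [1]; the model IDs are common public checkpoints and 'reference_pose.jpg' is a placeholder file:

    # Sketch using diffusers (the webui extension in [1] wraps the same idea).
    import torch
    from controlnet_aux import OpenposeDetector
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
    from diffusers.utils import load_image

    # Extract an OpenPose skeleton from a reference photo of the pose you want.
    openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
    pose_image = openpose(load_image("reference_pose.jpg"))  # placeholder file

    # Load a ControlNet trained on OpenPose inputs and plug it into SD 1.5.
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    # The prompt fills in everything else; the pose image pins the composition.
    image = pipe("a knight resting in a sunlit forest", image=pose_image).images[0]
    image.save("knight.png")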

There's also a 'Regional Prompter' [2] which lets you use different prompts for different areas of the image, giving you some control over the composition.
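
Under the hood the idea is roughly: run the text-conditioned denoiser once per regional prompt and blend the noise predictions with spatial masks. This is a toy illustration of that blend step, not the extension's actual code; predict_noise is a stand-in for the real UNet:

    # Toy illustration of regional prompting: each region of the latent
    # follows its own text conditioning via a masked blend of predictions.
    import numpy as np

    H, W = 64, 64  # latent resolution for a 512x512 image

    def predict_noise(latent, prompt):
        """Stand-in for the UNet's text-conditioned noise prediction."""
        rng = np.random.default_rng(sum(map(ord, prompt)))
        return rng.standard_normal(latent.shape)

    latent = np.random.default_rng(0).standard_normal((4, H, W))

    # Left half of the frame: a castle. Right half: a forest.
    mask_left = np.zeros((1, H, W))
    mask_left[..., : W // 2] = 1.0
    mask_right = 1.0 - mask_left  # masks sum to 1 everywhere

    eps_left = predict_noise(latent, "a stone castle at dawn")
    eps_right = predict_noise(latent, "a dense pine forest")

    # Each denoising step would use this region-weighted combination.
    eps = mask_left * eps_left + mask_right * eps_right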

You can also use 'inpainting' to regenerate select parts of your image if, for example, you don't like the shape of the clouds.
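
A minimal inpainting sketch with diffusers (assuming an inpainting checkpoint such as runwayml/stable-diffusion-inpainting; the file names are placeholders) looks roughly like this:

    # The mask is white where the image should be regenerated (the clouds)
    # and black everywhere you want kept.
    import torch
    from diffusers import StableDiffusionInpaintPipeline
    from diffusers.utils import load_image

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    init_image = load_image("scene.png")       # the image you mostly like
    mask_image = load_image("cloud_mask.png")  # white over the bad clouds

    result = pipe(
        prompt="soft wispy cirrus clouds",
        image=init_image,
        mask_image=mask_image,
    ).images[0]
    result.save("scene_fixed.png")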

Of course, this stuff isn't perfect - for example, you'll sometimes get hands with the wrong number of fingers, no matter what you specify. And you can't easily generate things like multi-frame cartoons without the characters' clothes changing between frames.

[1] https://github.com/Mikubill/sd-webui-controlnet

[2] https://github.com/hako-mikan/sd-webui-regional-prompter