I'm impressed by all of these image generators, but I still don't see them working toward being able to say, "Give me an astronaut riding a horse. OK, now the same location where he arrives at a rocket. Now one where he dismounts. Now the horse runs away as the astronaut enters the rocket."
You can ask for all those things but the AI still has no idea what it's doing and cannot tell you where the astronaut is, etc.
However, there is progress towards what you're asking for. The recent work on textual inversion is in the right direction: https://github.com/hlky/sd-enable-textual-inversion
It creates a representation of an entity and allows rendering it in different styles and contexts. Currently it requires a fine-tuning step for each new entity, but I expect it will become more convenient as the power of the operation becomes clear. And once it's convenient, you'll be able to do the progressive queries you're asking for (and it'll be a lot easier to create narratively coherent sets of images).
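To make that concrete, here's a rough sketch of what the progressive queries might look like once an entity has its own learned token. The token name `<my-astronaut>` and the helper `build_prompts` are made up for illustration and aren't part of the linked repo; the sketch only builds the prompt strings and doesn't call any image model.

```python
# Hypothetical placeholder token. In textual inversion, a token like this
# would be bound to an embedding learned from a few example images.
PLACEHOLDER = "<my-astronaut>"

# The sequence of scenes from the parent comment, phrased as prompt templates.
SCENES = [
    "{who} riding a horse across a desert",
    "{who} on horseback arriving at a rocket, same desert",
    "{who} dismounting beside the rocket, same desert",
    "{who} entering the rocket while the horse runs away, same desert",
]


def build_prompts(token, scenes):
    """Substitute the same learned token into every scene description."""
    return [scene.format(who=token) for scene in scenes]


prompts = build_prompts(PLACEHOLDER, SCENES)
for prompt in prompts:
    print(prompt)
```

The point is just that the same token shows up in every prompt, so the entity is pinned down by its learned representation rather than re-described each time. Keeping the location consistent would presumably need its own token or some other mechanism.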