I'm impressed by all of these image generators but I still don't see them working toward being able to say, "Give me an astronaut riding a horse. Ok, now the same location where he arrives at a rocket. Now one where he dismounts. Now the horse runs away as the astronaut enters the rocket."

You can ask for all those things, but the AI still has no idea what it's doing and cannot tell you where the astronaut is, etc.

So, what you're asking for is shared context over multiple prompts, which really isn't what this generation of models is trained for. It's moving the goalposts on the mounted astronaut.

However, there is progress towards what you're asking for. The recent work on textual inversion is in the right direction: https://github.com/hlky/sd-enable-textual-inversion

It creates a representation of an entity and allows rendering it in different styles and contexts. Currently it involves model fine-tuning, but I expect it will become convenient as the power of the operation becomes clear. And once it's convenient, you'll be able to do the progressive queries you're asking for (and it'll be a lot easier to create narratively coherent sets of images).
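To give a sense of what that workflow looks like in practice, here's a minimal sketch using the Hugging Face diffusers library's load_textual_inversion API (a different implementation than the repo linked above; the checkpoint name and the learned <cat-toy> pseudo-token are just illustrative examples from the public sd-concepts-library):

    import torch
    from diffusers import StableDiffusionPipeline

    # Load a base Stable Diffusion checkpoint (model name is illustrative).
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Load a pre-trained textual inversion embedding. This teaches the
    # text encoder a new pseudo-token ("<cat-toy>" here) that stands for
    # a specific entity learned from a handful of example images.
    pipe.load_textual_inversion("sd-concepts-library/cat-toy")

    # The learned token can now be reused across prompts, so the entity
    # stays consistent while the scene around it changes.
    image = pipe("a photo of <cat-toy> riding a horse on the moon").images[0]
    image.save("scene_1.png")

    image = pipe("a photo of <cat-toy> next to a rocket, horse running away").images[0]
    image.save("scene_2.png")

Reusing one learned token across a sequence of prompts like this is the crude beginning of the progressive queries described above: the entity is pinned down, even though the model still has no shared scene state between images.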