So, what you're asking for is shared context over multiple prompts, which really isn't what this generation of models is trained for. It's moving the goalposts on the mounted astronaut.

However, there is progress towards what you're asking for. The recent work on textual inversion is a step in the right direction: https://github.com/hlky/sd-enable-textual-inversion

It creates a representation of an entity and allows rendering it in different styles and contexts. Currently it requires a separate training run per entity (it learns a new embedding for the entity from a few example images while the model itself stays frozen), but I expect it will become convenient as the power of the operation becomes clear. And once it's convenient, you'll be able to do the progressive queries you're asking for (and it'll be a lot easier to create narratively coherent sets of images).
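
For intuition, here's a toy sketch of the core idea as I understand it: the model stays frozen and only a new token embedding is optimized until the output matches the examples. Everything in it is made up for illustration (a tiny linear map standing in for the model, the names `render` and `learn_embedding`, the numbers); it is not the API of the linked repo.

```python
# Toy illustration only: a frozen linear "model" stands in for the real
# text-to-image network. All names and values here are hypothetical.

FROZEN_WEIGHTS = [[1.0, 0.0, 0.5],
                  [0.0, 1.0, -0.5]]  # never updated during training


def render(embedding):
    """Frozen 'model': map a 3-d token embedding to 2-d 'image features'."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in FROZEN_WEIGHTS]


def learn_embedding(target, steps=500, lr=0.1):
    """Gradient descent on the embedding alone; the model weights stay fixed."""
    embedding = [0.0, 0.0, 0.0]
    for _ in range(steps):
        error = [out - t for out, t in zip(render(embedding), target)]
        # Gradient of 0.5 * ||W e - t||^2 with respect to e is W^T (W e - t).
        grad = [sum(FROZEN_WEIGHTS[i][j] * error[i] for i in range(2))
                for j in range(3)]
        embedding = [e - lr * g for e, g in zip(embedding, grad)]
    return embedding


target_features = [0.8, -0.3]  # stands in for the example images of the entity
new_token = learn_embedding(target_features)
reconstruction = render(new_token)
```

The learned `new_token` is the reusable "representation of the entity": in the real thing you'd drop it into new prompts to get the same subject in other styles and contexts.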