For example, you can use Stable Diffusion with 'ControlNet' [1], where you can input an 'openpose' skeleton to choose the pose of the people in the scene.
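If you'd rather drive this from Python than through a UI, ControlNet is also exposed in Hugging Face's diffusers library. A minimal sketch, assuming the commonly published checkpoints (swap in whatever SD 1.5 weights you have) and a placeholder pose.png for the openpose skeleton image:

    import torch
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
    from diffusers.utils import load_image

    pose = load_image("pose.png")  # an openpose stick-figure image (placeholder name)

    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
        torch_dtype=torch.float16).to("cuda")

    # the pose image constrains where and how the people stand
    image = pipe("two people dancing in a park", image=pose).images[0]
    image.save("out.png")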
There's also a 'Regional Prompter' [2] which lets you use different prompts for different areas of the image, giving you some control over the composition.
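For a feel of what that looks like in practice: in Regional Prompter's matrix/columns mode you split one prompt into per-region chunks with the BREAK keyword (this is from memory of its readme [2], so double-check the exact settings there). With the image divided into three equal columns, something like:

    a clear blue sky BREAK
    a ruined castle on a hill BREAK
    a dense pine forest

with the divide ratio set to 1,1,1 and Columns mode enabled in the extension's panel.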
You can also use 'inpainting' to regenerate selected parts of your image if, for example, you don't like the shape of the clouds.
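Inpainting is scriptable the same way: you hand it the original image plus a mask, and the white pixels of the mask get regenerated while the black ones are kept. Again a rough diffusers sketch with placeholder file names:

    import torch
    from diffusers import StableDiffusionInpaintPipeline
    from diffusers.utils import load_image

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting",
        torch_dtype=torch.float16).to("cuda")

    init = load_image("scene.png")       # the image you mostly like
    mask = load_image("cloud_mask.png")  # white over the clouds to redo

    out = pipe(prompt="dramatic cumulus clouds at sunset",
               image=init, mask_image=mask).images[0]
    out.save("scene_fixed.png")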
Of course, this stuff isn't perfect - sometimes you'll get hands with the wrong number of fingers, no matter what you specify. And you can't easily generate things like multi-frame cartoons without the characters' clothes changing between frames.
[1] https://github.com/Mikubill/sd-webui-controlnet
[2] https://github.com/hako-mikan/sd-webui-regional-prompter
But for people who like it, the ControlNet scribble model (and the other ControlNet models: depth maps, pose control, edge detection, etc.) [0] is supported in the ControlNet extension [1] to the A1111 Stable Diffusion Web UI [2], and probably in similar extensions for other popular Stable Diffusion UIs. It should work in any current browser, and at least the A1111 UI with the ControlNet models runs on machines with as little as 4GB of VRAM; there's a rough scripted equivalent sketched below the links.
[0] home repo: https://huggingface.co/lllyasviel/ControlNet, but for the WebUI you probably want the checkpoints linked from the readme of the WebUI ControlNet extension [1]
[1] https://github.com/Mikubill/sd-webui-controlnet (EDIT: even if you aren’t using the A1111 WebUI, this repo has a nice set of examples of what each of the ControlNet models does, so it may be worth checking out.)
[2] https://github.com/AUTOMATIC1111/stable-diffusion-webui
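And the promised sketch: if you'd rather script the scribble workflow than click through the WebUI, here's roughly how it looks with diffusers, including the memory savers that are more or less how you'd squeeze it into ~4GB of VRAM. Checkpoint names and file paths here are my assumptions, not anything from the extension itself:

    import torch
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
    from diffusers.utils import load_image

    scribble = load_image("scribble.png")  # a rough scribble/sketch image

    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16)
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
        torch_dtype=torch.float16)

    # don't .to("cuda") here: offload moves modules on and off the GPU itself
    pipe.enable_model_cpu_offload()  # keeps only the active module in VRAM
    pipe.enable_attention_slicing()  # lower peak memory, somewhat slower

    image = pipe("a cozy cabin in the woods", image=scribble).images[0]
    image.save("cabin.png")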