Text will get better from scale alone, but it will still be limited by the use of CLIP for text encoding (BPE tokenization plus a contrastive objective). So that may be SD XL 0.9, but it should still be worse than a model that uses T5, like https://github.com/deep-floyd/IF
DeepFloyd?
Use https://github.com/deep-floyd/IF; it uses an LLM to generate the exact art you need.
GitHub: https://github.com/deep-floyd/IF
Colab notebook for running the model with the diffusers library: https://colab.research.google.com/github/huggingface/noteboo...
Hugging Face Space for testing the model: https://huggingface.co/spaces/DeepFloyd/IF
Note that the model is substantially more compute-intensive than Stable Diffusion, so it may be slower even though that Space is running on an A100.
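For anyone who would rather run it locally than open the notebook, here is a rough sketch of the first two IF stages through diffusers. It is not taken from the Colab linked above. The model IDs, the fp16 variant and the CPU offload calls are assumptions based on the public diffusers examples for IF, `generate` is just a name picked for illustration, and the weights are gated, so the license has to be accepted on Hugging Face and you need to be logged in first.

```python
# Sketch only: IDs and settings are assumptions based on public diffusers examples.
STAGE_1_ID = "DeepFloyd/IF-I-XL-v1.0"  # base model, 64x64 output, ships the T5 text encoder
STAGE_2_ID = "DeepFloyd/IF-II-L-v1.0"  # super-resolution model, 64x64 -> 256x256


def generate(prompt: str, seed: int = 0):
    """Run IF stage 1 and stage 2 for one prompt and return a 256x256 PIL image."""
    # Heavy imports are kept inside the function so the file can be imported
    # on a machine without torch or a GPU.
    import torch
    from diffusers import DiffusionPipeline
    from diffusers.utils import pt_to_pil

    stage_1 = DiffusionPipeline.from_pretrained(
        STAGE_1_ID, variant="fp16", torch_dtype=torch.float16
    )
    stage_1.enable_model_cpu_offload()  # trades speed for lower VRAM use

    # Stage 2 reuses the embeddings from stage 1, so its text encoder is skipped.
    stage_2 = DiffusionPipeline.from_pretrained(
        STAGE_2_ID, text_encoder=None, variant="fp16", torch_dtype=torch.float16
    )
    stage_2.enable_model_cpu_offload()

    # The prompt is encoded once with T5 and shared by both stages.
    prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)
    generator = torch.manual_seed(seed)

    image = stage_1(
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_embeds,
        generator=generator,
        output_type="pt",
    ).images

    image = stage_2(
        image=image,
        prompt_embeds=prompt_embeds,
        negative_prompt_embeds=negative_embeds,
        generator=generator,
        output_type="pt",
    ).images

    return pt_to_pil(image)[0]
```

Calling `generate("a corgi holding a sign that says hello").save("if_stage_2.png")` would write the 256x256 result to disk. The optional third upscaling stage is left out to keep the sketch short.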