Text will be better simply due to scale, but it will still be limited by the use of CLIP for text encoding (BPEs + contrastive training). So that may be SD XL 0.9, but its text should still be worse than that of a model that uses T5 for text encoding, like https://github.com/deep-floyd/IF
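
To illustrate the BPE half of that point, here is a toy sketch of byte-pair encoding. The merge table and vocabulary are made up for the example (they are not CLIP's real ones), and `bpe_encode` is a hypothetical helper, not a library call. It only shows that once merges apply, a word reaches the encoder as an opaque ID rather than as letters.

```python
# Toy BPE: the merge table and vocabulary below are invented for illustration
# and are NOT taken from CLIP's actual tokenizer.
MERGES = [("s", "t"), ("o", "p"), ("st", "op")]  # applied in priority order
VOCAB = {tok: i for i, tok in enumerate(["s", "t", "o", "p", "st", "op", "stop"])}


def bpe_encode(word, merges=MERGES):
    """Split a word into characters, then apply each merge rule in order."""
    symbols = list(word)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Fuse an adjacent (left, right) pair into a single symbol.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


def to_ids(word):
    """Map a word to the integer IDs a text encoder would actually receive."""
    return [VOCAB[tok] for tok in bpe_encode(word)]


print(bpe_encode("stop"), to_ids("stop"))
print(bpe_encode("pots"), to_ids("pots"))
```

In this toy setup "stop" collapses to a single ID while "pots" (same letters) stays as four character IDs, so nothing in the input for "stop" tells the model how it is spelled.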