Transformers aren't really a wonderful architecture in the sense of there being a great fit between the architecture and what we know about the task. (For comparison, I think convolutional networks are.)

What makes Transformers great is:

1. They can handle long sequences without a large increase in the number of trainable parameters: the attention weights are sized by the model width, not by the sequence length.

2. They parallelize better than previous sequence models, e.g. LSTMs: during training every position can be processed at once instead of stepping through the sequence token by token (rough sketch of both points below). If we could train LSTMs at the same size and on the same amount of data as current Transformers, they'd probably be just as good.
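
As a loose sketch of both points in PyTorch (the widths and sequence lengths are arbitrary, and nn.MultiheadAttention / nn.LSTM stand in for a full Transformer block and LSTM stack):

    import torch
    import torch.nn as nn

    d_model = 512

    # Point 1: a self-attention layer's parameter count is set by the model
    # width (d_model), not by how long the input sequences are.
    attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)
    n_params = sum(p.numel() for p in attn.parameters())

    for seq_len in (128, 512, 2048):
        x = torch.randn(1, seq_len, d_model)
        out, _ = attn(x, x, x)                      # every position attends in one shot
        print(seq_len, tuple(out.shape), n_params)  # parameter count never changes

    # Point 2: an LSTM also has a fixed parameter count, but each time step
    # depends on the previous hidden state, so the steps can't run in parallel.
    lstm = nn.LSTM(input_size=d_model, hidden_size=d_model, batch_first=True)
    x = torch.randn(1, 1024, d_model)
    state = None
    for t in range(x.size(1)):                      # inherently token-by-token
        out_t, state = lstm(x[:, t:t + 1, :], state)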

So maybe RWKV [1] is the next step. It trains in parallel like a Transformer but runs as an RNN at inference time, with a fixed-size state per token, and it seems to have no fixed context-length limit.
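
To give a sense of where that comes from: the core of RWKV replaces attention with a per-channel linear recurrence, so the per-token state has a fixed size instead of a KV cache that grows with the context. A rough NumPy sketch of that WKV recurrence, simplified from what the repo describes (no numerical-stability tricks, made-up parameter values):

    import numpy as np

    def wkv_step(a, b, k_t, v_t, w, u):
        # a, b: running numerator / denominator state, one value per channel.
        # w: per-channel decay (> 0); u: per-channel bonus for the current token.
        out = (a + np.exp(u + k_t) * v_t) / (b + np.exp(u + k_t))
        a = np.exp(-w) * a + np.exp(k_t) * v_t   # decay old tokens, fold in the new one
        b = np.exp(-w) * b + np.exp(k_t)
        return out, a, b

    d, T = 8, 10_000
    rng = np.random.default_rng(0)
    w, u = np.full(d, 0.5), np.zeros(d)
    a, b = np.zeros(d), np.zeros(d)
    for t in range(T):                           # state stays size d however large T gets
        k_t, v_t = rng.normal(size=d), rng.normal(size=d)
        out, a, b = wkv_step(a, b, k_t, v_t, w, u)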

[1] https://github.com/BlinkDL/RWKV-LM