Yeah, I think the (worrying) confusion is that Amazon calls it a seq2seq model, which was the name of a SOTA RNN from Google a while back.

Ofc now, seq2seq just means what you said (an encoder/decoder model, which is actually what a “truly vanilla” transformer would be anyway).
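To make the terminology concrete, here's a minimal sketch of that "truly vanilla" encoder/decoder setup. It just uses PyTorch's built-in `nn.Transformer` (whose defaults mirror the original "Attention Is All You Need" configuration); the shapes and sizes below are arbitrary illustration values, not anything from Amazon's model.

```python
import torch
import torch.nn as nn

# The plain encoder/decoder transformer: a seq2seq model in the modern sense.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.rand(10, 32, 512)  # (source_len, batch, d_model)
tgt = torch.rand(20, 32, 512)  # (target_len, batch, d_model)
out = model(src, tgt)          # decoder output: (target_len, batch, d_model)
```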

The fact that any serious researcher thinks other serious researchers are using models without self-attention is the real red flag here.

No one is trying to use other models anymore because they do not scale. There’s enough variety within transformers that you could argue we need a new level of taxonomy, but transformers are basically it for now.

That's actually not true: https://github.com/BlinkDL/RWKV-LM
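For anyone unfamiliar with why that repo is a counterexample: RWKV replaces self-attention with a decaying recurrent state that can be computed token by token, like an RNN. Below is a rough, simplified per-channel sketch of that WKV recurrence as described in the RWKV paper; it's an illustration under my own simplifications, not the actual RWKV-LM code, which adds token-shift mixing, a receptance gate, and numerical-stability tricks.

```python
import numpy as np

def wkv(k, v, w, u):
    """Sketch of the WKV recurrence.
    k, v: (T, C) keys/values; w: (C,) channel decay; u: (C,) bonus for the current token."""
    T, C = k.shape
    num = np.zeros(C)          # running exp-weighted sum of past values
    den = np.zeros(C)          # running sum of the weights
    out = np.zeros((T, C))
    for t in range(T):
        # current token contributes with an extra "bonus" weight u instead of the decay
        out[t] = (num + np.exp(u + k[t]) * v[t]) / (den + np.exp(u + k[t]))
        # decay the state and fold the current token in for future steps
        num = np.exp(-w) * num + np.exp(k[t]) * v[t]
        den = np.exp(-w) * den + np.exp(k[t])
    return out

T, C = 8, 4
out = wkv(np.random.randn(T, C), np.random.randn(T, C), np.ones(C), np.zeros(C))
```

No attention matrix anywhere, and the state is O(C) per layer regardless of context length, which is the whole pitch.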