Skimming it, there are a few things about this explanation that rub me just slightly the wrong way.

1. Calling the input token sequence a "command". It probably only makes sense to think of this as a "command" on a model that's been fine-tuned to treat it as such.

2. Skipping over BPE (byte-pair encoding) as part of tokenization - but almost every transformer explainer does this, I guess.

3. Describing transformers as using a "word embedding". I'm not actually aware of any transformers that use true word embeddings, except those that incidentally fall out of other tokenization approaches when a common word happens to be a single token.

4. Describing positional embeddings as multiplicative. They are generally (and very counterintuitively to me, but nevertheless) added to the token embeddings rather than multiplied; see the sketch after this list.

5. "what attention does is it moves the words in a sentence (or piece of text) closer in the word embedding" No, that's just incorrect.

6. You don't actually need a softmax layer at the end: they're just picking the top token, and they can do that pre-softmax because the argmax won't change (softmax is monotonic). It's also odd to single out softmax here when its most prominent use in transformers is actually inside the attention component.

7. Really shortchanges the feedforward component. It may be simple, but it's crucial to making the whole thing work.

8. Nothing about the residual connections.
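For concreteness, here's a minimal NumPy sketch of a single decoder block covering points 4, 6, 7 and 8 above. It's a pseudocode-level simplification, not any real library's API: one attention head, no layer norm, no causal mask, and made-up parameter names.

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(x, params):
        # Single-head self-attention; note the softmax here is over attention
        # scores, which is where softmax does most of its work (point 6).
        q, k, v = x @ params["wq"], x @ params["wk"], x @ params["wv"]
        scores = q @ k.T / np.sqrt(k.shape[-1])
        return softmax(scores) @ v

    def feed_forward(x, params):
        # Point 7: the position-wise feedforward block - two linear maps with
        # a nonlinearity in between (ReLU here for brevity).
        return np.maximum(0, x @ params["w1"]) @ params["w2"]

    def decoder_forward(token_ids, params):
        # Point 4: positional embeddings are ADDED to token embeddings, not multiplied.
        x = params["tok_emb"][token_ids] + params["pos_emb"][:len(token_ids)]

        # Point 8: each sub-layer sits inside a residual connection.
        x = x + attention(x, params)
        x = x + feed_forward(x, params)

        logits = x @ params["tok_emb"].T  # project back onto the vocabulary

        # Point 6: for greedy decoding, argmax over logits equals argmax over
        # softmax(logits), since softmax is order-preserving; the final softmax
        # only matters if you want actual probabilities (e.g. for sampling).
        return int(np.argmax(logits[-1]))

    # Toy usage with random weights, just to show the shapes line up:
    rng = np.random.default_rng(0)
    d, d_ff, vocab, max_len = 16, 64, 100, 32
    params = {
        "tok_emb": rng.normal(size=(vocab, d)),
        "pos_emb": rng.normal(size=(max_len, d)),
        "wq": rng.normal(size=(d, d)),
        "wk": rng.normal(size=(d, d)),
        "wv": rng.normal(size=(d, d)),
        "w1": rng.normal(size=(d, d_ff)),
        "w2": rng.normal(size=(d_ff, d)),
    }
    print(decoder_forward([1, 2, 3], params))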

I know we don't have access to the details at OpenAI, but it does seem like there have been significant changes to the BPE token size over time. There seems to be a push towards much longer tokens than the previous ~3-character ones (at least judging by behavior).

OpenAI have made their tokenizers public [1].

As someone has pointed out, with BPE you specify the vocab size, not the token size. It's a relatively simple algorithm; this Hugging Face course does a nice job of explaining it [2], and the original paper has a very readable Python example [3].
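To make that concrete, here's a toy BPE trainer in the spirit of the paper's example [3] (simplified, my own variable names): the knob you set is the target vocab size, and token lengths just emerge from how many merges that budget allows.

    from collections import Counter

    def train_bpe(corpus, vocab_size):
        # Start from individual characters, with an end-of-word marker.
        words = Counter(tuple(w) + ("</w>",) for w in corpus.split())
        vocab = {sym for w in words for sym in w}
        merges = []
        while len(vocab) < vocab_size:
            # Count adjacent symbol pairs, weighted by word frequency.
            pairs = Counter()
            for w, freq in words.items():
                for a, b in zip(w, w[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)
            merges.append(best)
            vocab.add(best[0] + best[1])
            # Apply the winning merge to every word.
            new_words = Counter()
            for w, freq in words.items():
                out, i = [], 0
                while i < len(w):
                    if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                        out.append(w[i] + w[i + 1])
                        i += 2
                    else:
                        out.append(w[i])
                        i += 1
                new_words[tuple(out)] += freq
            words = new_words
        return merges

    print(train_bpe("low lower lowest low low", vocab_size=12))

So a "push towards larger tokens" would really be a push towards a larger vocab (more merges), which is roughly what the jump from GPT-2's ~50k vocab to the ~100k cl100k_base vocab in tiktoken suggests.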

[1] https://github.com/openai/tiktoken

[2] https://huggingface.co/course/chapter6/5?fw=pt

[3] https://arxiv.org/abs/1508.07909