> A helpful rule of thumb is that one token generally corresponds to ~4 characters of text for common English text. This translates to roughly 3/4 of a word (so 100 tokens ~= 75 words).
Just for fun I tried entering "pneumonoultramicroscopicsilicovolcanoconiosis" and "antidisestablishmentarianism". The first was split fairly evenly into tokens of 1-5 characters, but the second kept all of "establishment" as a single token.
No useful conclusions drawn, but it was an interesting test.
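If anyone wants to repeat the experiment locally, here's a rough sketch using OpenAI's tiktoken library. I'm assuming the cl100k_base encoding here; the exact splits will differ by encoding/model.

```python
# Sketch: inspect how the tokenizer splits a couple of long words.
# Assumes the cl100k_base encoding; other encodings split differently.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["pneumonoultramicroscopicsilicovolcanoconiosis",
             "antidisestablishmentarianism"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]  # decode each token id back to text
    print(f"{word}: {len(token_ids)} tokens -> {pieces}")
```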
I desperately want to be able to get a concrete token count for my prompt before making a call - things like this make it very hard to set the right max_tokens for longer prompt/generation pairs.
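You can count tokens client-side with tiktoken before calling the API. A minimal sketch below; the model name and 4096-token context window are just example assumptions, and chat-format requests add a few tokens of per-message overhead that this doesn't account for.

```python
# Sketch: count prompt tokens locally, then budget max_tokens from
# an assumed context window (4096 here is only an example).
import tiktoken

def remaining_tokens(prompt: str, model: str = "gpt-3.5-turbo",
                     context_window: int = 4096) -> int:
    enc = tiktoken.encoding_for_model(model)
    prompt_tokens = len(enc.encode(prompt))
    return context_window - prompt_tokens

prompt = "Summarize the following article: ..."
print(f"Room left for the completion: {remaining_tokens(prompt)} tokens")
```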