What does HackerNews think of tiktoken?

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

Language: Python

Extracted from the GitHub repo tiktoken [1]

After you try to decode a string, the cached token list shows up on my computer in /tmp/data-gym-cache/9b5ad71b2ce5302211f9c61530b329a4922fc6a4

[1] https://github.com/openai/tiktoken
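For context, that path is tiktoken's download cache: on first use the library fetches the BPE vocabulary file and stores it under the system temp directory (the file name appears to be a hash of the source URL, and the location can be overridden with the TIKTOKEN_CACHE_DIR environment variable). A minimal sketch that would populate that cache, assuming the GPT-2 encoding; the token ids in the comment are illustrative:

    # Encoding/decoding with tiktoken. The first call downloads the BPE
    # vocabulary file and caches it (by default under the system temp dir,
    # e.g. /tmp/data-gym-cache/<hash>).
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")

    tokens = enc.encode("hello world")   # e.g. [31373, 995]
    text = enc.decode(tokens)            # "hello world"
    assert text == "hello world"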

OpenAI have made their tokenizers public [1].

As someone has pointed out, with BPE you specify the vocab size, not the token size. It's a relatively simple algorithm; this Hugging Face course does a nice job of explaining it [2], and the original paper has a very readable Python example [3] (a minimal sketch along those lines follows the links below).

[1] https://github.com/openai/tiktoken

[2] https://huggingface.co/course/chapter6/5?fw=pt

[3] https://arxiv.org/abs/1508.07909
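To make the "you specify the vocab size" point concrete, here is a minimal BPE training loop in the spirit of the snippet in the paper [3]. The toy corpus, word frequencies, and merge count are made up for illustration; the number of merges (and hence the vocab size) is the parameter you choose.

    # Minimal BPE training sketch: repeatedly merge the most frequent
    # adjacent symbol pair until the chosen number of merges is reached.
    import re
    import collections

    def get_stats(vocab):
        """Count frequencies of adjacent symbol pairs across the corpus."""
        pairs = collections.defaultdict(int)
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i + 1])] += freq
        return pairs

    def merge_vocab(pair, v_in):
        """Merge every occurrence of the given symbol pair into one symbol."""
        v_out = {}
        bigram = re.escape(" ".join(pair))
        pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
        for word, freq in v_in.items():
            v_out[pattern.sub("".join(pair), word)] = freq
        return v_out

    # Toy corpus: words split into characters, with an end-of-word marker.
    vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
             "n e w e s t </w>": 6, "w i d e s t </w>": 3}

    num_merges = 10  # in practice: target vocab size minus the base alphabet size
    for _ in range(num_merges):
        pairs = get_stats(vocab)
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
        print(best)  # e.g. ('e', 's'), ('es', 't'), ('est', '</w>'), ...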

It's worth noting that this is only for GPT-3. If you're using ChatGPT or GPT-4, both use a different tokenizer that's more robust and uses/generates about 10% fewer tokens (it's unclear how well it performs for non-English languages).

You can test it offline using tiktoken: https://github.com/openai/tiktoken
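For example, one way to check that rough 10% figure on your own text is to count tokens under both encodings with tiktoken. The encoding names below match tiktoken's registry at the time of writing (r50k_base for the original GPT-3 models, cl100k_base for ChatGPT/GPT-4); the saving is an average over typical English text, so individual inputs will vary.

    # Comparing token counts between the GPT-3-era encoding and the
    # ChatGPT/GPT-4 encoding.
    import tiktoken

    gpt3_enc = tiktoken.get_encoding("r50k_base")     # original GPT-3 models
    chat_enc = tiktoken.get_encoding("cl100k_base")   # gpt-3.5-turbo / gpt-4

    text = "Tokenization differs between model families, sometimes substantially."
    print(len(gpt3_enc.encode(text)), "tokens with r50k_base")
    print(len(chat_enc.encode(text)), "tokens with cl100k_base")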

Hi folks – I work at OpenAI and helped build this page; awesome to see it on here! Heads up that it's a bit out of date, as GPT-4 has a different tokenizer than GPT-3. I'd recommend checking out tiktoken (https://github.com/openai/tiktoken) or this other excellent app that a community member made (https://tiktokenizer.vercel.app)

OpenAI seems to use tiktoken [0]; it also covers GPT-4's token encoding.

[0] https://github.com/openai/tiktoken
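A small sketch of that: tiktoken's encoding_for_model looks up the encoding registered for a given model name, which shows that GPT-4 resolves to a different tokenizer than the GPT-3-era models. The mappings in the comments reflect the library at the time of writing and may change.

    # Model-to-encoding lookup with tiktoken.
    import tiktoken

    print(tiktoken.encoding_for_model("text-davinci-003").name)  # p50k_base
    print(tiktoken.encoding_for_model("gpt-4").name)             # cl100k_base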