I was always wondering how much text a 128k context length in a Large Language Model (LLM) actually holds.
That's why I took an available version of The Lord of the Rings and counted the tokens.
You can check out the token counts here.
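If you want to reproduce the count, here is a minimal sketch using the tiktoken library with its o200k_base encoding; lotr.txt is just a placeholder path for whatever text file you have at hand.

```python
# Minimal sketch: count tokens in a text file with tiktoken's o200k_base encoding.
# "lotr.txt" is a placeholder for your own local copy of the text.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

with open("lotr.txt", encoding="utf-8") as f:
    text = f.read()

tokens = enc.encode(text)
print(f"{len(tokens):,} tokens")
print(f"that is {len(tokens) / 131_072:.2f} x a 128k context window")
```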
I assume they call it 128k for marketing reasons. Context lengths have usually been powers of 2.
GPT-2 had a context length of up to 1024 tokens.
2^9 = 512 tokens
2^10 = 1024 tokens
GPT-3 had a context length of up to 2048 tokens.
2^11 = 2048 tokens
GPT-3.5-Turbo had a context length of up to 4096 tokens.
2^12 = 4096 tokens
GPT-4 Turbo and GPT-4o go up to 128k tokens.
2^13 = 8192 tokens
2^14 = 16384 tokens
2^15 = 32768 tokens
2^16 = 65536 tokens
2^17 = 131072 tokens (the "128k")
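A quick way to double-check the arithmetic: "128k" means 128 × 1024 = 131,072 tokens, which is exactly 2^17.

```python
# Print the power-of-2 ladder from 512 up to "128k".
for exp in range(9, 18):
    print(f"2^{exp} = {2**exp:,} tokens")

# "128k" is 128 * 1024 tokens, i.e. exactly 2^17.
print(128 * 1024 == 2**17)  # True
```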
Tokenizer o200k_base
What's also interesting is the tokenizer used to train the model. The one used for GPT-4o is called o200k_base
and is a byte pair encoding (BPE) tokenizer.
As we know, LLMs work as next-token predictors: the model predicts the next token based on the previous tokens.
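Here is a minimal sketch of that idea using the Hugging Face transformers library with GPT-2 (chosen only because it is small and freely available, not because GPT-4 runs on exactly this code): the model scores every token in its vocabulary and we pick the most likely one.

```python
# Sketch: greedy next-token prediction with a small causal LM (GPT-2 for illustration).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The Lord of the"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits       # shape: (1, seq_len, vocab_size)

next_id = int(logits[0, -1].argmax())     # most likely next token id
print(repr(tokenizer.decode([next_id])))  # the predicted continuation
```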
In the o200k_base file you can see the tokens used for this procedure.
Be aware that every model has a different tokenizer.
All the token counts mentioned above refer to the o200k_base tokenizer.
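To see that in practice, here is a small sketch that encodes the same sentence with two different encodings, cl100k_base (used by GPT-3.5-Turbo and GPT-4) and o200k_base (used by GPT-4o), and prints how each one splits it.

```python
# Compare how two tokenizers split the same sentence into tokens.
import tiktoken

sample = "The Road goes ever on and on, down from the door where it began."

for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(sample)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{name}: {len(ids)} tokens -> {pieces}")
```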
Enjoy! ❤️