Have you ever wondered how much text actually fits into the 128k context length of a Large Language Model (LLM)?
To find out, I took an available text version of The Lord of the Rings and counted the tokens.
You can check out the token counts here.
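If you want to reproduce the count yourself, here is a minimal sketch using OpenAI's tiktoken library. The file name lotr.txt is a placeholder for whatever plain-text copy of the book you have locally:

```python
import tiktoken

# Load the o200k_base encoding (the tokenizer used by GPT-4o).
enc = tiktoken.get_encoding("o200k_base")

# "lotr.txt" is a hypothetical path; point it at your own text file.
with open("lotr.txt", encoding="utf-8") as f:
    text = f.read()

tokens = enc.encode(text)
print(f"{len(tokens):,} tokens")
```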
I assume they advertise it as "128k" for marketing reasons; the exact limit is 131,072 tokens (128 × 1,024), in line with context lengths usually being powers of 2.
The 128k tokens context length in GPT-4 is an impressive leap, showcasing just how far we’ve come in processing vast amounts of text. However, even with this extended capacity, it falls short of accommodating the full text of a book like The Lord of the Rings. This limitation highlights the need for innovative strategies to handle longer texts, such as Retrieval-Augmented Generation (RAG) or other advanced retrieval solutions.
GPT-2 had a context length of up to 1024 tokens.
2^9 = 512 tokens
2^10 = 1024 tokens
GPT-3 had a context length of up to 2048 tokens.
2^11 = 2048 tokens
GPT-3.5-Turbo had a context length of up to 4096 tokens.
2^12 = 4096 tokens
GPT-4 (Turbo) supports a context length of up to 128k tokens.
2^13 = 8192 tokens
2^14 = 16384 tokens
2^15 = 32768 tokens
2^16 = 65536 tokens
2^17 = 131072 tokens
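You can sanity-check this doubling ladder with a couple of lines of Python:

```python
# 2^17 = 131072, i.e. the "128k" limit.
for n in range(9, 18):
    print(f"2^{n} = {2 ** n} tokens")
```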
Tokenizer o200k_base
What's also interesting is the tokenizer used by the model. GPT-4 uses cl100k_base, while the newer GPT-4o uses o200k_base; both are byte-pair-encoding (BPE) tokenizers.
As we know, LLMs work as next-token predictors: the model predicts the next token based on the previous ones.
In the o200k_base vocabulary file you can see the tokens used for this procedure.
Be aware that every model has a different tokenizer.
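A small sketch with tiktoken illustrates this: the same sentence (an arbitrary example) maps to different token IDs, and sometimes a different number of tokens, under the two vocabularies.

```python
import tiktoken

text = "One Ring to rule them all"

# Compare how two different BPE vocabularies split the same sentence.
for name in ("o200k_base", "cl100k_base"):  # GPT-4o vs. GPT-4/GPT-3.5-Turbo
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(text)
    print(f"{name}: {len(ids)} tokens -> {ids}")
```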
All token counts mentioned above refer to the o200k_base tokenizer.
Enjoy! ❤️