Tokenizers are the components that convert raw text into the numerical token sequences that large language models process. Different tokenization algorithms and vocabularies produce significantly different token counts for the same input text, directly impacting cost, context window utilization, and multilingual performance.
Byte Pair Encoding (BPE) is the most widely used subword tokenization method. It builds a vocabulary by iteratively merging the most frequent adjacent byte pairs in training text.
The algorithm works as follows:
1. Start from a base vocabulary of individual bytes (or characters).
2. Count every adjacent pair of symbols in the training corpus.
3. Merge the most frequent pair into a single new token and add it to the vocabulary.
4. Repeat until the target number of merges (vocabulary size) is reached.
BPE creates compact representations for common words while handling rare words via subword decomposition (e.g., “tokenization” becomes “token” + “ization”). It is used in GPT models via tiktoken.
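To make the merge loop concrete, here is a minimal Python sketch of a BPE trainer. This is toy code for illustration only; the `train_bpe` helper and the tiny corpus are invented for this example and are not the tiktoken implementation.

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Toy BPE trainer. word_freqs maps a word to its corpus frequency;
    every word starts as a sequence of single characters."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges

# Frequent substrings like "token" get merged into single vocabulary entries.
corpus = {"token": 10, "tokens": 6, "tokenization": 4, "broken": 2}
print(train_bpe(corpus, num_merges=8))
```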
SentencePiece, most often used with its Unigram language-model algorithm, treats tokenization as a probabilistic segmentation problem rather than a merge-based one: it starts from a large candidate vocabulary, prunes pieces to maximize the likelihood of the training corpus, and segments new text by choosing the most probable split.
SentencePiece is used in T5, the earlier Llama models (Llama 1 and 2), and several other Google models.
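Here is a minimal sketch of training and using a Unigram model with the `sentencepiece` Python package; the corpus file name, model prefix, and vocabulary size are placeholder values.

```python
import sentencepiece as spm

# Train a small Unigram model on a plain-text corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="corpus.txt",           # placeholder path to training text
    model_prefix="unigram_demo",  # writes unigram_demo.model / unigram_demo.vocab
    vocab_size=8000,
    model_type="unigram",         # probabilistic segmentation (vs. "bpe")
)

sp = spm.SentencePieceProcessor(model_file="unigram_demo.model")
print(sp.encode("Tokenization is fun", out_type=str))  # subword pieces
print(sp.encode("Tokenization is fun", out_type=int))  # token ids
```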
tiktoken is OpenAI's fast, Rust-based BPE tokenizer library supporting multiple encodings. It handles regex preprocessing and is approximately 10x faster than pure Python BPE implementations.
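A short usage sketch with the tiktoken API (the sample string is arbitrary):

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")            # GPT-4o-family encoding
tokens = enc.encode("Tokenization determines cost.")
print(len(tokens), tokens)                           # token count and ids
print(enc.decode(tokens))                            # round-trips to the input

# Encodings can also be looked up by model name:
enc_gpt4 = tiktoken.encoding_for_model("gpt-4")      # returns cl100k_base
```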
Key encodings:
| Model Family | Algorithm | Vocabulary Size | Context Window | Notes |
|---|---|---|---|---|
| GPT-4 | BPE (cl100k_base) | ~100,000 | 128K | English-optimized |
| GPT-4o / GPT-4.1 / o3 | BPE (o200k_base) | ~200,000 | 128K-1M | Better multilingual and code support |
| Claude | Custom BPE variant | ~200,000 | 200K | Optimized for long documents |
| Llama 3/4 | BPE (tiktoken-style) | 128K-200K | Up to 10M (Llama 4 Scout) | Byte-level coverage for multilingual text |
| Gemini | Custom | 256K+ | 1M-2M | Multimodal-tuned |
The same text produces different token counts across models due to vocabulary size, training data, and algorithm differences. Typical variance is 10-50%.
| Text Type (1K chars) | GPT-4 (cl100k) | GPT-4o (o200k) | Claude | Llama 4 |
|---|---|---|---|---|
| English prose | ~250 tokens | ~210 tokens | ~230 tokens | ~220 tokens |
| Source code | ~300 tokens | ~260 tokens | ~270 tokens | ~240 tokens |
| Multilingual (mixed) | ~350 tokens | ~300 tokens | ~280 tokens | ~260 tokens |
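A quick way to see this variance for yourself, assuming tiktoken is installed. This only covers the two OpenAI encodings; the Claude and Llama tokenizers are not available through tiktoken, and the sample strings are arbitrary.

```python
import tiktoken

samples = {
    "English prose": "The quick brown fox jumps over the lazy dog near the old riverbank.",
    "Source code":   "def add(a: int, b: int) -> int:\n    return a + b\n",
}

for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    counts = {label: len(enc.encode(text)) for label, text in samples.items()}
    print(name, counts)
```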
Early English-centric tokenizers like cl100k_base use 2-4x more tokens for Asian languages due to byte-level splitting. Modern tokenizers (o200k_base, Llama, Gemini) dedicate vocabulary entries to non-Latin scripts, significantly reducing this overhead.
For example, Chinese characters may require 4-6 tokens with cl100k_base but only 1-2 tokens with o200k_base or SentencePiece-based tokenizers. This means non-English prompts can cost 1.5-3x more with older tokenizers.
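A small sketch that illustrates the gap on non-Latin text, again limited to the two OpenAI encodings available in tiktoken; the Chinese sample sentence is arbitrary, and exact counts will vary with the text.

```python
import tiktoken

text = "机器学习模型使用分词器处理文本。"  # arbitrary Chinese sample sentence

for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    n = len(enc.encode(text))
    print(f"{name}: {n} tokens for {len(text)} characters "
          f"({n / len(text):.1f} tokens/char)")
```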
Token count directly scales API costs. A 100K-token prompt at $2/M input tokens costs $0.20. More efficient tokenizers reduce both cost and context window consumption.
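As a back-of-the-envelope check, here is a tiny cost helper; the `prompt_cost_usd` function and the 15% savings figure are illustrative assumptions, and the $2/M rate is the example price from the text, not a current price list.

```python
def prompt_cost_usd(num_tokens: int, price_per_million_usd: float) -> float:
    """Input cost for a prompt of num_tokens at a per-million-token price."""
    return num_tokens / 1_000_000 * price_per_million_usd

print(prompt_cost_usd(100_000, 2.0))              # 0.2 -> $0.20 per request

# A tokenizer that needs ~15% fewer tokens for the same text saves proportionally
# and frees the same fraction of the context window.
print(prompt_cost_usd(int(100_000 * 0.85), 2.0))  # 0.17
```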