Tokenizer Comparison

Tokenizers are the components that convert raw text into the numerical token sequences that large language models process. Different tokenization algorithms and vocabularies produce significantly different token counts for the same input text, directly impacting cost, context window utilization, and multilingual performance.

Byte Pair Encoding (BPE)

BPE is the most widely used subword tokenization method. It builds a vocabulary by iteratively merging the most frequent adjacent byte pairs in training text.

The algorithm works as follows:

  1. Initialization: Represent the corpus as a list of individual bytes (UTF-8 encoded characters)
  2. Frequency counting: Identify the most common adjacent byte pairs (e.g., “t” + “h” becomes “th”)
  3. Merging: Replace all occurrences of that pair with a new token in the vocabulary; repeat for a fixed number of merges (typically 30,000 to 100,000+ steps)
  4. Encoding: To tokenize new text, apply the learned merges in the order they were learned, falling back to individual bytes for sequences not covered by any merge
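The training loop above (steps 1-3) can be sketched in a few lines of Python. This is a toy, character-level illustration, not a production tokenizer: real BPE implementations operate on bytes and use regex pre-tokenization.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words (toy character-level sketch)."""
    # Step 1: represent each word as a tuple of individual symbols.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Step 3: replace every occurrence of the best pair with a merged symbol.
        merged = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        words = merged
    return merges

merges = learn_bpe(["the", "this", "that", "other"], num_merges=3)
# "t"+"h" is the most frequent pair, so ("t", "h") is the first merge learned.
```

On this tiny corpus the first merge is `("t", "h")` and the second is `("th", "e")`, mirroring the “t” + “h” becomes “th” example in step 2.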

BPE creates compact representations for common words while handling rare words via subword decomposition (e.g., “tokenization” becomes “token” + “ization”). It is used in GPT models via tiktoken.

SentencePiece (Unigram Model)

SentencePiece treats tokenization as a probabilistic segmentation problem rather than a merge-based one. Its unigram model starts from a large candidate vocabulary, assigns each token a probability, and iteratively prunes low-value tokens with an EM procedure; at inference time, it selects the segmentation that maximizes the product of token probabilities (typically via the Viterbi algorithm).

SentencePiece is used in T5, Llama 1 and 2, and several Google models such as Gemma.
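The "most probable segmentation" step can be sketched with a small Viterbi-style dynamic program. The vocabulary and its log-probabilities below are made-up toy values for illustration, not real SentencePiece model parameters.

```python
import math

def segment(text, vocab_logprobs):
    """Most-probable segmentation of `text` under a unigram model (Viterbi sketch).

    `vocab_logprobs` maps candidate tokens to log-probabilities (toy values).
    """
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i]: log-prob of the best split of text[:i]
    best[0] = 0.0
    back = [0] * (n + 1)           # back[i]: start index of the last token in that split
    for i in range(1, n + 1):
        for j in range(max(0, i - 10), i):   # cap candidate token length at 10 chars
            lp = vocab_logprobs.get(text[j:i])
            if lp is not None and best[j] + lp > best[i]:
                best[i] = best[j] + lp
                back[i] = j
    # Recover the winning token sequence by walking the back-pointers.
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

# Hypothetical vocabulary: "token" + "ization" (-5.0 total) beats
# finer-grained splits like "to" + "ken" + "iz" + "ation" (-17.0 total).
vocab = {"token": -2.0, "ization": -3.0, "to": -4.0,
         "ken": -4.0, "iz": -5.0, "ation": -4.0}
tokens = segment("tokenization", vocab)
```

Unlike BPE's greedy merges, this picks the globally best-scoring split, which is the essential difference between the two approaches.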

tiktoken (OpenAI)

tiktoken is OpenAI's fast, Rust-based BPE tokenizer library supporting multiple encodings. It handles regex preprocessing and is approximately 10x faster than pure Python BPE implementations.
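The regex preprocessing step splits text into chunks (words, numbers, punctuation runs, whitespace) before BPE merges are applied within each chunk. The pattern below is a deliberately simplified ASCII stand-in for illustration; tiktoken's real patterns use Unicode properties and contraction handling.

```python
import re

# Simplified stand-in for a BPE pre-tokenization regex (not tiktoken's actual
# pattern): an optional leading space glued to a run of letters, digits, or
# other symbols, else bare whitespace.
PRETOKEN_RE = re.compile(r" ?[A-Za-z]+| ?\d+| ?[^\sA-Za-z\d]+|\s+")

def pretokenize(text):
    """Split text into chunks; BPE merges then apply only within a chunk,
    never across chunk boundaries."""
    return PRETOKEN_RE.findall(text)

chunks = pretokenize("Hello, world! 42")
# -> ['Hello', ',', ' world', '!', ' 42']
```

Keeping the leading space attached to a word is why common tokens like " world" (space included) appear in BPE vocabularies.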

Key encodings:

  * r50k_base: ~50K vocabulary, used by early GPT-3 models
  * p50k_base: ~50K vocabulary, used by Codex and the text-davinci models
  * cl100k_base: ~100K vocabulary, used by GPT-3.5-turbo and GPT-4
  * o200k_base: ~200K vocabulary, used by GPT-4o and the o-series models

Model-Specific Tokenizers

Model Family | Algorithm | Vocabulary Size | Context Window | Notes
GPT-4 | BPE (cl100k_base) | ~100,000 | 8K-128K | English-optimized
GPT-4o / GPT-4.1 / o3 | BPE (o200k_base) | ~200,000 | 128K-1M | Better multilingual and code support
Claude | Custom BPE variant | ~200,000 | 200K | Optimized for long documents
Llama 3/4 | BPE (tiktoken-style) | 128K-256K | Up to 10M (Llama 4 Scout) | Byte fallback for multilingual
Gemini | Custom (SentencePiece-based) | 256K+ | 1M-2M | Multimodal-tuned

Token Count Comparisons

The same text produces different token counts across models due to vocabulary size, training data, and algorithm differences. Typical variance is 10-50%.

Text Type (per 1K chars) | GPT-4 (cl100k) | GPT-4o (o200k) | Claude | Llama 4
English prose | ~250 tokens | ~210 tokens | ~230 tokens | ~220 tokens
Source code | ~300 tokens | ~260 tokens | ~270 tokens | ~240 tokens
Multilingual (mixed) | ~350 tokens | ~300 tokens | ~280 tokens | ~260 tokens
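A convenient way to compare these figures is characters per token: higher means the tokenizer packs more text into each token. The arithmetic below uses the English-prose row of the table above.

```python
def chars_per_token(chars, tokens):
    """Average characters per token; higher means a more efficient tokenizer."""
    return chars / tokens

# English prose figures from the comparison table (1K characters).
cl100k = chars_per_token(1000, 250)  # cl100k_base: 4.0 chars/token
o200k = chars_per_token(1000, 210)   # o200k_base: ~4.76 chars/token
```

By this measure, o200k_base is roughly 19% more efficient than cl100k_base on English prose, matching the ~250 vs ~210 token counts in the table.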

Multilingual Tokenization

Early English-centric tokenizers like cl100k_base use 2-4x more tokens for Asian languages due to byte-level splitting. Modern tokenizers (o200k_base, Llama, Gemini) dedicate vocabulary entries to non-Latin scripts, significantly reducing this overhead.

For example, Chinese characters may require 4-6 tokens with cl100k_base but only 1-2 tokens with o200k_base or SentencePiece-based tokenizers. This means non-English prompts can cost 1.5-3x more with older tokenizers.
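The byte-level splitting penalty follows directly from UTF-8 encoding: a CJK character occupies 3 bytes, so a tokenizer falling back to raw bytes can spend up to 3 tokens on a single character before any merges help.

```python
text = "你好"  # "hello" in Chinese: 2 characters
utf8_bytes = len(text.encode("utf-8"))  # 6 bytes: 3 per CJK character
```

A vocabulary with dedicated entries for these characters (or common multi-character words) collapses each to a single token, which is where the 4-6x vs 1-2x difference cited above comes from.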

Impact on Cost and Context

Token count directly scales API costs. A 100K-token prompt at $2/M input tokens costs $0.20. More efficient tokenizers reduce both cost and context window consumption: fitting the same content into fewer tokens lowers per-request spend and leaves more of the window free for additional context.
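The cost arithmetic is straightforward to check. The rate below is the illustrative $2/M figure from the text, not any provider's actual price.

```python
def prompt_cost(tokens, usd_per_million_tokens):
    """Input cost in USD for a prompt, at a given per-million-token rate."""
    return tokens / 1_000_000 * usd_per_million_tokens

cost = prompt_cost(100_000, 2.00)  # -> 0.20, matching the example above
```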
