Tokenizer Comparison

Tokenizers are the components that convert raw text into the numerical token sequences that large language models process. Different tokenization algorithms and vocabularies produce significantly different token counts for the same input text, directly impacting cost, context window utilization, and multilingual performance.

Byte Pair Encoding (BPE)

BPE is the most widely used subword tokenization method. It builds a vocabulary by iteratively merging the most frequent adjacent byte pairs in training text.

The algorithm works as follows:

  1. Initialization: Represent the corpus as a list of individual bytes (UTF-8 encoded characters)
  2. Frequency counting: Identify the most common adjacent byte pairs (e.g., “t” + “h” becomes “th”)
  3. Merging: Replace all occurrences of that pair with a new token in the vocabulary; repeat for a fixed number of merges (typically 30,000 to 100,000+ steps)
  4. Encoding: To tokenize new text, apply the learned merges in the order they were learned, falling back to individual bytes for sequences not covered by any merge
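The training loop above (steps 1-3) can be sketched in a few lines of Python. This is a toy, character-level illustration, not a production tokenizer: real BPE implementations operate on bytes and use regex pre-tokenization.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words (toy character-level sketch)."""
    # Step 1: represent each word as a tuple of individual symbols.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Step 3: replace every occurrence of the best pair with a merged symbol.
        merged = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        words = merged
    return merges

merges = learn_bpe(["the", "this", "that", "other"], num_merges=3)
# "t"+"h" is the most frequent pair, so ("t", "h") is the first merge learned.
```

On this tiny corpus the first merge is `("t", "h")` and the second is `("th", "e")`, mirroring the “t” + “h” becomes “th” example in step 2.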

BPE creates compact representations for common words while handling rare words via subword decomposition (e.g., “tokenization” becomes “token” + “ization”). It is used in GPT models via tiktoken.

SentencePiece (Unigram Model)

SentencePiece treats tokenization as a probabilistic segmentation problem rather than a merge-based one. Its unigram model starts from a large candidate vocabulary, assigns each token a probability, and iteratively prunes low-value tokens with an EM procedure; at inference time, it selects the segmentation that maximizes the product of token probabilities (typically via the Viterbi algorithm).

SentencePiece is used in T5, Llama 1 and 2, and several Google models such as Gemma.
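The "most probable segmentation" step can be sketched with a small Viterbi-style dynamic program. The vocabulary and its log-probabilities below are made-up toy values for illustration, not real SentencePiece model parameters.

```python
import math

def segment(text, vocab_logprobs):
    """Most-probable segmentation of `text` under a unigram model (Viterbi sketch).

    `vocab_logprobs` maps candidate tokens to log-probabilities (toy values).
    """
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i]: log-prob of the best split of text[:i]
    best[0] = 0.0
    back = [0] * (n + 1)           # back[i]: start index of the last token in that split
    for i in range(1, n + 1):
        for j in range(max(0, i - 10), i):   # cap candidate token length at 10 chars
            lp = vocab_logprobs.get(text[j:i])
            if lp is not None and best[j] + lp > best[i]:
                best[i] = best[j] + lp
                back[i] = j
    # Recover the winning token sequence by walking the back-pointers.
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

# Hypothetical vocabulary: "token" + "ization" (-5.0 total) beats
# finer-grained splits like "to" + "ken" + "iz" + "ation" (-17.0 total).
vocab = {"token": -2.0, "ization": -3.0, "to": -4.0,
         "ken": -4.0, "iz": -5.0, "ation": -4.0}
tokens = segment("tokenization", vocab)
```

Unlike BPE's greedy merges, this picks the globally best-scoring split, which is the essential difference between the two approaches.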

tiktoken (OpenAI)

tiktoken is OpenAI's fast, Rust-based BPE tokenizer library supporting multiple encodings. It handles regex preprocessing and is approximately 10x faster than pure Python BPE implementations.
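The regex preprocessing step splits text into chunks (words, numbers, punctuation runs, whitespace) before BPE merges are applied within each chunk. The pattern below is a deliberately simplified ASCII stand-in for illustration; tiktoken's real patterns use Unicode properties and contraction handling.

```python
import re

# Simplified stand-in for a BPE pre-tokenization regex (not tiktoken's actual
# pattern): an optional leading space glued to a run of letters, digits, or
# other symbols, else bare whitespace.
PRETOKEN_RE = re.compile(r" ?[A-Za-z]+| ?\d+| ?[^\sA-Za-z\d]+|\s+")

def pretokenize(text):
    """Split text into chunks; BPE merges then apply only within a chunk,
    never across chunk boundaries."""
    return PRETOKEN_RE.findall(text)

chunks = pretokenize("Hello, world! 42")
# -> ['Hello', ',', ' world', '!', ' 42']
```

Keeping the leading space attached to a word is why common tokens like " world" (space included) appear in BPE vocabularies.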

Key encodings:

  * r50k_base: ~50K vocabulary, used by early GPT-3 models
  * p50k_base: ~50K vocabulary, used by Codex and the text-davinci models
  * cl100k_base: ~100K vocabulary, used by GPT-3.5-turbo and GPT-4
  * o200k_base: ~200K vocabulary, used by GPT-4o and the o-series models

Model-Specific Tokenizers

Model Family | Algorithm | Vocabulary Size | Context Window | Notes
GPT-4 | BPE (cl100k_base) | ~100,000 | 8K-128K | English-optimized
GPT-4o / GPT-4.1 / o3 | BPE (o200k_base) | ~200,000 | 128K-1M | Better multilingual and code support
Claude | Custom BPE variant | ~200,000 | 200K | Optimized for long documents
Llama 3/4 | BPE (tiktoken-style) | 128K-256K | Up to 10M (Llama 4 Scout) | Byte fallback for multilingual
Gemini | Custom (SentencePiece-based) | 256K+ | 1M-2M | Multimodal-tuned

Token Count Comparisons

The same text produces different token counts across models due to vocabulary size, training data, and algorithm differences. Typical variance is 10-50%.

Text Type (per 1K chars) | GPT-4 (cl100k) | GPT-4o (o200k) | Claude | Llama 4
English prose | ~250 tokens | ~210 tokens | ~230 tokens | ~220 tokens
Source code | ~300 tokens | ~260 tokens | ~270 tokens | ~240 tokens
Multilingual (mixed) | ~350 tokens | ~300 tokens | ~280 tokens | ~260 tokens
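A convenient way to compare these figures is characters per token: higher means the tokenizer packs more text into each token. The arithmetic below uses the English-prose row of the table above.

```python
def chars_per_token(chars, tokens):
    """Average characters per token; higher means a more efficient tokenizer."""
    return chars / tokens

# English prose figures from the comparison table (1K characters).
cl100k = chars_per_token(1000, 250)  # cl100k_base: 4.0 chars/token
o200k = chars_per_token(1000, 210)   # o200k_base: ~4.76 chars/token
```

By this measure, o200k_base is roughly 19% more efficient than cl100k_base on English prose, matching the ~250 vs ~210 token counts in the table.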

Multilingual Tokenization

Early English-centric tokenizers like cl100k_base use 2-4x more tokens for Asian languages due to byte-level splitting. Modern tokenizers (o200k_base, Llama, Gemini) dedicate vocabulary entries to non-Latin scripts, significantly reducing this overhead.

For example, Chinese characters may require 4-6 tokens with cl100k_base but only 1-2 tokens with o200k_base or SentencePiece-based tokenizers. This means non-English prompts can cost 1.5-3x more with older tokenizers.
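The byte-level splitting penalty follows directly from UTF-8 encoding: a CJK character occupies 3 bytes, so a tokenizer falling back to raw bytes can spend up to 3 tokens on a single character before any merges help.

```python
text = "你好"  # "hello" in Chinese: 2 characters
utf8_bytes = len(text.encode("utf-8"))  # 6 bytes: 3 per CJK character
```

A vocabulary with dedicated entries for these characters (or common multi-character words) collapses each to a single token, which is where the 4-6x vs 1-2x difference cited above comes from.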

Impact on Cost and Context

Token count directly scales API costs. A 100K-token prompt at $2/M input tokens costs $0.20. More efficient tokenizers reduce both cost and context window consumption: fitting the same content into fewer tokens lowers per-request spend and leaves more of the window free for additional context.
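The cost arithmetic is straightforward to check. The rate below is the illustrative $2/M figure from the text, not any provider's actual price.

```python
def prompt_cost(tokens, usd_per_million_tokens):
    """Input cost in USD for a prompt, at a given per-million-token rate."""
    return tokens / 1_000_000 * usd_per_million_tokens

cost = prompt_cost(100_000, 2.00)  # -> 0.20, matching the example above
```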
