====== Tokenizer Comparison ======

Tokenizers are the components that convert raw text into the numerical token sequences that large language models process. Different tokenization algorithms and vocabularies produce significantly different token counts for the same input text, directly impacting cost, context window utilization, and multilingual performance. ((https://www.salttechno.ai/datasets/llm-model-comparison-2026/|SaltTechno: LLM Model Comparison 2026))

===== Byte Pair Encoding (BPE) =====

BPE is the most widely used subword tokenization method. It builds a vocabulary by iteratively merging the most frequent adjacent symbol pairs in the training text. ((https://www.cloudidr.com/blog/llm-pricing-comparison-2026|CloudIDR: LLM Pricing Comparison)) The algorithm works as follows:

  - **Initialization**: represent the corpus as a sequence of individual bytes (UTF-8 encoded)
  - **Frequency counting**: identify the most common adjacent pair (e.g., "t" + "h")
  - **Merging**: replace all occurrences of that pair with a new vocabulary token ("th"); repeat for a fixed number of merges (typically 30,000 to 100,000+)
  - **Encoding**: to tokenize new text, apply the learned merges in the order they were learned, falling back to individual bytes for sequences no merge covers

BPE creates compact representations for common words while handling rare words via subword decomposition (e.g., "tokenization" becomes "token" + "ization"). It is used in GPT models via tiktoken.

===== SentencePiece (Unigram Model) =====

SentencePiece treats tokenization as a probabilistic segmentation problem rather than a merge-based one.
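The BPE merge loop described above can be sketched in a few lines of Python. This is a toy character-level illustration, not how production tokenizers work internally (tiktoken operates on UTF-8 bytes with regex pre-tokenization); ''learn_bpe'' and the tiny corpus are hypothetical, chosen only to make the merge order visible:

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Toy BPE trainer: learn merges from a whitespace-split corpus.
    Character-level for readability; real BPE starts from UTF-8 bytes."""
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing the best pair with its merged symbol.
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges

print(learn_bpe("low lower lowest low low", 3))
# → [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

Note how frequent substrings ("lo", then "low") are promoted to single symbols first; rare suffixes like "est" survive as smaller pieces, which is exactly the subword decomposition behavior described above.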
((https://www.salttechno.ai/datasets/llm-model-comparison-2026/|SaltTechno: LLM Model Comparison))

  * Learns a vocabulary by scoring subword candidates under a unigram probability distribution
  * Prunes low-probability tokens with an EM (Expectation-Maximization) algorithm until a target vocabulary size is reached
  * Uses the Viterbi algorithm at inference time to find the highest-probability segmentation of the whole input, rather than committing greedily as BPE does
  * Language-agnostic (no whitespace required), making it well suited to multilingual text

SentencePiece is used in T5, Llama 1/2, and several Google models. (Llama 3 and later moved to a tiktoken-style BPE tokenizer.)

===== tiktoken (OpenAI) =====

tiktoken is OpenAI's fast, Rust-backed BPE tokenizer library supporting multiple encodings. It handles the regex pre-tokenization step and is roughly 10x faster than pure-Python BPE implementations. ((https://www.salttechno.ai/datasets/llm-model-comparison-2026/|SaltTechno: LLM Comparison)) Key encodings:

  * **cl100k_base**: 100,277 vocabulary tokens, used by GPT-4 and GPT-3.5-turbo. Roughly 1 token per 4 English characters.
  * **o200k_base**: roughly 200,000 vocabulary tokens, used by GPT-4o, GPT-4.1, and the o-series models. 15-20% more efficient than cl100k_base, especially on non-English text and code.

===== Model-Specific Tokenizers =====

^ Model Family ^ Algorithm ^ Vocabulary Size ^ Context Window ^ Notes ^
| GPT-4 | BPE (cl100k_base) | 100,277 | 128K | English-optimized |
| GPT-4o / GPT-4.1 / o3 | BPE (o200k_base) | ~200,000 | 128K-1M | Better multilingual and code support |
| Claude | Custom BPE variant | ~200,000 | 200K | Optimized for long documents |
| Llama 3/4 | BPE (tiktoken-based) | 128K-256K | Up to 10M (Llama 4 Scout) | Byte-level fallback for multilingual text |
| Gemini | Custom | 256K+ | 1M-2M | Multimodal-tuned |

===== Token Count Comparisons =====

The same text produces different token counts across models due to differences in vocabulary size, training data, and algorithm. Typical variance is 10-50%.
((https://www.salttechno.ai/datasets/llm-model-comparison-2026/|SaltTechno: LLM Model Comparison))

^ Text Type (1K chars) ^ GPT-4 (cl100k) ^ GPT-4o (o200k) ^ Claude ^ Llama 4 ^
| English prose | ~250 tokens | ~210 tokens | ~230 tokens | ~220 tokens |
| Source code | ~300 tokens | ~260 tokens | ~270 tokens | ~240 tokens |
| Multilingual (mixed) | ~350 tokens | ~300 tokens | ~280 tokens | ~260 tokens |

===== Multilingual Tokenization =====

Older English-centric tokenizers such as cl100k_base need 2-4x more tokens for Asian-language text because characters missing from the vocabulary are split at the byte level. Modern tokenizers (o200k_base, Llama, Gemini) dedicate vocabulary entries to non-Latin scripts, significantly reducing this overhead. ((https://www.salttechno.ai/datasets/llm-model-comparison-2026/|SaltTechno: LLM Comparison))

For example, a short Chinese phrase may require 4-6 tokens with cl100k_base but only 1-2 tokens with o200k_base or SentencePiece-based tokenizers. As a result, non-English prompts can cost 1.5-3x more with older tokenizers.

===== Impact on Cost and Context =====

Token count scales API cost directly: a 100K-token prompt at $2 per million input tokens costs $0.20. More efficient tokenizers reduce both cost and context-window consumption: ((https://www.cloudidr.com/blog/llm-pricing-comparison-2026|CloudIDR: LLM Pricing Comparison))

  * o200k_base and the Llama tokenizers save 15-30% over cl100k_base on mixed workloads
  * Multilingual savings compound further for global applications
  * Always measure with the model-specific tokenizer before deployment; character-based estimates are only approximate

===== See Also =====

  * [[prompt_caching|Prompt Caching]]
  * [[embedding_models_comparison|Embedding Models Comparison]]

===== References =====