Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
Tokenization is the process of converting raw text into discrete numerical tokens that language models can process. Modern LLMs use subword tokenization algorithms that balance vocabulary size against sequence length, enabling efficient handling of rare words, multilingual text, and structured formats like JSON and code.
Subword tokenization splits text into units between characters and full words. The key insight is that common words remain intact while rare words decompose into frequent subword pieces: "the" stays a single token, while a rarer word like "unhappiness" might split into "un" + "happi" + "ness".
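This splitting can be sketched with greedy longest-match against a vocabulary (the strategy WordPiece uses; BPE instead replays learned merges, shown further below). The tiny vocabulary here is purely illustrative:

```python
# Toy vocabulary for illustration only; real vocabularies hold tens of
# thousands of entries learned from a corpus.
VOCAB = {"the", "un", "happi", "ness", "token", "a", "e", "h", "i",
         "n", "o", "p", "s", "t", "u"}

def greedy_subword_split(word, vocab):
    """Split a word by repeatedly taking the longest vocabulary entry
    that matches the remaining prefix (WordPiece-style greedy matching)."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try longest match first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown character: emit as-is
            i += 1
    return pieces

print(greedy_subword_split("the", VOCAB))          # common word stays intact
print(greedy_subword_split("unhappiness", VOCAB))  # rare word decomposes
```

Note how the common word survives as one piece while the rare word falls apart into frequent fragments, which is exactly the trade-off subword tokenization is designed around.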
The process at inference time: the input text is broken into base units (bytes or characters), the learned merge rules are applied in priority order until no rule matches, and the resulting subwords are mapped to integer IDs from the vocabulary.
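The inference-time loop can be sketched as follows; the merge table here is a hypothetical stand-in for the tens of thousands of merges a real model ships with:

```python
# Hypothetical merge table: lower rank = learned earlier, applied first.
MERGES = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2, ("low", "er"): 3}

def bpe_encode(word, merges):
    """Start from characters and repeatedly apply the lowest-ranked
    adjacent merge until no learned merge applies."""
    tokens = list(word)
    while len(tokens) > 1:
        # Rank every adjacent pair; unknown pairs get infinite rank.
        pairs = [(merges.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(tokens, tokens[1:]))]
        rank, i = min(pairs)
        if rank == float("inf"):
            break  # no adjacent pair is in the merge table
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens

print(bpe_encode("lower", MERGES))   # fully merges into one token
print(bpe_encode("lowest", MERGES))  # only partial merges apply
```

A word covered by the merge table collapses into few tokens, while an unseen suffix stays as loose characters; a production tokenizer would then map each piece to its integer ID.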
BPE builds its vocabulary iteratively from a training corpus: start with a base vocabulary of individual characters (or bytes), count the frequency of every adjacent symbol pair in the corpus, merge the most frequent pair into a new symbol, and repeat until the target vocabulary size is reached.
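A minimal sketch of this training loop, on a toy word-frequency corpus (the classic "newest"/"widest" example from the BPE literature):

```python
from collections import Counter

def learn_bpe(corpus_words, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent
    symbol pair across the corpus. A toy sketch of the standard algorithm."""
    # Represent each word as a tuple of symbols, weighted by frequency.
    words = Counter(tuple(w) for w in corpus_words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)
        # Rewrite every word with the chosen pair merged into one symbol.
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
print(learn_bpe(corpus, 4))
```

Each returned pair becomes a new vocabulary entry; the inference-time encoder replays these merges in the same order.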
Each merge creates a longer subword. The final vocabulary contains the base characters plus all merged tokens. BPE is used by GPT-2, GPT-3, GPT-4, and many other models.
SentencePiece (Google) is a language-agnostic tokenizer that operates directly on raw Unicode text without requiring pre-tokenization or whitespace segmentation. It treats whitespace as an ordinary symbol (encoded as the visible marker ▁, U+2581), which makes tokenization losslessly reversible, and it supports both BPE and unigram language-model training algorithms.
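The reversible whitespace handling can be sketched in a few lines (a conceptual illustration of the ▁ marker, not the actual SentencePiece implementation):

```python
# SentencePiece replaces spaces with the visible marker ▁ (U+2581), so the
# token stream concatenates back to the exact original text with no
# separate detokenizer.
MARKER = "\u2581"  # ▁

def sp_normalize(text):
    """Prefix a marker and replace spaces with markers, as SentencePiece
    does before learning subwords on the raw stream."""
    return MARKER + text.replace(" ", MARKER)

def sp_denormalize(tokens):
    """Concatenate tokens and map markers back to spaces."""
    return "".join(tokens).replace(MARKER, " ").lstrip(" ")

normalized = sp_normalize("hello world")
print(normalized)                                 # ▁hello▁world
print(sp_denormalize(["\u2581hello", "\u2581world"]))  # hello world
```

Because spaces are ordinary symbols, the same pipeline works for languages without whitespace word boundaries, which is why SentencePiece needs no language-specific pre-tokenizer.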
tiktoken is OpenAI's fast BPE tokenizer implementation, written in Rust with Python bindings. A usage example appears at the end of this section.
| Model | Tokenizer | Vocab Size | Context Length |
|---|---|---|---|
| GPT-2 | BPE | 50,257 | 1,024 |
| GPT-4 | BPE (tiktoken) | 100,277 | 128K |
| Llama 2 | SentencePiece | 32,000 | 4,096 |
| Llama 3 | BPE (tiktoken-based) | 128,256 | 128K |
| BERT | WordPiece | 30,522 | 512 |
| Claude 3 | BPE variant | ~100K | 200K |
Tokenization has significant implications for AI agents that generate structured outputs:
JSON tokens: Structured output like `{"name": "search", "args": {"query": "test"}}` consumes far more tokens than the equivalent natural language. Curly braces, colons, quotes, and keys each become separate tokens. A single JSON function call can cost 20-40 tokens.
Code tokens: Programming syntax (brackets, operators, indentation) tokenizes inefficiently. Even a short line like `import numpy as np` costs a handful of tokens. Minified code saves tokens but reduces model comprehension.
Context budget pressure: Agents must fit system prompts, conversation history, tool definitions, and tool outputs within the context window. Token-inefficient formats like verbose JSON schemas consume budget that could hold more useful context.
Approximate conversion rates: for English prose, one token is roughly 4 characters or about 0.75 words (so 100 tokens covers around 75 words); code and JSON are denser, often closer to 2-3 characters per token.
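These rules of thumb translate into a quick estimator; the default ratio is the common chars-per-token heuristic for English, and the lower ratio passed for JSON is an assumption, not a measured constant:

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate from character count. The 4 chars/token
    ratio is a common rule of thumb for English prose; code and JSON
    often run closer to 2-3 chars/token, so pass a lower ratio there."""
    return max(1, round(len(text) / chars_per_token))

prose = "The Transformer architecture uses self-attention mechanisms."
print(estimate_tokens(prose))                                    # prose estimate
print(estimate_tokens('{"tool": "search"}', chars_per_token=2.5))  # denser format
```

Estimates like this are fine for budget planning, but anything that must not overflow the context window should count with the model's real tokenizer.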
Budget management strategies: truncate or summarize older conversation turns, compress verbose tool outputs before re-inserting them into context, keep tool and schema definitions terse, and reserve headroom for the model's completion.
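The simplest of these strategies, evicting the oldest turns first, can be sketched as below. The default `count_tokens` is a crude chars/4 estimate (an assumption for illustration); in practice you would pass a real tokenizer such as tiktoken:

```python
def trim_history(messages, token_budget, count_tokens=lambda m: len(m) // 4):
    """Drop the oldest messages until the conversation fits the budget.
    count_tokens defaults to a rough chars/4 estimate; swap in a real
    tokenizer (e.g. tiktoken's encode) for production use."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > token_budget:
        kept.pop(0)  # evict the oldest turn first
    return kept

history = ["old " * 50, "recent " * 10, "newest"]
print(trim_history(history, 30))  # oldest turn is evicted
```

Eviction is lossy; a summarization step over the dropped turns preserves more of the conversation at the same token cost.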
```python
import tiktoken

# Initialize the tokenizer for GPT-4
enc = tiktoken.get_encoding("cl100k_base")

# Tokenize text
text = "The Transformer architecture uses self-attention mechanisms."
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Token count: {len(tokens)}")
print(f"Decoded tokens: {[enc.decode([t]) for t in tokens]}")

# Compare token costs of different formats
examples = {
    "natural_language": "Search for weather in Tokyo",
    "json_format": '{"tool": "search", "args": {"query": "weather Tokyo"}}',
    "python_code": "result = search_api(query='weather Tokyo')",
}
for name, example in examples.items():
    token_count = len(enc.encode(example))
    ratio = len(example) / token_count
    print(f"{name}: {token_count} tokens ({ratio:.1f} chars/token)")

# Context budget calculator
def check_budget(messages, max_tokens=128000, reserve_completion=4096):
    total = sum(len(enc.encode(m)) for m in messages)
    available = max_tokens - total - reserve_completion
    print(f"Used: {total}, Available: {available}, Reserve: {reserve_completion}")
    return available > 0
```