====== Tokenization ======

Tokenization is the process of converting raw text into discrete numerical tokens that language models can process. Modern LLMs use subword tokenization algorithms that balance vocabulary size against sequence length, enabling efficient handling of rare words, multilingual text, and structured formats such as JSON and code.

===== How Subword Tokenization Works =====

Subword tokenization splits text into units between characters and full words. The key insight is that common words remain intact while rare words decompose into frequent subword pieces:

  * **"understanding"** might tokenize as ["under", "stand", "ing"]
  * **"unbelievable"** might become ["un", "believ", "able"]
  * **Unknown words** always decompose into known subwords or bytes, eliminating out-of-vocabulary errors

The process at inference time:

  - Normalize the input text (whitespace, Unicode normalization)
  - Greedily match the longest vocabulary token from left to right
  - Fall back to smaller subwords or individual bytes for unmatched segments
  - Map each token to its integer ID in the vocabulary

===== Byte Pair Encoding (BPE) =====

BPE builds its vocabulary iteratively from a training corpus:

  - **Step 1**: Initialize the vocabulary with all individual bytes (or characters)
  - **Step 2**: Count all adjacent token pairs in the corpus
  - **Step 3**: Merge the most frequent pair into a new token
  - **Step 4**: Repeat steps 2-3 for a target number of merges (e.g., 50,000)

Each merge creates a longer subword. The final vocabulary contains the base characters plus all merged tokens. BPE is used by GPT-2, GPT-3, GPT-4, and many other models.
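The merge loop above can be sketched in a few lines of Python. This is a toy illustration of the training procedure, not a production implementation; ''bpe_train'' is a made-up helper name and the corpus is the classic small example:

<code python>
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merges from a whitespace-split corpus (toy sketch)."""
    # Step 1: represent each word as a sequence of single characters
    vocab = Counter()
    for word in corpus.split():
        vocab[tuple(word)] += 1

    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Step 3: merge the most frequent pair into one new symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab  # Step 4: repeat on the updated corpus
    return merges

merges = bpe_train("low low low lower lower newest newest newest widest", 4)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't')]
</code>

Note how frequent fragments like "lo"/"low" and "es"/"est" emerge first — exactly the behavior that keeps common words intact while rare words split into pieces.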
===== SentencePiece =====

SentencePiece (Google) is a language-agnostic tokenizer that operates directly on raw Unicode text without requiring pre-tokenization or whitespace segmentation:

  * Treats the input as a raw character stream, making it suitable for any language
  * Supports both BPE and Unigram algorithms; Unigram starts with a large vocabulary and prunes the tokens whose removal least degrades the training likelihood
  * Used by Llama 2, T5, PaLM, and many multilingual models
  * Handles whitespace as an ordinary symbol, replacing spaces with a visible marker (▁)

===== tiktoken =====

tiktoken is OpenAI's fast BPE tokenizer implementation, written in Rust with Python bindings:

  * Implements the cl100k_base encoding used by GPT-4 (100K+ vocabulary)
  * Provides deterministic, reproducible tokenization
  * Designed for precise token counting before API calls

===== Vocabulary Sizes =====

^ Model ^ Tokenizer ^ Vocab Size ^ Context Length ^
| GPT-2 | BPE | 50,257 | 1,024 |
| GPT-4 | BPE (tiktoken) | 100,277 | 128K |
| Llama 2 | SentencePiece | 32,000 | 4,096 |
| Llama 3 | BPE (tiktoken-based) | 128,256 | 128K |
| BERT | WordPiece | 30,522 | 512 |
| Claude 3 | BPE variant | ~100K | 200K |

===== Impact on Agent Tool Use =====

Tokenization has significant implications for AI agents that generate structured outputs:

**JSON tokens**: Structured output like ''{"name": "search", "args": {"query": "test"}}'' consumes far more tokens than the equivalent natural language. Curly braces, colons, quotes, and keys each become separate tokens; a single JSON function call can cost 20-40 tokens.

**Code tokens**: Programming syntax (brackets, operators, indentation) tokenizes inefficiently. A line like ''import numpy as np'' becomes 5-8 tokens. Minified code saves tokens but reduces model comprehension.

**Context budget pressure**: Agents must fit system prompts, conversation history, tool definitions, and tool outputs within the context window. Token-inefficient formats such as verbose JSON schemas consume budget that could hold more useful context.
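The JSON-overhead point can be made concrete with the greedy longest-match procedure described earlier, run against a tiny made-up vocabulary (''VOCAB'' and ''greedy_tokenize'' are illustrative only; real model vocabularies have 30K-128K entries and real tokenizers are far more sophisticated):

<code python>
# Toy vocabulary: a handful of words plus common punctuation
VOCAB = {"search", "query", "weather", "Tokyo", "for", "in",
         " ", '"', "{", "}", ":", ","}

def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization with single-character fallback."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest possible match starting at position i
        match = next((text[i:j] for j in range(len(text), i, -1)
                      if text[i:j] in vocab), text[i])
        tokens.append(match)
        i += len(match)
    return tokens

nl = "search for weather in Tokyo"
js = '{"query": "weather Tokyo"}'
print(len(greedy_tokenize(nl, VOCAB)))  # 9 tokens of natural language
print(len(greedy_tokenize(js, VOCAB)))  # 12 tokens, mostly braces and quotes
</code>

Even in this toy setting, the JSON form is shorter in characters yet costs more tokens, because every brace, quote, and colon becomes its own token.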
===== Token Counting and Context Budget =====

Approximate conversion rates:

  * **English text**: ~1 token per 4 characters, or ~0.75 words per token
  * **Code**: ~1 token per 3 characters (more syntax overhead)
  * **JSON**: ~1 token per 2-3 characters (high punctuation density)
  * **Non-Latin scripts**: often 1 token per 1-2 characters

**Budget management strategies**:

  - Count tokens precisely with the model's tokenizer before submission
  - Prioritize recent and relevant context; summarize older messages
  - Use compact tool schemas and concise system prompts
  - Reserve a token budget for the expected completion length
  - Implement a sliding window or summarization for long conversations

===== Code Example =====

<code python>
import tiktoken

# Initialize the tokenizer used by GPT-4
enc = tiktoken.get_encoding("cl100k_base")

# Tokenize text
text = "The Transformer architecture uses self-attention mechanisms."
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Token count: {len(tokens)}")
print(f"Decoded tokens: {[enc.decode([t]) for t in tokens]}")

# Compare token costs of different formats
examples = {
    "natural_language": "Search for weather in Tokyo",
    "json_format": '{"tool": "search", "args": {"query": "weather Tokyo"}}',
    "python_code": "result = search_api(query='weather Tokyo')",
}
for name, example in examples.items():
    token_count = len(enc.encode(example))
    ratio = len(example) / token_count
    print(f"{name}: {token_count} tokens ({ratio:.1f} chars/token)")

# Context budget calculator
def check_budget(messages, max_tokens=128000, reserve_completion=4096):
    """Return True if the messages plus a completion reserve fit in the window."""
    total = sum(len(enc.encode(m)) for m in messages)
    available = max_tokens - total - reserve_completion
    print(f"Used: {total}, Available: {available}, Reserve: {reserve_completion}")
    return available > 0
</code>

===== References =====

  * [[https://arxiv.org/abs/1508.07909|Sennrich et al. - Neural Machine Translation of Rare Words with Subword Units (BPE, 2016)]]
  * [[https://arxiv.org/abs/1808.06226|Kudo & Richardson - SentencePiece: A Simple and Language Independent Subword Tokenizer (2018)]]
  * [[https://github.com/openai/tiktoken|OpenAI tiktoken - GitHub repository]]
  * [[https://arxiv.org/abs/2404.06626|Rajaraman et al. - Toward a Theory of Tokenization in LLMs (2024)]]

===== See Also =====

  * [[transformer_architecture|Transformer Architecture]]
  * [[model_context_window|Model Context Window]]
  * [[inference_optimization|Inference Optimization]]