Tokenization

Tokenization is the process of converting raw text into discrete numerical tokens that language models can process. Modern LLMs use subword tokenization algorithms that balance vocabulary size against sequence length, enabling efficient handling of rare words, multilingual text, and structured formats like JSON and code.

How Subword Tokenization Works

Subword tokenization splits text into units between characters and full words. The key insight is that common words remain intact while rare words decompose into frequent subword pieces:

“understanding” might tokenize as: [“under”, “stand”, “ing”]
“unbelievable” might become: [“un”, “believ”, “able”]
Unknown words always decompose into known subwords or bytes, eliminating out-of-vocabulary errors

The process at inference time:

Normalize the input text (whitespace, unicode normalization)
Greedily match the longest vocabulary token from left to right
Fall back to smaller subwords or individual bytes for unmatched segments
Map each token to its integer ID from the vocabulary

Byte Pair Encoding (BPE)

BPE builds its vocabulary iteratively from a training corpus:

Step 1: Initialize vocabulary with all individual bytes (or characters)
Step 2: Count all adjacent token pairs in the corpus
Step 3: Merge the most frequent pair into a new token
Step 4: Repeat steps 2-3 for a target number of merges (e.g., 50,000)

Each merge creates a longer subword. The final vocabulary contains the base characters plus all merged tokens. BPE is used by GPT-2, GPT-3, GPT-4, and many other models.

SentencePiece

SentencePiece (Google) is a language-agnostic tokenizer that operates directly on raw Unicode text without requiring pre-tokenization or whitespace segmentation:

Treats the input as a raw byte stream, making it suitable for any language
Supports both BPE and Unigram algorithms
Unigram starts with a large vocabulary and prunes tokens with lowest loss impact
Used by Llama, T5, PaLM, and multilingual models
Handles whitespace as a special character (substituting spaces with a visible marker)

tiktoken

tiktoken is OpenAI's fast BPE tokenizer implementation written in Rust:

Implements the cl100k_base encoding used by GPT-4 (100K+ vocabulary)
Provides deterministic, reproducible tokenization
Designed for precise token counting before API calls

Vocabulary Sizes

Model	Tokenizer	Vocab Size	Context Length
GPT-2	BPE	50,257	1,024
GPT-4	BPE (tiktoken)	100,277	128K
Llama 2	SentencePiece	32,000	4,096
Llama 3	SentencePiece (BPE)	128,256	128K
BERT	WordPiece	30,522	512
Claude 3	BPE variant	~100K	200K

Impact on Agent Tool Use

Tokenization has significant implications for AI agents that generate structured outputs:

JSON tokens: Structured output like {“name”: “search”, “args”: {“query”: “test”}} consumes far more tokens than the equivalent natural language. Curly braces, colons, quotes, and keys each become separate tokens. A single JSON function call can cost 20-40 tokens.

Code tokens: Programming syntax (brackets, operators, indentation) tokenizes inefficiently. A line like import numpy as np becomes 5-8 tokens. Minified code saves tokens but reduces model comprehension.

Context budget pressure: Agents must fit system prompts, conversation history, tool definitions, and tool outputs within the context window. Token-inefficient formats like verbose JSON schemas consume budget that could hold more useful context.

Token Counting and Context Budget

Approximate conversion rates:

English text: ~1 token per 4 characters, or ~0.75 words per token
Code: ~1 token per 3 characters (more syntax overhead)
JSON: ~1 token per 2-3 characters (high punctuation density)
Non-Latin scripts: Often 1 token per 1-2 characters

Budget management strategies:

Count tokens precisely with the model's tokenizer before submission
Prioritize recent and relevant context; summarize older messages
Use compact tool schemas and concise system prompts
Reserve a token budget for the expected completion length
Implement sliding window or summarization for long conversations

Code Example

import tiktoken
 
# Initialize the tokenizer for GPT-4
enc = tiktoken.get_encoding("cl100k_base")
 
# Tokenize text
text = "The Transformer architecture uses self-attention mechanisms."
tokens = enc.encode(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Token count: {len(tokens)}")
print(f"Decoded tokens: {[enc.decode([t]) for t in tokens]}")
 
# Compare token costs of different formats
examples = {
    "natural_language": "Search for weather in Tokyo",
    "json_format": '{"tool": "search", "args": {"query": "weather Tokyo"}}',
    "python_code": "result = search_api(query='weather Tokyo')",
}
 
for name, example in examples.items():
    token_count = len(enc.encode(example))
    ratio = len(example) / token_count
    print(f"{name}: {token_count} tokens ({ratio:.1f} chars/token)")
 
# Context budget calculator
def check_budget(messages, max_tokens=128000, reserve_completion=4096):
    total = sum(len(enc.encode(m)) for m in messages)
    available = max_tokens - total - reserve_completion
    print(f"Used: {total}, Available: {available}, Reserve: {reserve_completion}")
    return available > 0

AI Agent Knowledge Base

Sidebar

Table of Contents

Tokenization

How Subword Tokenization Works

Byte Pair Encoding (BPE)

SentencePiece

tiktoken

Vocabulary Sizes

Impact on Agent Tool Use

Token Counting and Context Budget

Code Example

References

See Also

AI Agent Knowledge Base

User Tools

Site Tools

Sidebar

Table of Contents

Tokenization

How Subword Tokenization Works

Byte Pair Encoding (BPE)

SentencePiece

tiktoken

Vocabulary Sizes

Impact on Agent Tool Use

Token Counting and Context Budget

Code Example

References

See Also

Page Tools