====== How to Choose Chunk Size for RAG ======

Chunk size is the most underestimated hyperparameter in Retrieval-Augmented Generation. It silently determines what your LLM sees, how much retrieval costs, and how accurate answers are. This guide synthesizes published benchmark data and practical strategies for choosing optimal chunk sizes.(([[https://www.researchgate.net/publication/394594247_The_Effect_of_Chunk_Size_on_the_RAG_Performance|ResearchGate - Effect of Chunk Size on RAG Performance]]))

===== Why Chunk Size Matters =====

  * **Too small** — loses context, fragments meaning, increases retrieval noise
  * **Too large** — dilutes relevance, wastes context window, reduces precision
  * **Just right** — balances semantic completeness with retrieval specificity

In NVIDIA's 2024 benchmark across 7 strategies and 5 datasets, choosing the wrong chunking strategy reduced recall by up to 9%.(([[https://www.firecrawl.dev/blog/best-chunking-strategies-rag|Firecrawl - Best Chunking Strategies for RAG (Vecta 2026 Benchmark)]]))(([[https://stackoverflow.blog/2024/12/27/breaking-up-is-hard-to-do-chunking-in-rag-applications/|Stack Overflow - Chunking in RAG Applications]]))

===== Decision Tree =====

<code>
graph TD
    A[Start: What content type?] --> B{Structured with\nclear sections?}
    B -->|Yes| C{Pages or\nchapters?}
    B -->|No| D{Content type?}
    C -->|Pages| E[Page-Level Chunking]
    C -->|Sections| F[Recursive Splitting\n512 tokens]
    D -->|Code| G[AST or Function-Level\n+ Recursive fallback]
    D -->|Legal/Academic| E
    D -->|Technical Docs| F
    D -->|Conversational| H[Semantic Chunking]
    D -->|Mixed/Unknown| F
    F --> I{Need higher\naccuracy?}
    I -->|Yes| J[Multi-Scale Indexing\n100+200+500 tokens]
    I -->|No| K[Single scale\n512 tokens + 10-20pct overlap]
    style E fill:#4CAF50,color:#fff
    style F fill:#2196F3,color:#fff
    style G fill:#FF9800,color:#fff
    style H fill:#9C27B0,color:#fff
    style J fill:#E91E63,color:#fff
    style K fill:#2196F3,color:#fff
</code>

===== Chunking Strategies Compared =====

^ Strategy ^ How It Works ^ Benchmark Accuracy ^ Avg Chunk Size ^ Speed ^ Best For ^
| **Fixed-Size** | Split by token/character count | 13% (clinical baseline) | Exact target | Fastest | Prototyping only |
| **Recursive Character** | Split by sections, then paragraphs, then sentences | **69% (Vecta 2026)** | 400-512 tokens | Fast | General text (default) |
| **Page-Level** | Keep full pages intact | **0.648 recall (NVIDIA 2024)** | Page-native | Fast | PDFs, legal, academic |
| **Semantic** | Group by embedding similarity | 54% (Vecta) / 87% (adaptive) | Variable (43 tokens avg) | Slow | Topic-heavy content |
| **Sentence-Level** | Split at sentence boundaries | Moderate | ~66 chars avg | Fast | Conversational data |
| **Parent-Child** | Small chunks for retrieval, large parents for context | +5-10% over flat | Dual (128/512) | Medium | Complex documents |
| **Multi-Scale** | Index at multiple sizes, fuse results | +7-13% over single | Multiple | Slowest | Maximum accuracy |

//Sources: Vecta 2026 (50 papers), NVIDIA 2024 benchmark, MDPI 2025 clinical study, AI21 Labs 2026//(([[https://pmc.ncbi.nlm.nih.gov/articles/PMC12649634/|MDPI 2025 - Adaptive Chunking in Clinical RAG]]))

===== Optimal Sizes by Content Type =====

^ Content Type ^ Recommended Size ^ Overlap ^ Strategy ^ Notes ^
| **Technical documentation** | 512 tokens | 50-100 tokens (10-20%) | Recursive | Best general default |
| **Code** | Function-level (variable) | 0 (natural boundaries) | AST-aware / Recursive | Respect function and class boundaries |
| **Legal documents** | Page-level or 1024 tokens | 100-200 tokens | Page-level / Recursive | Preserve clause structure |
| **Academic papers** | 512-1024 tokens | 100 tokens | Recursive with section headers | Include metadata |
| **Conversational logs** | 256-512 tokens | 50 tokens | Semantic | Group by topic shifts |
| **Product descriptions** | 256 tokens | 25 tokens | Fixed or Recursive | Short, self-contained |
| **FAQ/Q&A pairs** | Per-question (natural) | 0 | Document-level | Each Q&A is one chunk |

===== Overlap Strategies =====

Overlap prevents information loss at chunk boundaries. The industry standard is **10-20% overlap**.

^ Overlap ^ Tokens (500-token chunk) ^ Benefit ^ Cost ^
| 0% | 0 tokens | Minimum storage | Breaks cross-boundary info |
| 10% | 50 tokens | Good balance | +10% storage |
| 20% | 100 tokens | Better boundary coverage | +20% storage |
| 50% | 250 tokens | Sliding window | +50% storage, diminishing returns |

**Rule of thumb**: use 10% overlap for clean structured text, 20% for dense unstructured text, and 0% for naturally bounded content (Q&A pairs, code functions).
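The overlap arithmetic in the table can be sketched as a simple sliding window. This is a minimal illustration assuming pre-tokenized input (a list of tokens); ''chunk_with_overlap'' is a hypothetical helper, not a library function:

```python
def chunk_with_overlap(tokens, chunk_size=500, overlap_pct=0.10):
    """Split a token list into fixed-size chunks, each sharing
    overlap_pct of its tokens with the previous chunk."""
    overlap = int(chunk_size * overlap_pct)  # 10% of 500 -> 50 tokens
    step = chunk_size - overlap              # advance 450 tokens per chunk
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

# A 1000-token document at 500-token chunks with 10% overlap yields
# 3 chunks, and adjacent chunks share 50 boundary tokens.
doc = list(range(1000))
chunks = chunk_with_overlap(doc)
print(len(chunks), chunks[0][-50:] == chunks[1][:50])  # 3 True
```

Setting ''overlap_pct=0'' makes the step equal to the chunk size, which matches the recommendation for naturally bounded content.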
===== Chunk Size vs Query Type =====

Research from NVIDIA (2024) and AI21 Labs (2026) shows that optimal chunk size depends on query type:

^ Query Type ^ Optimal Size ^ Rationale ^
| Factoid ("What is X?") | 256-512 tokens | Precise, focused retrieval |
| Analytical ("Compare X and Y") | 1024+ tokens | Needs broader context |
| Summarization ("Summarize this doc") | Page-level / 2048 tokens | Needs comprehensive view |
| Code search ("How to implement X") | Function-level | Natural semantic boundary |

AI21 Labs demonstrated that **multi-scale indexing** (100, 200, 500 tokens with Reciprocal Rank Fusion) outperforms any single chunk size because different queries need different granularity.(([[https://www.ai21.com/blog/query-dependent-chunking/|AI21 Labs - Query-Dependent Chunking: Multi-Scale Approach]]))

===== Implementation Example =====

<code python>
from collections import defaultdict

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Default: recursive splitting at 512 units with 10% overlap.
# Note: with length_function=len, sizes are measured in characters, not
# tokens; use RecursiveCharacterTextSplitter.from_tiktoken_encoder if you
# need true token counts.
def create_splitter(content_type="general"):
    configs = {
        "general": {"chunk_size": 512, "chunk_overlap": 50},
        "code": {"chunk_size": 1000, "chunk_overlap": 0,
                 "separators": ["\nclass ", "\ndef ", "\n\n", "\n"]},
        "legal": {"chunk_size": 1024, "chunk_overlap": 200},
        "conversational": {"chunk_size": 256, "chunk_overlap": 50},
        "faq": {"chunk_size": 300, "chunk_overlap": 0, "separators": ["\n\n"]},
    }
    config = configs.get(content_type, configs["general"])
    return RecursiveCharacterTextSplitter(
        chunk_size=config["chunk_size"],
        chunk_overlap=config["chunk_overlap"],
        separators=config.get("separators", ["\n\n", "\n", ". ", " "]),
        length_function=len,
    )

# Multi-scale indexing for maximum accuracy:
# index each scale separately, fuse results at query time.
def multi_scale_index(documents, sizes=[100, 256, 512]):
    all_chunks = defaultdict(list)
    for size in sizes:
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=size, chunk_overlap=int(size * 0.1)
        )
        for doc in documents:
            all_chunks[size].extend(splitter.split_text(doc))
    return all_chunks

# Parent-child chunking: search on the small child chunks,
# return the larger parent chunk as LLM context.
def parent_child_chunks(documents, parent_size=1024, child_size=256):
    parent_splitter = RecursiveCharacterTextSplitter(chunk_size=parent_size)
    child_splitter = RecursiveCharacterTextSplitter(chunk_size=child_size)
    index = []
    for doc in documents:
        for parent in parent_splitter.split_text(doc):
            for child in child_splitter.split_text(parent):
                index.append({"child": child, "parent": parent})
    return index
</code>

===== Evaluation Methodology =====

Always benchmark chunk sizes on your own data. Key metrics:

  * **Retrieval Recall@k** — does the correct chunk appear in the top k results?
  * **Answer Accuracy** — does the final LLM answer match ground truth?
  * **Latency** — time from query to retrieved chunks
  * **Cost** — embedding tokens + storage + retrieval compute

<code python>
# Simple chunk size evaluation.
# build_vectorstore is a placeholder for your own embedding + vector DB setup.
def evaluate_chunk_sizes(documents, test_queries, ground_truth,
                         sizes=[256, 512, 1024]):
    results = {}
    for size in sizes:
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=size, chunk_overlap=int(size * 0.1)
        )
        chunks = splitter.split_documents(documents)
        vectorstore = build_vectorstore(chunks)
        correct = 0
        for query, expected in zip(test_queries, ground_truth):
            retrieved = vectorstore.similarity_search(query, k=5)
            if any(expected in r.page_content for r in retrieved):
                correct += 1
        results[size] = correct / len(test_queries)
    return results  # e.g. {256: 0.72, 512: 0.81, 1024: 0.76}
</code>

===== Key Takeaways =====

  - **Start with 512 tokens, recursive splitting, 10% overlap.** This is the best default backed by benchmarks.
  - **Content type matters more than a magic number.** Code, legal, and conversational content each need different strategies.
  - **Multi-scale indexing** gives the best accuracy (+7-13%) but at higher storage and complexity cost.
  - **Parent-child chunking** is the best balance of precision and context for complex documents.
  - **Always benchmark on your data.** Published numbers are starting points, not guarantees.
  - **Query type affects optimal size**: factoid queries want small chunks, analytical queries want large ones.

===== See Also =====

  * [[when_to_use_rag_vs_fine_tuning|When to Use RAG vs Fine-Tuning]] — choose the right knowledge approach
  * [[how_to_structure_system_prompts|How to Structure System Prompts]] — optimize prompts for RAG systems
  * [[single_vs_multi_agent|Single vs Multi-Agent Architectures]] — agent design patterns

===== References =====