Haystack Engineering

Haystack Engineering is a context engineering methodology for constructing realistic long-context evaluation benchmarks that capture the noise, heterogeneity, and cascading errors encountered in real-world agentic workflows. Introduced by Li et al. from Georgia Tech and Meta (2025), it addresses the gap between synthetic needle-in-a-haystack (NIAH) tests and the messy reality of retrieval-augmented and agentic systems through the HaystackCraft benchmark.

Motivation

Modern long-context LLMs perform impressively on standard NIAH benchmarks, where a single fact is embedded in clean, uniform padding text. However, these benchmarks fail to capture two critical real-world phenomena:

- Heterogeneous distractors. In deployed retrieval-augmented systems, the haystack is assembled by a retriever, so the composition of distractors depends on the retrieval strategy rather than on neutral padding text.
- Dynamic, self-generated context. In agentic workflows, the model's own queries, retrieved documents, and reflections feed back into its context, so errors can cascade across steps.

Haystack engineering argues that how the haystack is constructed matters as much as where the needle is placed.

The HaystackCraft Benchmark

HaystackCraft is a NIAH benchmark built on the full English Wikipedia hyperlink network with multi-hop questions. It differs from prior benchmarks in several key ways:

Heterogeneous Retrieval Strategies. HaystackCraft evaluates how different retrieval methods affect distractor composition and downstream LLM performance:

- Sparse retrieval (lexical matching, e.g., BM25)
- Dense retrieval (embedding-based semantic search)
- Hybrid retrieval (combining sparse and dense signals)
- Graph-based reranking over the Wikipedia hyperlink network

Each strategy produces qualitatively different distractors, and stronger dense retrievers can paradoxically introduce more challenging distractors because semantically similar passages are harder to distinguish from relevant ones.
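To make this concrete, here is a toy sketch of why a dense retriever surfaces harder distractors (the 2-D embeddings and passage labels below are invented for illustration; HaystackCraft uses real retrievers over Wikipedia):

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 2-D embeddings standing in for a real dense encoder's output.
query_emb = (1.0, 0.0)
distractor_embs = {
    "lexical-overlap passage": (0.20, 0.98),  # shares surface terms, different topic
    "same-topic passage": (0.95, 0.31),       # few shared terms, semantically close
}

# A dense retriever ranks by embedding similarity, so the semantically
# close (and hence hardest-to-reject) passage rises to the top.
ranked = sorted(distractor_embs,
                key=lambda d: cosine(query_emb, distractor_embs[d]),
                reverse=True)
```

The passage that a dense retriever ranks highest is precisely the one a downstream reader will find hardest to distinguish from gold evidence.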

Agentic Evaluation Mode. HaystackCraft extends NIAH to dynamic, LLM-dependent settings that simulate agentic operations: the model refines its own queries, accumulates retrieved documents across rounds, appends its own reflections to the context, and decides when to stop searching.

# Simplified HaystackCraft evaluation pipeline
class HaystackCraftEval:
    def __init__(self, wiki_graph, retriever_suite):
        self.graph = wiki_graph
        self.retrievers = retriever_suite  # sparse, dense, hybrid, graph

    def build_haystack(self, question, retriever_type, context_len):
        # Assemble gold evidence plus retriever-specific distractors
        gold_docs = self.graph.get_evidence_chain(question)
        distractors = self.retrievers[retriever_type].search(
            question, k=context_len - len(gold_docs)
        )
        # interleave: helper that places gold evidence among the distractors
        return interleave(gold_docs, distractors)

    def agentic_eval(self, model, question, max_steps=8):
        # Dynamic agentic evaluation with self-generated context
        context = []
        for step in range(max_steps):
            query = model.refine_query(question, context)
            new_docs = self.retrievers['hybrid'].search(query)
            context.extend(new_docs)
            reflection = model.reflect(question, context)
            context.append(reflection)  # Self-generated distractor risk
            if model.should_stop(question, context):
                break
        return model.answer(question, context)
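The `interleave` helper above is left unspecified; a minimal round-robin version might look like the following (the actual placement policy in HaystackCraft may differ):

```python
from itertools import zip_longest

def interleave(gold_docs, distractors):
    # Round-robin merge: spread gold evidence among the distractors so the
    # needle is not trivially clustered at one end of the context.
    merged = []
    for g, d in zip_longest(gold_docs, distractors):
        if g is not None:
            merged.append(g)
        if d is not None:
            merged.append(d)
    return merged

haystack = interleave(["gold1", "gold2"], ["dx1", "dx2", "dx3"])
# haystack == ["gold1", "dx1", "gold2", "dx2", "dx3"]
```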

Key Findings

Experiments with 15 long-context models reveal several important findings:

Finding 1: Dense retrievers create harder distractors. Stronger dense retrievers improve recall but simultaneously introduce more semantically confusing distractors. Among the strategies tested, graph-based reranking is the exception: it improves retrieval effectiveness while also mitigating harmful distractors.

Finding 2: Agentic settings amplify failures. Even advanced models like Gemini 2.5 Pro and GPT-5 suffer cascading failures from self-generated distractors. The model's own reflections and query refinements can introduce errors that compound across steps.

Finding 3: Early stopping is hard. Models struggle to determine when they have enough information, often continuing to search and accumulating more distractors that degrade performance.

Finding 4: Haystack ordering matters. The position and arrangement of relevant documents among distractors significantly affects performance, even for models claiming robust long-context capabilities.
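A position sweep of this kind can be sketched as follows (the `position` parameter and helper are illustrative, not the paper's API): the same question is evaluated over several haystacks that differ only in where the gold evidence sits.

```python
def place_gold(gold_docs, distractors, position):
    # position in [0, 1]: 0.0 places gold at the start of the haystack,
    # 1.0 at the end, intermediate values at a proportional depth.
    idx = round(position * len(distractors))
    return distractors[:idx] + gold_docs + distractors[idx:]

distractors = [f"d{i}" for i in range(4)]
variants = {p: place_gold(["gold"], distractors, p) for p in (0.0, 0.5, 1.0)}
# Scoring a model on each variant isolates the effect of ordering alone.
```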

Context Engineering Framework

Haystack engineering provides a principled framework for thinking about context construction:

$$\text{Performance} = f(\text{needle\_difficulty}, \text{haystack\_composition}, \text{retriever\_bias}, \text{agentic\_cascades})$$

Traditional NIAH benchmarks vary only needle difficulty and position. HaystackCraft demonstrates that the other three factors are equally important, and that in agentic settings cascading errors can dominate all of the others.

Implications

References

See Also