====== Haystack Engineering ======

**Haystack Engineering** is a context engineering methodology for constructing realistic long-context evaluation benchmarks that capture the noise, heterogeneity, and cascading errors encountered in real-world agentic workflows. Introduced by Li et al. from Georgia Tech and Meta (2025), it addresses the gap between synthetic needle-in-a-haystack (NIAH) tests and the messy reality of retrieval-augmented and agentic systems through the HaystackCraft benchmark.

===== Motivation =====

Modern long-context LLMs perform impressively on standard NIAH benchmarks, where a single fact is embedded in clean, uniform padding text. However, these benchmarks fail to capture two critical real-world phenomena:

* **Heterogeneous distractors.** Real retrieval systems return noisy, topically related but incorrect passages that are far more confusing than random padding text.
* **Cascading errors.** In agentic workflows, the model's own previous outputs become part of future context, creating self-reinforcing error loops.

Haystack engineering argues that //how// the haystack is constructed matters as much as //where// the needle is placed.

===== The HaystackCraft Benchmark =====

HaystackCraft is a NIAH benchmark built on the **full English Wikipedia hyperlink network** with multi-hop questions.
It differs from prior benchmarks in several key ways:

**Heterogeneous Retrieval Strategies.** HaystackCraft evaluates how different retrieval methods affect distractor composition and downstream LLM performance:

* **Sparse retrieval** (e.g., BM25) -- keyword-based matching
* **Dense retrieval** -- embedding-based semantic similarity
* **Hybrid retrieval** -- combining sparse and dense signals
* **Graph-based retrieval** -- exploiting Wikipedia's hyperlink structure

Each strategy produces qualitatively different distractors, and stronger dense retrievers can paradoxically introduce //more challenging// distractors, because semantically similar passages are harder to distinguish from the genuinely relevant ones.

**Agentic Evaluation Mode.** HaystackCraft extends NIAH to dynamic, LLM-dependent settings that simulate agentic operation:

* Models refine their own retrieval queries based on initial results.
* Models reflect on past reasoning steps and incorporate the reflections into context.
* Models must decide when to stop searching and commit to an answer.

<code python>
# Simplified HaystackCraft evaluation pipeline

def interleave(gold_docs, distractors):
    # Spread gold documents evenly through the distractor sequence
    haystack = list(distractors)
    stride = max(1, len(haystack) // (len(gold_docs) + 1))
    for i, doc in enumerate(gold_docs, start=1):
        haystack.insert(i * stride, doc)
    return haystack

class HaystackCraftEval:
    def __init__(self, wiki_graph, retriever_suite, max_steps=5):
        self.graph = wiki_graph
        self.retrievers = retriever_suite  # sparse, dense, hybrid, graph
        self.max_steps = max_steps

    def build_haystack(self, question, retriever_type, context_len):
        # Mix gold evidence with heterogeneous distractors from the chosen retriever
        gold_docs = self.graph.get_evidence_chain(question)
        distractors = self.retrievers[retriever_type].search(
            question, k=context_len - len(gold_docs)
        )
        return interleave(gold_docs, distractors)

    def agentic_eval(self, model, question):
        # Dynamic agentic evaluation with self-generated context
        context = []
        for step in range(self.max_steps):
            query = model.refine_query(question, context)
            new_docs = self.retrievers['hybrid'].search(query)
            context.extend(new_docs)
            reflection = model.reflect(question, context)
            context.append(reflection)  # self-generated distractor risk
            if model.should_stop(question, context):
                break
        return model.answer(question, context)
</code>
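The hybrid strategy above can be sketched with reciprocal rank fusion (RRF), one common way to merge sparse and dense rankings into a single score; the function and document names here are illustrative assumptions, not HaystackCraft's actual API:

```python
# Reciprocal rank fusion: each retriever contributes 1 / (k + rank)
# per document, so items ranked highly by several retrievers win.
def rrf_fuse(rankings, k=60):
    """rankings: list of ranked doc-id lists, best first; k is the usual RRF constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Example: BM25 and a dense retriever disagree on ordering.
sparse_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = rrf_fuse([sparse_ranking, dense_ranking])
```

Documents favored by both retrievers ("d1", "d3") float to the top of the fused list, which is exactly why hybrid haystacks contain distractors that both look lexically and semantically plausible.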
===== Key Findings =====

Experiments with 15 long-context models reveal several important findings:

**Finding 1: Dense retrievers create harder distractors.** Stronger dense retrievers improve recall but simultaneously introduce more semantically confusing distractors. Graph-based reranking is unique in improving retrieval effectiveness while also mitigating harmful distractors.

**Finding 2: Agentic settings amplify failures.** Even advanced models such as Gemini 2.5 Pro and GPT-5 suffer cascading failures from self-generated distractors: a model's own reflections and query refinements can introduce errors that compound across steps.

**Finding 3: Early stopping is hard.** Models struggle to determine when they have enough information, often continuing to search and accumulating distractors that degrade performance.

**Finding 4: Haystack ordering matters.** The position and arrangement of relevant documents among distractors significantly affect performance, even for models claiming robust long-context capabilities.

===== Context Engineering Framework =====

Haystack engineering provides a principled framework for thinking about context construction:

$$\text{Performance} = f(\text{needle\_difficulty}, \text{haystack\_composition}, \text{retriever\_bias}, \text{agentic\_cascades})$$

Traditional NIAH benchmarks vary only needle difficulty and position. HaystackCraft demonstrates that the other three factors are equally important -- and in agentic settings, cascading errors can dominate all of them.
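The ordering sensitivity in Finding 4 can be probed by building otherwise-identical haystacks that differ only in where the gold evidence sits. A minimal sketch, with illustrative helper and document names (not from the benchmark's codebase):

```python
# Construct haystacks that differ only in gold-document position,
# so any performance gap is attributable to ordering alone.
def place_gold(gold_docs, distractors, position):
    """position: fraction in [0, 1] -- 0 puts gold at the front, 1 at the back."""
    idx = round(position * len(distractors))
    return distractors[:idx] + gold_docs + distractors[idx:]

distractors = [f"distractor_{i}" for i in range(8)]
gold = ["gold_a", "gold_b"]

front = place_gold(gold, distractors, 0.0)   # gold first
middle = place_gold(gold, distractors, 0.5)  # gold mid-context
back = place_gold(gold, distractors, 1.0)    # gold last
```

Evaluating the same model on `front`, `middle`, and `back` variants isolates positional robustness from retrieval quality, the kind of controlled comparison the finding rests on.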
===== Implications =====

* Standard NIAH benchmarks **overestimate** real-world long-context capabilities.
* Retrieval system choice affects not just recall but the **difficulty of the reasoning task** presented to the LLM.
* Agentic systems must be evaluated with **self-generated context** in the loop, not just externally provided context.
* Context engineering is a first-class concern for building robust agentic systems.

===== References =====

* [[https://arxiv.org/abs/2510.07414|Li et al. (2025). Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation. arXiv:2510.07414]]

===== See Also =====

* [[needle_in_haystack|Needle-in-a-Haystack Evaluation]]
* [[retrieval_augmented_generation|Retrieval-Augmented Generation]]
* [[long_context_models|Long-Context Language Models]]
* [[agentic_evaluation|Agentic Evaluation Methods]]