====== Haystack Engineering ======

**Haystack Engineering** is a context engineering methodology for constructing realistic long-context evaluation benchmarks that capture the noise, heterogeneity, and cascading errors encountered in real-world agentic workflows. Introduced by Li et al. from Georgia Tech and Meta (2025), it addresses the gap between synthetic needle-in-a-haystack (NIAH) tests and the messy reality of retrieval-augmented and agentic systems through the HaystackCraft benchmark.

===== Motivation =====

Modern long-context LLMs perform impressively on standard NIAH benchmarks, where a single fact is embedded in clean, uniform padding text. However, these benchmarks fail to capture two critical real-world phenomena:

* **Heterogeneous distractors.** Real retrieval systems return noisy, topically related but incorrect passages that are far more confusing than random padding text.
* **Cascading errors.** In agentic workflows, the model's own previous outputs become part of future context, creating self-reinforcing error loops.

Haystack engineering argues that //how// the haystack is constructed matters as much as //where// the needle is placed.

===== The HaystackCraft Benchmark =====

HaystackCraft is a NIAH benchmark built on the **full English Wikipedia hyperlink network** with multi-hop questions.
It differs from prior benchmarks in several key ways:

**Heterogeneous Retrieval Strategies.** HaystackCraft evaluates how different retrieval methods affect distractor composition and downstream LLM performance:

* **Sparse retrieval** (e.g., BM25) -- keyword-based matching
* **Dense retrieval** -- embedding-based semantic similarity
* **Hybrid retrieval** -- combining sparse and dense signals
* **Graph-based retrieval** -- exploiting Wikipedia's hyperlink structure

Each strategy produces qualitatively different distractors, and stronger dense retrievers can paradoxically introduce //more challenging// distractors, because semantically similar passages are harder to distinguish from the genuinely relevant ones.

**Agentic Evaluation Mode.** HaystackCraft extends NIAH to dynamic, LLM-dependent settings that simulate agentic operation:

* Models refine their own retrieval queries based on initial results.
* Models reflect on past reasoning steps and incorporate the reflections into context.
* Models must decide when to stop searching and commit to an answer.

<code python>
# Simplified HaystackCraft evaluation pipeline

def interleave(gold_docs, distractors):
    # Spread gold documents evenly through the distractor sequence
    haystack = list(distractors)
    stride = max(1, len(haystack) // (len(gold_docs) + 1))
    for i, doc in enumerate(gold_docs, start=1):
        haystack.insert(i * stride, doc)
    return haystack

class HaystackCraftEval:
    def __init__(self, wiki_graph, retriever_suite, max_steps=5):
        self.graph = wiki_graph
        self.retrievers = retriever_suite  # sparse, dense, hybrid, graph
        self.max_steps = max_steps

    def build_haystack(self, question, retriever_type, context_len):
        # Mix gold evidence with heterogeneous distractors from the chosen retriever
        gold_docs = self.graph.get_evidence_chain(question)
        distractors = self.retrievers[retriever_type].search(
            question, k=context_len - len(gold_docs)
        )
        return interleave(gold_docs, distractors)

    def agentic_eval(self, model, question):
        # Dynamic agentic evaluation with self-generated context
        context = []
        for step in range(self.max_steps):
            query = model.refine_query(question, context)
            new_docs = self.retrievers['hybrid'].search(query)
            context.extend(new_docs)
            reflection = model.reflect(question, context)
            context.append(reflection)  # self-generated distractor risk
            if model.should_stop(question, context):
                break
        return model.answer(question, context)
</code>
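The hybrid strategy above can be sketched with reciprocal rank fusion (RRF), one common way to merge sparse and dense rankings into a single score; the function and document names here are illustrative assumptions, not HaystackCraft's actual API:

```python
# Reciprocal rank fusion: each retriever contributes 1 / (k + rank)
# per document, so items ranked highly by several retrievers win.
def rrf_fuse(rankings, k=60):
    """rankings: list of ranked doc-id lists, best first; k is the usual RRF constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Example: BM25 and a dense retriever disagree on ordering.
sparse_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = rrf_fuse([sparse_ranking, dense_ranking])
```

Documents favored by both retrievers ("d1", "d3") float to the top of the fused list, which is exactly why hybrid haystacks contain distractors that both look lexically and semantically plausible.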
===== Key Findings =====

Experiments with 15 long-context models reveal several important findings:

**Finding 1: Dense retrievers create harder distractors.** Stronger dense retrievers improve recall but simultaneously introduce more semantically confusing distractors. Graph-based reranking is unique in improving retrieval effectiveness while also mitigating harmful distractors.

**Finding 2: Agentic settings amplify failures.** Even advanced models such as Gemini 2.5 Pro and GPT-5 suffer cascading failures from self-generated distractors: a model's own reflections and query refinements can introduce errors that compound across steps.

**Finding 3: Early stopping is hard.** Models struggle to determine when they have enough information, often continuing to search and accumulating distractors that degrade performance.

**Finding 4: Haystack ordering matters.** The position and arrangement of relevant documents among distractors significantly affect performance, even for models claiming robust long-context capabilities.

===== Context Engineering Framework =====

Haystack engineering provides a principled framework for thinking about context construction:

$$\text{Performance} = f(\text{needle\_difficulty}, \text{haystack\_composition}, \text{retriever\_bias}, \text{agentic\_cascades})$$

Traditional NIAH benchmarks vary only needle difficulty and position. HaystackCraft demonstrates that the other three factors are equally important -- and in agentic settings, cascading errors can dominate all of them.
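The ordering sensitivity in Finding 4 can be probed by building otherwise-identical haystacks that differ only in where the gold evidence sits. A minimal sketch, with illustrative helper and document names (not from the benchmark's codebase):

```python
# Construct haystacks that differ only in gold-document position,
# so any performance gap is attributable to ordering alone.
def place_gold(gold_docs, distractors, position):
    """position: fraction in [0, 1] -- 0 puts gold at the front, 1 at the back."""
    idx = round(position * len(distractors))
    return distractors[:idx] + gold_docs + distractors[idx:]

distractors = [f"distractor_{i}" for i in range(8)]
gold = ["gold_a", "gold_b"]

front = place_gold(gold, distractors, 0.0)   # gold first
middle = place_gold(gold, distractors, 0.5)  # gold mid-context
back = place_gold(gold, distractors, 1.0)    # gold last
```

Evaluating the same model on `front`, `middle`, and `back` variants isolates positional robustness from retrieval quality, the kind of controlled comparison the finding rests on.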
===== Implications =====

* Standard NIAH benchmarks **overestimate** real-world long-context capabilities.
* Retrieval system choice affects not just recall but the **difficulty of the reasoning task** presented to the LLM.
* Agentic systems must be evaluated with **self-generated context** in the loop, not just externally provided context.
* Context engineering is a first-class concern for building robust agentic systems.

===== References =====

* [[https://arxiv.org/abs/2510.07414|Li et al. (2025). Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation. arXiv:2510.07414]]

===== See Also =====

* [[needle_in_haystack|Needle-in-a-Haystack Evaluation]]
* [[retrieval_augmented_generation|Retrieval-Augmented Generation]]
* [[long_context_models|Long-Context Language Models]]
* [[agentic_evaluation|Agentic Evaluation Methods]]