Approximate Nearest Neighbors (ANN) refers to a class of algorithms that trade exactness for speed when searching for the closest points in high-dimensional vector spaces. Rather than exhaustively comparing a query against every item in a database (exact $k$-NN, which scales linearly), ANN methods use indexing structures to achieve sub-linear query times while returning results that are close to the true nearest neighbors with high probability. ANN is the enabling technology behind vector search in AI agent long-term memory, recommendation systems, and retrieval-augmented generation.
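To make the baseline concrete, here is a minimal sketch of the exact, linear-scan $k$-NN that ANN methods approximate; the dataset size, dimensionality, and use of squared Euclidean distance are illustrative assumptions, not anything prescribed by a particular library.

```python
import numpy as np

def exact_knn(database: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Brute-force k-NN: score every database vector, keep the k closest."""
    # Squared Euclidean distance from the query to all N vectors -- cost grows linearly with N.
    dists = np.sum((database - query) ** 2, axis=1)
    # argpartition finds the k smallest without a full sort; then order just those k.
    idx = np.argpartition(dists, k)[:k]
    return idx[np.argsort(dists[idx])]

# Illustrative usage: 100k random 128-d vectors, one query.
rng = np.random.default_rng(0)
db = rng.standard_normal((100_000, 128)).astype(np.float32)
q = rng.standard_normal(128).astype(np.float32)
print(exact_knn(db, q, k=10))
```

Every ANN method described below replaces this exhaustive scan with an index structure that examines only a small fraction of the database per query.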
The ANN field encompasses several algorithmic families, each with different performance characteristics:
Graph-Based Methods build proximity graphs where nodes are database vectors and edges connect nearby points. Search proceeds by greedy navigation through the graph. HNSW (Malkov and Yashunin, 2016) is the dominant method, achieving $O(\log N)$ query complexity with the best speed-recall tradeoffs in most benchmarks.1) DiskANN (Subramanya et al., NeurIPS 2019) extends graph search to disk-resident datasets, enabling billion-scale search without loading the full index into RAM.2)
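As an illustration of graph-based search in practice, here is a minimal sketch using the hnswlib package; the parameter values (M=16, ef_construction=200, ef=64) are illustrative assumptions rather than recommendations from the cited papers.

```python
import numpy as np
import hnswlib

dim, num_elements = 128, 100_000
data = np.random.random((num_elements, dim)).astype(np.float32)

# Build the HNSW graph: M bounds the number of edges per node,
# ef_construction is the beam width used while inserting points.
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

# ef is the query-time beam width: larger ef -> higher recall, slower queries.
index.set_ef(64)
labels, distances = index.knn_query(data[:5], k=10)  # 10 approximate neighbors per query vector
```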
Quantization-Based Methods compress vectors to reduce memory and enable fast approximate distance computation. FAISS (Johnson et al., 2017) implements product quantization (PQ) and inverted file indexes (IVF).3) ScaNN (Guo et al., 2020) uses anisotropic vector quantization optimized for inner product search.4) These methods excel when memory is constrained.
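A minimal FAISS sketch combining an inverted file (IVF) with product quantization (PQ) is shown below; the nlist, m, nbits, and nprobe values are illustrative and would normally be tuned per dataset and hardware.

```python
import numpy as np
import faiss

d, nb = 128, 100_000
xb = np.random.random((nb, d)).astype(np.float32)

nlist, m, nbits = 1024, 16, 8            # 1024 coarse cells; 16 PQ sub-vectors of 8 bits each
quantizer = faiss.IndexFlatL2(d)         # coarse quantizer that assigns vectors to IVF cells
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xb)                          # learn coarse centroids and PQ codebooks
index.add(xb)                            # compress and store all database vectors

index.nprobe = 8                         # number of IVF cells scanned per query
xq = np.random.random((5, d)).astype(np.float32)
D, I = index.search(xq, 10)              # approximate distances and vector ids
```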
Tree-Based Methods partition the vector space using hierarchical tree structures. ANNOY (Approximate Nearest Neighbors Oh Yeah), developed by Spotify, builds forests of random projection trees. Each tree splits the space along random hyperplanes, and search traverses multiple trees to collect candidates. ANNOY is particularly well-suited for read-heavy, static workloads.
Hash-Based Methods use locality-sensitive hashing to group similar vectors into the same buckets with probability $P[h(\mathbf{x}) = h(\mathbf{y})]$ that increases with similarity. LSH provides provable approximation guarantees but typically achieves lower recall than graph or quantization methods at equivalent speed.
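To make the collision probability concrete, here is a minimal sketch of random-hyperplane (SimHash-style) LSH for cosine similarity, where two vectors agree on a single hash bit with probability $1 - \theta/\pi$ for angle $\theta$ between them; the class name, bit count, and use of a single hash table are illustrative assumptions (a practical LSH index would use many tables to raise recall).

```python
import numpy as np

class RandomHyperplaneLSH:
    """Sign-of-random-projection LSH for cosine similarity (minimal single-table sketch)."""

    def __init__(self, dim: int, num_bits: int = 16, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Each row is the normal of a random hyperplane; a vector's hash is its sign pattern.
        self.planes = rng.standard_normal((num_bits, dim))
        self.buckets: dict[int, list[int]] = {}
        self.vectors: list[np.ndarray] = []

    def _hash(self, v: np.ndarray) -> int:
        bits = (self.planes @ v) > 0
        return int(sum(int(b) << i for i, b in enumerate(bits)))

    def add(self, v: np.ndarray) -> None:
        self.vectors.append(v)
        self.buckets.setdefault(self._hash(v), []).append(len(self.vectors) - 1)

    def query(self, q: np.ndarray, k: int) -> list[int]:
        # Candidates share the query's bucket; re-rank them by exact cosine similarity.
        candidates = self.buckets.get(self._hash(q), [])
        sims = [float(self.vectors[i] @ q) /
                (np.linalg.norm(self.vectors[i]) * np.linalg.norm(q) + 1e-12)
                for i in candidates]
        order = np.argsort(sims)[::-1][:k]
        return [candidates[i] for i in order]
```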
ANNOY, created by Erik Bernhardsson at Spotify, uses an elegant approach: build multiple random projection trees that recursively bisect the vector space along random hyperplanes. At query time, the search descends each tree, takes the union of the candidate leaves, and then computes exact distances on the candidate set.
Index Construction. Each tree is built by: (1) picking two random points from the dataset, (2) splitting all points by which of the two they are closer to, and (3) recursing on each half until leaves contain fewer than a threshold number of points. Building $n_{\text{trees}}$ trees provides redundancy.
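The split rule in steps (1)-(3) can be sketched in a few lines of Python; this is a simplified illustration of the partitioning idea (the actual ANNOY implementation is C++ and stores the splitting hyperplanes explicitly), and the leaf-size threshold is an arbitrary choice.

```python
import random
import numpy as np

def build_tree(points: np.ndarray, ids: list[int], leaf_size: int = 10) -> dict:
    """Recursively split ids by which of two random pivot points each is closer to."""
    if len(ids) <= leaf_size:
        return {"leaf": ids}                                  # small enough: store ids in a leaf
    a, b = random.sample(ids, 2)                              # (1) pick two random points
    dist_a = np.linalg.norm(points[ids] - points[a], axis=1)
    dist_b = np.linalg.norm(points[ids] - points[b], axis=1)
    left = [i for i, da, db in zip(ids, dist_a, dist_b) if da <= db]   # (2) closer to a
    right = [i for i, da, db in zip(ids, dist_a, dist_b) if da > db]   #     closer to b
    if not left or not right:                                 # degenerate split: stop here
        return {"leaf": ids}
    return {"pivots": (a, b),                                 # (3) recurse on each half
            "left": build_tree(points, left, leaf_size),
            "right": build_tree(points, right, leaf_size)}

# A forest of n_trees such trees provides the redundancy described above, e.g.:
# forest = [build_tree(points, list(range(len(points)))) for _ in range(50)]
```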
Query. The query descends all trees, collecting leaf nodes. The search_k parameter controls how many candidates to examine (more candidates = higher recall, slower search). Candidates are re-ranked by exact distance.
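Putting construction and query together, the sketch below uses the annoy Python package; the dimensionality, metric, tree count, and search_k value are illustrative choices rather than library recommendations.

```python
import random
from annoy import AnnoyIndex

dim = 64
index = AnnoyIndex(dim, "angular")            # "angular" = cosine-style distance

# Index construction: add all items, then build the forest of random projection trees.
for i in range(10_000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])
index.build(50)                               # 50 trees; more trees -> higher recall, larger index
index.save("items.ann")                       # saved indexes are memory-mapped on load

# Query: descend the trees, gather candidate leaves, re-rank candidates by exact distance.
query = [random.gauss(0, 1) for _ in range(dim)]
neighbors = index.get_nns_by_vector(query, 10, search_k=5_000)
```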
Key Properties:
- Indexes are built once and are immutable afterward; adding items requires a full rebuild.
- A built index is saved to disk and memory-mapped on load, so multiple processes can share one index with minimal RAM overhead.
- The main tuning knobs are $n_{\text{trees}}$ (build time, index size, and accuracy) and search_k (query-time recall vs. latency).
ANNOY remains popular for static corpora and memory-constrained environments (embedded systems, mobile applications), but trails HNSW and ScaNN in recall on large-scale benchmarks.
The ann-benchmarks project (Aumüller et al., updated 2025) provides standardized comparisons across datasets (SIFT1M, GIST1M, GloVe, Deep1B) and metrics (recall@10 vs. QPS). Key findings:
| Library | Speed (QPS) | Memory | Recall | Best For |
|---|---|---|---|---|
| ANNOY | High on small/static | Low | Medium (90-95%) | Static corpora, low RAM, shared indexes |
| HNSW | Very high | High (graph edges) | High (95-99%+) | Dynamic data, high-recall retrieval |
| FAISS (IVF+PQ) | High (GPU boost) | Medium (quantized) | High (tunable) | Billion-scale, GPU-available |
| ScaNN | High | Medium | Very high | CPU-bound inner product search |
| DiskANN | Moderate | Very low (disk) | High | Billion-scale, RAM-constrained |
No single algorithm dominates across all settings. Performance depends on dataset scale, dimensionality, query distribution, available hardware, and whether the index is static or dynamic.
Filtered ANN Search (FANNS) has emerged as a major focus in 2025. Real-world queries often include attribute filters (e.g., “find similar products in category X with price < $50”). The FANNS benchmark (2025) evaluates methods including Filtered-DiskANN, ACORN, and SeRF, finding that pre-filtering is efficient at low selectivity while post-filtering works better at high selectivity. HNSW variants like MA-NSW build per-attribute graphs for exact-match filter support.
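The two baseline strategies can be sketched generically, as below; ann_search, passes_filter, and the over-fetch factor are hypothetical stand-ins rather than APIs from Filtered-DiskANN, ACORN, or SeRF.

```python
import numpy as np
from typing import Callable, Sequence

def post_filter_search(ann_search: Callable[[np.ndarray, int], Sequence[int]],
                       passes_filter: Callable[[int], bool],
                       query: np.ndarray, k: int, overfetch: int = 10) -> list[int]:
    """Post-filtering: run unfiltered ANN with an over-fetch factor, then drop non-matching ids.
    Works well when a large fraction of the corpus passes the filter (high selectivity)."""
    candidates = ann_search(query, k * overfetch)
    return [i for i in candidates if passes_filter(i)][:k]

def pre_filter_search(vectors: np.ndarray, allowed_ids: Sequence[int],
                      query: np.ndarray, k: int) -> list[int]:
    """Pre-filtering: restrict to the ids that pass the filter, then search only those.
    Efficient when few ids survive the filter (low selectivity)."""
    allowed = np.asarray(list(allowed_ids))
    dists = np.linalg.norm(vectors[allowed] - query, axis=1)
    return allowed[np.argsort(dists)[:k]].tolist()
```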
ANN serves as the first stage in multi-stage retrieval pipelines for AI agents (a code sketch of the full flow follows the three stages below):
Stage 1: Candidate Generation. ANN retrieves ~100-1000 candidates from a corpus of millions or billions of embeddings in milliseconds. This reduces the search space by orders of magnitude.
Stage 2: Reranking. A more expensive model (cross-encoder, transformer reranker) scores the candidates with full attention, selecting the top-$k$ results. This two-stage approach achieves both the speed of ANN and the accuracy of neural reranking.
Stage 3: Context Assembly. The top-$k$ results are formatted and injected into the agent's context window for grounded generation.
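As noted above, a compressed sketch of this flow is given below; embed_query, ann_search, rerank_score, and documents are hypothetical interfaces standing in for whichever embedding model, ANN index, and cross-encoder a given agent stack uses.

```python
import numpy as np
from typing import Callable, Sequence

def retrieve_context(query_text: str,
                     embed_query: Callable[[str], np.ndarray],                # hypothetical embedder
                     ann_search: Callable[[np.ndarray, int], Sequence[int]],  # hypothetical ANN index
                     rerank_score: Callable[[str, str], float],               # hypothetical cross-encoder
                     documents: Sequence[str],
                     n_candidates: int = 200, top_k: int = 5) -> str:
    # Stage 1: ANN candidate generation over the embedding space (fast, approximate).
    query_vec = embed_query(query_text)
    candidate_ids = ann_search(query_vec, n_candidates)

    # Stage 2: rerank the shortlist with the expensive scorer (slow, accurate).
    reranked = sorted(candidate_ids,
                      key=lambda i: rerank_score(query_text, documents[i]),
                      reverse=True)[:top_k]

    # Stage 3: assemble the top-k passages into a context block for grounded generation.
    return "\n\n".join(documents[i] for i in reranked)
```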
In production agent systems, the choice of ANN algorithm depends on the deployment constraints: HNSW for dynamic, high-recall needs (Pinecone, Weaviate); FAISS IVF+PQ for billion-scale with GPU acceleration; ScaNN for CPU-optimized inner product search; ANNOY for lightweight, read-heavy workloads; and DiskANN for datasets too large for RAM.
Industrial applications include recommendation (Netflix, Pinterest, Spotify using ANNOY), semantic search (Google using ScaNN), and agent memory (LangChain and LlamaIndex integrating FAISS and HNSW-based vector stores).