Locality-Sensitive Hashing (LSH) is a family of algorithmic techniques that hash similar input items into the same buckets with high probability. By exploiting hash collisions rather than avoiding them, LSH enables sub-linear time approximate nearest neighbor search in high-dimensional spaces. Introduced by Indyk and Motwani (1998) with formal theoretical guarantees, LSH remains one of the foundational methods for scalable similarity search, particularly valued for its provable bounds, low memory footprint, and suitability for streaming and insert-heavy workloads.
LSH requires hash function families where the collision probability is higher for similar items than for dissimilar items. For a distance threshold $r$ and approximation factor $c > 1$, an LSH family satisfies:
$$P[h(\mathbf{x}) = h(\mathbf{y})] \geq p_1 \quad \text{if } d(\mathbf{x}, \mathbf{y}) \leq r$$
$$P[h(\mathbf{x}) = h(\mathbf{y})] \leq p_2 \quad \text{if } d(\mathbf{x}, \mathbf{y}) > cr$$
where $p_1 > p_2$ (close points collide often, far points rarely collide).
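As a concrete instance (using the random-hyperplane family described below): a single hyperplane hash collides with probability determined by the angle $\theta$ between the vectors,

$$P[h(\mathbf{x}) = h(\mathbf{y})] = 1 - \frac{\theta(\mathbf{x}, \mathbf{y})}{\pi}$$

so for angular distance thresholds $r$ and $cr$ the definition is satisfied with $p_1 = 1 - r/\pi$ and $p_2 = 1 - cr/\pi$, and $p_1 > p_2$ whenever $c > 1$.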
Random Hyperplane Projections (SimHash). For cosine similarity or angular distance, each hash function projects a vector onto a random direction $\mathbf{w}$ and takes the sign of the dot product: $h(\mathbf{x}) = \text{sign}(\mathbf{w} \cdot \mathbf{x})$. Points with small angular separation are likely to produce the same bit. SimHash (Charikar, 2002) extends this to compact binary signatures (fingerprints) that support fast Hamming-distance comparison; it is widely used for text similarity, near-duplicate detection, and embedding search.
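A minimal NumPy sketch of SimHash fingerprinting (function names and parameter choices here are illustrative, not taken from any particular library):

```python
import numpy as np

def simhash_signature(x, hyperplanes):
    """b-bit fingerprint: the sign of x's projection onto each random hyperplane."""
    return (hyperplanes @ x >= 0).astype(np.uint8)

def hamming(sig_a, sig_b):
    """Hamming distance between two binary signatures."""
    return int(np.sum(sig_a != sig_b))

rng = np.random.default_rng(0)
dim, bits = 128, 64
planes = rng.standard_normal((bits, dim))  # one random hyperplane per bit

x = rng.standard_normal(dim)
y = x + 0.1 * rng.standard_normal(dim)  # near-duplicate of x
z = rng.standard_normal(dim)            # unrelated vector

print(hamming(simhash_signature(x, planes), simhash_signature(y, planes)))  # small
print(hamming(simhash_signature(x, planes), simhash_signature(z, planes)))  # ~bits/2
```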
MinHash. For Jaccard similarity on sets, MinHash estimates set overlap by applying random permutations to the element universe and recording each set's minimum hash value. The probability that two sets share a MinHash value equals their Jaccard similarity: $P[h_\pi(A) = h_\pi(B)] = J(A, B) = \frac{|A \cap B|}{|A \cup B|}$. It is used extensively in document deduplication (e.g., near-duplicate web page detection) and data integration.
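A sketch of MinHash using salted linear hash functions in place of true random permutations (a standard practical substitution; all names below are illustrative):

```python
import numpy as np

PRIME = (1 << 31) - 1  # Mersenne prime; keeps products inside uint64

def minhash_signature(items, a, b):
    """Signature: for each h_i(x) = (a_i*x + b_i) mod PRIME, the minimum over the set.
    Note: Python's string hash is salted per process, so values vary across runs."""
    xs = np.array([hash(it) % PRIME for it in items], dtype=np.uint64)
    hashes = (a[:, None] * xs[None, :] + b[:, None]) % PRIME  # rows = hash functions
    return hashes.min(axis=1)

rng = np.random.default_rng(0)
num_hashes = 256
a = rng.integers(1, PRIME, size=num_hashes, dtype=np.uint64)
b = rng.integers(0, PRIME, size=num_hashes, dtype=np.uint64)

A = set("the quick brown fox jumps over the lazy dog".split())
B = set("the quick brown fox leaps over a lazy dog".split())

# Fraction of matching signature slots estimates the Jaccard similarity.
est = np.mean(minhash_signature(A, a, b) == minhash_signature(B, a, b))
print(f"estimated Jaccard: {est:.2f}, exact: {len(A & B) / len(A | B):.2f}")  # ~0.70
```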
Cross-Polytope LSH. Applies a random rotation, then hashes to the nearest vertex of a cross-polytope (the high-dimensional analog of an octahedron), i.e., the signed coordinate axis closest to the rotated vector. This provides tighter theoretical guarantees for angular distance (and hence Euclidean distance on normalized vectors) than standard random projections, with better practical performance in moderate dimensions. Introduced by Andoni et al., 2015.
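A minimal sketch of the hash itself. The random rotation here is a dense orthogonal matrix from a QR decomposition; practical implementations use fast pseudo-random rotations instead:

```python
import numpy as np

def cross_polytope_hash(x, rotation):
    """Hash to the nearest cross-polytope vertex (+/- e_i) after rotation:
    the index of the largest-magnitude coordinate, tagged with its sign."""
    z = rotation @ x
    i = int(np.argmax(np.abs(z)))
    return (i, 1 if z[i] >= 0 else -1)

rng = np.random.default_rng(0)
dim = 64
rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))  # random orthogonal matrix

x = rng.standard_normal(dim)
y = x + 0.05 * rng.standard_normal(dim)  # small perturbation: likely same vertex
print(cross_polytope_hash(x, rotation), cross_polytope_hash(y, rotation))
```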
$p$-Stable Distribution LSH. For $L_p$ distances (including Euclidean, $L_2$), hash functions quantize a random projection into buckets: $h(\mathbf{x}) = \lfloor \frac{\mathbf{a} \cdot \mathbf{x} + b}{w} \rfloor$, where $\mathbf{a}$ is drawn from a $p$-stable distribution, $b$ is a random offset, and $w$ is the bucket width. Datar et al. (2004) established this approach for $L_1$ and $L_2$ distances with formal guarantees.
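A sketch of a single bucket hash in this scheme (Gaussian entries are 2-stable, giving the $L_2$ case; the bucket width below is an arbitrary illustrative value):

```python
import numpy as np

def pstable_hash(x, a, b, w):
    """h(x) = floor((a . x + b) / w), with Gaussian (2-stable) a and b ~ U[0, w)."""
    return int(np.floor((a @ x + b) / w))

rng = np.random.default_rng(0)
dim, w = 128, 4.0                 # w trades collision rate against precision
a = rng.standard_normal(dim)      # 2-stable projection direction
b = rng.uniform(0, w)             # random offset so bucket boundaries are unbiased

x = rng.standard_normal(dim)
y = x + 0.01 * rng.standard_normal(dim)  # close in L2: likely the same bucket
print(pstable_hash(x, a, b, w), pstable_hash(y, a, b, w))
```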
Raw LSH has limited recall because nearby points may fall into different buckets. Three complementary strategies address this:
Multi-Table LSH builds $L$ independent hash tables, each using $k$ concatenated hash functions. A query probes all $L$ tables and unions the candidate sets. Increasing $L$ improves recall but increases memory ($O(nL)$) and query time linearly. The optimal setting is $L \approx n^\rho$, where $\rho = \frac{\ln(1/p_1)}{\ln(1/p_2)}$.
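A compact sketch of a multi-table index over SimHash keys (class and parameter names are our own; this is illustrative, not a tuned implementation):

```python
import numpy as np
from collections import defaultdict

class MultiTableLSH:
    """L independent tables, each keyed by k concatenated SimHash bits."""

    def __init__(self, dim, k=12, L=6, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((L, k, dim))  # k hyperplanes per table
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, t, x):
        bits = (self.planes[t] @ x >= 0).astype(np.uint8)
        return bits.tobytes()  # k concatenated bits form the bucket key

    def insert(self, idx, x):
        for t in range(len(self.tables)):
            self.tables[t][self._key(t, x)].append(idx)

    def query(self, x):
        """Union of candidates over all L tables; rerank exactly afterwards."""
        cands = set()
        for t in range(len(self.tables)):
            cands.update(self.tables[t].get(self._key(t, x), []))
        return cands

rng = np.random.default_rng(1)
data = rng.standard_normal((1000, 64))
index = MultiTableLSH(dim=64)
for i, v in enumerate(data):
    index.insert(i, v)
print(len(index.query(data[0])))  # candidate set; always contains item 0 itself
```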
Multi-Probe LSH (Lv et al., 2007) probes nearby buckets in each table rather than building many tables, achieving similar recall with fewer tables and less memory. This is the preferred approach in practice, reducing memory by 5-10x versus multi-table LSH at equivalent recall.
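A simplified multi-probe query over the sketch above, probing the base bucket plus buckets whose key differs in one bit (a crude stand-in for the query-directed perturbation sequences of Lv et al.):

```python
def multiprobe_query(index, x, extra_probes=4):
    """Probe the base bucket and single-bit-flip neighbors in each table."""
    cands = set()
    for t in range(len(index.tables)):
        bits = (index.planes[t] @ x >= 0).astype(np.uint8)
        keys = [bits.tobytes()]
        for i in range(min(extra_probes, len(bits))):  # flip one bit at a time
            flipped = bits.copy()
            flipped[i] ^= 1
            keys.append(flipped.tobytes())
        for key in keys:
            cands.update(index.tables[t].get(key, []))
    return cands

print(len(multiprobe_query(index, data[0])))  # >= the plain query's candidate count
```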
Hash Amplification concatenates $k$ hash functions within each table, so a candidate must match on all $k$; this widens the gap between $p_1$ and $p_2$ (single-table collision probabilities become $p_1^k$ and $p_2^k$), reducing false positives at the cost of more false negatives. The interplay between $k$ (functions per table) and $L$ (number of tables) determines the overall accuracy-efficiency tradeoff.
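The arithmetic behind this tradeoff: with per-function collision probability $p$, a point matches one table with probability $p^k$ and is retrieved if it matches at least one of the $L$ tables. A quick check with illustrative values:

```python
def hit_prob(p: float, k: int, L: int) -> float:
    """Probability a point lands in the candidate set: 1 - (1 - p^k)^L."""
    return 1 - (1 - p ** k) ** L

# A near pair (p1 = 0.9) stays retrievable while a far pair (p2 = 0.5) is filtered:
print(hit_prob(0.9, 8, 8), hit_prob(0.5, 8, 8))      # ~0.99  vs ~0.03
print(hit_prob(0.9, 16, 32), hit_prob(0.5, 16, 32))  # ~0.999 vs ~0.0005
```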
LSH families exist for multiple distance metrics, making the technique versatile: random hyperplanes and cross-polytopes for angular/cosine distance, MinHash for Jaccard similarity on sets, and $p$-stable projections for $L_1$ and $L_2$ distances.
LSH offers different strengths compared to graph-based methods like HNSW:
| Aspect | LSH | HNSW / Graph-Based |
|---|---|---|
| Query Speed | Sublinear $O(n^\rho)$; GPU-friendly binary ops | Often faster in practice ($O(\log n)$ via navigation) |
| Build Time | Fast, linear in $n$; easy inserts and deletes | Slower graph construction; dynamic updates costly |
| Memory | Low (binary sketches, ~32-128 bits per point) | Higher (graph edges, ~10-100 neighbors per node) |
| Recall | Tunable, 90-99% via amplification | Often superior at 99%+ recall |
| Theoretical Guarantees | Formal provable bounds | Empirical; no worst-case guarantees |
| Streaming / Updates | Excellent (rehash individual items) | Poor (graph restructuring needed) |
When to use LSH: Insert/deletion-heavy workloads (recommendation streams), memory-constrained environments, sparse or very high-dimensional data (text, molecular fingerprints), settings requiring provable approximation guarantees, or when binary quantization provides sufficient precision.
When to use HNSW/graphs: Query-heavy workloads with static or slowly changing data, dense vector search requiring >99% recall, low-latency requirements on moderate-scale datasets.