Locality-Sensitive Hashing (LSH) is a family of algorithmic techniques that hash similar input items into the same buckets with high probability. By exploiting hash collisions rather than avoiding them, LSH enables sub-linear time approximate nearest neighbor search in high-dimensional spaces. Introduced by Indyk and Motwani (1998) with formal theoretical guarantees, LSH remains one of the foundational methods for scalable similarity search, particularly valued for its provable bounds, low memory footprint, and suitability for streaming and insert-heavy workloads.
LSH requires hash function families where the collision probability is higher for similar items than for dissimilar items. For a distance threshold $r$ and approximation factor $c > 1$, an LSH family satisfies:
$$P[h(\mathbf{x}) = h(\mathbf{y})] \geq p_1 \quad \text{if } d(\mathbf{x}, \mathbf{y}) \leq r$$
$$P[h(\mathbf{x}) = h(\mathbf{y})] \leq p_2 \quad \text{if } d(\mathbf{x}, \mathbf{y}) > cr$$
where $p_1 > p_2$ (close points collide often, far points rarely collide).
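As a concrete instance (using the random-hyperplane family described below): a single hyperplane hash collides with probability determined by the angle $\theta$ between the vectors,

$$P[h(\mathbf{x}) = h(\mathbf{y})] = 1 - \frac{\theta(\mathbf{x}, \mathbf{y})}{\pi}$$

so for angular distance thresholds $r$ and $cr$ the definition is satisfied with $p_1 = 1 - r/\pi$ and $p_2 = 1 - cr/\pi$, and $p_1 > p_2$ whenever $c > 1$.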
Random Hyperplane Projections (SimHash). For cosine similarity or angular distance, each hash function projects a vector onto a random direction $\mathbf{w}$ and takes the sign of the dot product: $h(\mathbf{x}) = \text{sign}(\mathbf{w} \cdot \mathbf{x})$. Points with small angular separation are likely to produce the same bit. SimHash (Charikar, 2002) extends this to compact binary signatures (fingerprints) that support fast Hamming-distance comparison; it is widely used for text similarity, near-duplicate detection, and embedding search.
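A minimal NumPy sketch of SimHash fingerprinting (function names and parameter choices here are illustrative, not taken from any particular library):

```python
import numpy as np

def simhash_signature(x, hyperplanes):
    """b-bit fingerprint: the sign of x's projection onto each random hyperplane."""
    return (hyperplanes @ x >= 0).astype(np.uint8)

def hamming(sig_a, sig_b):
    """Hamming distance between two binary signatures."""
    return int(np.sum(sig_a != sig_b))

rng = np.random.default_rng(0)
dim, bits = 128, 64
planes = rng.standard_normal((bits, dim))  # one random hyperplane per bit

x = rng.standard_normal(dim)
y = x + 0.1 * rng.standard_normal(dim)  # near-duplicate of x
z = rng.standard_normal(dim)            # unrelated vector

print(hamming(simhash_signature(x, planes), simhash_signature(y, planes)))  # small
print(hamming(simhash_signature(x, planes), simhash_signature(z, planes)))  # ~bits/2
```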
MinHash. For Jaccard similarity on sets, MinHash estimates set overlap by applying random permutations to the element universe and recording each set's minimum hash value. The probability that two sets share a MinHash value equals their Jaccard similarity: $P[h_\pi(A) = h_\pi(B)] = J(A, B) = \frac{|A \cap B|}{|A \cup B|}$. It is used extensively in document deduplication (e.g., near-duplicate web page detection) and data integration.
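A sketch of MinHash using salted linear hash functions in place of true random permutations (a standard practical substitution; all names below are illustrative):

```python
import numpy as np

PRIME = (1 << 31) - 1  # Mersenne prime; keeps products inside uint64

def minhash_signature(items, a, b):
    """Signature: for each h_i(x) = (a_i*x + b_i) mod PRIME, the minimum over the set.
    Note: Python's string hash is salted per process, so values vary across runs."""
    xs = np.array([hash(it) % PRIME for it in items], dtype=np.uint64)
    hashes = (a[:, None] * xs[None, :] + b[:, None]) % PRIME  # rows = hash functions
    return hashes.min(axis=1)

rng = np.random.default_rng(0)
num_hashes = 256
a = rng.integers(1, PRIME, size=num_hashes, dtype=np.uint64)
b = rng.integers(0, PRIME, size=num_hashes, dtype=np.uint64)

A = set("the quick brown fox jumps over the lazy dog".split())
B = set("the quick brown fox leaps over a lazy dog".split())

# Fraction of matching signature slots estimates the Jaccard similarity.
est = np.mean(minhash_signature(A, a, b) == minhash_signature(B, a, b))
print(f"estimated Jaccard: {est:.2f}, exact: {len(A & B) / len(A | B):.2f}")  # ~0.70
```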
Cross-Polytope LSH. Applies a random rotation, then hashes to the nearest vertex of a cross-polytope (the high-dimensional analog of an octahedron), i.e., the signed coordinate axis closest to the rotated vector. This provides tighter theoretical guarantees for angular distance (and hence Euclidean distance on normalized vectors) than standard random projections, with better practical performance in moderate dimensions. Introduced by Andoni et al., 2015.
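A minimal sketch of the hash itself. The random rotation here is a dense orthogonal matrix from a QR decomposition; practical implementations use fast pseudo-random rotations instead:

```python
import numpy as np

def cross_polytope_hash(x, rotation):
    """Hash to the nearest cross-polytope vertex (+/- e_i) after rotation:
    the index of the largest-magnitude coordinate, tagged with its sign."""
    z = rotation @ x
    i = int(np.argmax(np.abs(z)))
    return (i, 1 if z[i] >= 0 else -1)

rng = np.random.default_rng(0)
dim = 64
rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))  # random orthogonal matrix

x = rng.standard_normal(dim)
y = x + 0.05 * rng.standard_normal(dim)  # small perturbation: likely same vertex
print(cross_polytope_hash(x, rotation), cross_polytope_hash(y, rotation))
```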
$p$-Stable Distribution LSH. For $L_p$ distances (including Euclidean, $L_2$), hash functions quantize a random projection into buckets: $h(\mathbf{x}) = \lfloor \frac{\mathbf{a} \cdot \mathbf{x} + b}{w} \rfloor$, where $\mathbf{a}$ is drawn from a $p$-stable distribution, $b$ is a random offset, and $w$ is the bucket width. Datar et al. (2004) established this approach for $L_1$ and $L_2$ distances with formal guarantees.
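A sketch of a single bucket hash in this scheme (Gaussian entries are 2-stable, giving the $L_2$ case; the bucket width below is an arbitrary illustrative value):

```python
import numpy as np

def pstable_hash(x, a, b, w):
    """h(x) = floor((a . x + b) / w), with Gaussian (2-stable) a and b ~ U[0, w)."""
    return int(np.floor((a @ x + b) / w))

rng = np.random.default_rng(0)
dim, w = 128, 4.0                 # w trades collision rate against precision
a = rng.standard_normal(dim)      # 2-stable projection direction
b = rng.uniform(0, w)             # random offset so bucket boundaries are unbiased

x = rng.standard_normal(dim)
y = x + 0.01 * rng.standard_normal(dim)  # close in L2: likely the same bucket
print(pstable_hash(x, a, b, w), pstable_hash(y, a, b, w))
```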
Raw LSH has limited recall because nearby points may fall into different buckets. Three complementary strategies address this:
Multi-Table LSH builds $L$ independent hash tables, each using $k$ concatenated hash functions. A query probes all $L$ tables and unions the candidate sets. Increasing $L$ improves recall but increases memory ($O(nL)$) and query time linearly. The optimal setting is $L \approx n^\rho$, where $\rho = \frac{\ln(1/p_1)}{\ln(1/p_2)}$.
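A compact sketch of a multi-table index over SimHash keys (class and parameter names are our own; this is illustrative, not a tuned implementation):

```python
import numpy as np
from collections import defaultdict

class MultiTableLSH:
    """L independent tables, each keyed by k concatenated SimHash bits."""

    def __init__(self, dim, k=12, L=6, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((L, k, dim))  # k hyperplanes per table
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, t, x):
        bits = (self.planes[t] @ x >= 0).astype(np.uint8)
        return bits.tobytes()  # k concatenated bits form the bucket key

    def insert(self, idx, x):
        for t in range(len(self.tables)):
            self.tables[t][self._key(t, x)].append(idx)

    def query(self, x):
        """Union of candidates over all L tables; rerank exactly afterwards."""
        cands = set()
        for t in range(len(self.tables)):
            cands.update(self.tables[t].get(self._key(t, x), []))
        return cands

rng = np.random.default_rng(1)
data = rng.standard_normal((1000, 64))
index = MultiTableLSH(dim=64)
for i, v in enumerate(data):
    index.insert(i, v)
print(len(index.query(data[0])))  # candidate set; always contains item 0 itself
```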
Multi-Probe LSH (Lv et al., 2007) probes nearby buckets in each table rather than building many tables, achieving similar recall with fewer tables and less memory. This is the preferred approach in practice, reducing memory by 5-10x versus multi-table LSH at equivalent recall.
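A simplified multi-probe query over the sketch above, probing the base bucket plus buckets whose key differs in one bit (a crude stand-in for the query-directed perturbation sequences of Lv et al.):

```python
def multiprobe_query(index, x, extra_probes=4):
    """Probe the base bucket and single-bit-flip neighbors in each table."""
    cands = set()
    for t in range(len(index.tables)):
        bits = (index.planes[t] @ x >= 0).astype(np.uint8)
        keys = [bits.tobytes()]
        for i in range(min(extra_probes, len(bits))):  # flip one bit at a time
            flipped = bits.copy()
            flipped[i] ^= 1
            keys.append(flipped.tobytes())
        for key in keys:
            cands.update(index.tables[t].get(key, []))
    return cands

print(len(multiprobe_query(index, data[0])))  # >= the plain query's candidate count
```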
Hash Amplification concatenates $k$ hash functions within each table, so a candidate must match on all $k$; this widens the gap between $p_1$ and $p_2$ (single-table collision probabilities become $p_1^k$ and $p_2^k$), reducing false positives at the cost of more false negatives. The interplay between $k$ (functions per table) and $L$ (number of tables) determines the overall accuracy-efficiency tradeoff.
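The arithmetic behind this tradeoff: with per-function collision probability $p$, a point matches one table with probability $p^k$ and is retrieved if it matches at least one of the $L$ tables. A quick check with illustrative values:

```python
def hit_prob(p: float, k: int, L: int) -> float:
    """Probability a point lands in the candidate set: 1 - (1 - p^k)^L."""
    return 1 - (1 - p ** k) ** L

# A near pair (p1 = 0.9) stays retrievable while a far pair (p2 = 0.5) is filtered:
print(hit_prob(0.9, 8, 8), hit_prob(0.5, 8, 8))      # ~0.99  vs ~0.03
print(hit_prob(0.9, 16, 32), hit_prob(0.5, 16, 32))  # ~0.999 vs ~0.0005
```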
LSH families exist for multiple distance metrics, making the technique versatile: random hyperplanes and cross-polytopes for angular/cosine distance, MinHash for Jaccard similarity on sets, and $p$-stable projections for $L_1$ and $L_2$ distances.
LSH offers different strengths compared to graph-based methods like HNSW:
| Aspect | LSH | HNSW / Graph-Based |
|---|---|---|
| Query Speed | Sublinear $O(n^\rho)$; GPU-friendly binary ops | Often faster in practice ($O(\log n)$ via navigation) |
| Build Time | Fast, linear in $n$; easy inserts and deletes | Slower graph construction; dynamic updates costly |
| Memory | Low (binary sketches, ~32-128 bits per point) | Higher (graph edges, ~10-100 neighbors per node) |
| Recall | Tunable, 90-99% via amplification | Often superior at 99%+ recall |
| Theoretical Guarantees | Formal provable bounds | Empirical; no worst-case guarantees |
| Streaming / Updates | Excellent (rehash individual items) | Poor (graph restructuring needed) |
When to use LSH: Insert/deletion-heavy workloads (recommendation streams), memory-constrained environments, sparse or very high-dimensional data (text, molecular fingerprints), settings requiring provable approximation guarantees, or when binary quantization provides sufficient precision.
When to use HNSW/graphs: Query-heavy workloads with static or slowly changing data, dense vector search requiring >99% recall, low-latency requirements on moderate-scale datasets.