====== Locality-Sensitive Hashing ======

Locality-Sensitive Hashing (LSH) is a family of algorithmic techniques that hash similar input items into the same buckets with high probability.(([[https://doi.org/10.1145/276698.276876|Indyk, P. and Motwani, R. "Approximate nearest neighbors: towards removing the curse of dimensionality." Proceedings of STOC, 1998.]])) By exploiting hash collisions rather than avoiding them, LSH enables sub-linear time [[approximate_nearest_neighbors|approximate nearest neighbor]] search in high-dimensional spaces. Introduced by Indyk and Motwani (1998) with formal theoretical guarantees, LSH remains one of the foundational methods for scalable similarity search, particularly valued for its provable bounds, low memory footprint, and suitability for streaming and insert-heavy workloads.

===== Hash Function Families =====

LSH requires hash function families where the collision probability is higher for similar items than for dissimilar items. For a distance threshold $r$ and approximation factor $c > 1$, an LSH family satisfies:

$$P[h(\mathbf{x}) = h(\mathbf{y})] \geq p_1 \quad \text{if } d(\mathbf{x}, \mathbf{y}) \leq r$$

$$P[h(\mathbf{x}) = h(\mathbf{y})] \leq p_2 \quad \text{if } d(\mathbf{x}, \mathbf{y}) > cr$$

where $p_1 > p_2$ (close points collide often, far points rarely collide).

**Random Hyperplane Projections (SimHash).** For cosine similarity or angular distance, each hash function projects vectors onto a random hyperplane and takes the sign of the dot product: $h(\mathbf{x}) = \text{sign}(\mathbf{w} \cdot \mathbf{x})$. Points with small angular separation are likely to produce the same bit. SimHash (Charikar, 2002) extends this to create compact binary signatures (fingerprints), enabling fast Hamming distance comparisons.(([[https://doi.org/10.1145/509907.509965|Charikar, M. "Similarity estimation techniques from rounding algorithms." 
Proceedings of STOC, 2002.]])) It is widely used for text similarity, near-duplicate detection, and embedding search.

**MinHash.** For Jaccard similarity on sets, MinHash estimates set overlap by applying random permutations and taking the minimum hash value. The probability that two sets share a MinHash value equals their Jaccard similarity: $P[h_\pi(A) = h_\pi(B)] = J(A, B) = \frac{|A \cap B|}{|A \cup B|}$. It is used extensively in document deduplication (e.g., near-duplicate web page detection) and data integration.

**Cross-Polytope LSH.** Uses random rotations followed by projection onto the vertices of a cross-polytope (the high-dimensional analog of an octahedron). This provides tighter theoretical guarantees for angular distance (equivalently, Euclidean distance on the unit sphere) than standard random hyperplane projections, with better practical performance in moderate dimensions. Introduced by Andoni et al. (2015).(([[https://arxiv.org/abs/1509.02897|Andoni, A. et al. "Practical and Optimal LSH for Angular Distance." arXiv:1509.02897, 2015.]]))

**$p$-Stable Distribution LSH.** For $L_p$ distances (including Euclidean, $L_2$), hash functions use random projections from $p$-stable distributions, quantized into buckets: $h(\mathbf{x}) = \lfloor \frac{\mathbf{a} \cdot \mathbf{x} + b}{w} \rfloor$, where $\mathbf{a}$ is drawn from a $p$-stable distribution, $b$ is a uniform offset, and $w$ is the bucket width. Datar et al. (2004) established this approach for $L_1$ and $L_2$ distances with formal guarantees.(([[https://doi.org/10.1145/997817.997857|Datar, M. et al. "Locality-sensitive hashing scheme based on p-stable distributions." Proceedings of SoCG, 2004.]]))

===== Multi-Probe and Multi-Table LSH =====

Raw LSH has limited recall because nearby points may fall into different buckets. Two strategies address this:

**Multi-Table LSH** builds $L$ independent hash tables, each using $k$ concatenated hash functions. A query probes all $L$ tables and unions the candidate sets. 
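The multi-table scheme just described can be sketched in a few lines of NumPy, here using random hyperplane (SimHash) hash functions. The class name and parameters are illustrative, not from any particular library; a real index would re-rank the returned candidates by exact distance.

```python
import numpy as np
from collections import defaultdict

class MultiTableSimHashLSH:
    """Multi-table LSH with random hyperplane (SimHash) hash functions.

    Each of the L tables keys points by k concatenated sign bits.
    (Illustrative sketch, not a production index.)"""

    def __init__(self, dim, k=8, L=10, seed=0):
        rng = np.random.default_rng(seed)
        # One (k x dim) matrix of random hyperplane normals per table.
        self.planes = [rng.standard_normal((k, dim)) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _keys(self, x):
        # k sign bits per table, packed into a hashable tuple.
        return [tuple((P @ x > 0).astype(np.uint8)) for P in self.planes]

    def insert(self, idx, x):
        for table, key in zip(self.tables, self._keys(x)):
            table[key].append(idx)

    def query(self, x):
        # Union of candidate ids over all L tables; caller re-ranks
        # these candidates with the exact distance.
        candidates = set()
        for table, key in zip(self.tables, self._keys(x)):
            candidates.update(table[key])
        return candidates
```

Because each table keys on all $k$ bits jointly, a single flipped bit sends a point to a different bucket in that table; the union over $L$ tables is what recovers recall.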
Increasing $L$ improves recall but increases memory ($O(nL)$) and query time linearly. The optimal setting is $L \approx n^\rho$, where $\rho = \frac{\ln(1/p_1)}{\ln(1/p_2)}$.

**Multi-Probe LSH** (Lv et al., 2007) probes nearby buckets in each table rather than building many tables, achieving similar recall with fewer tables and less memory.(([[https://doi.org/10.5555/1325851.1325958|Lv, Q. et al. "Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search." Proceedings of VLDB, 2007.]])) This is the preferred approach in practice, reducing memory by 5-10x versus multi-table LSH at equivalent recall.

**Hash Amplification** concatenates $k$ hash functions to sharpen the separation between near and far points: per-table collision probabilities become $p_1^k$ and $p_2^k$, reducing false positives at the cost of more false negatives. The interplay between $k$ (functions per table) and $L$ (number of tables) determines the overall accuracy-efficiency tradeoff.

===== LSH for Different Distance Metrics =====

LSH families exist for multiple distance metrics, making the technique versatile:

  * **Cosine similarity**: SimHash / random hyperplane projections
  * **Jaccard similarity**: MinHash with optional $b$-bit compression
  * **Euclidean distance ($L_2$)**: $p$-stable distributions or cross-polytope LSH
  * **Manhattan distance ($L_1$)**: Cauchy distribution projections
  * **Inner product ([[maximum_inner_product_search|MIPS]])**: Asymmetric LSH (Shrivastava and Li, 2014), which applies different transformations to queries and database vectors to handle norm biases(([[https://arxiv.org/abs/1405.5869|Shrivastava, A. and Li, P. "Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS)." 
arXiv:1405.5869, 2014.]]))

===== Performance Trade-offs and Parameter Tuning =====

LSH offers different strengths compared to graph-based methods like [[hnsw_graphs|HNSW]]:

^ Aspect ^ LSH ^ HNSW / Graph-Based ^
| Query Speed | Sublinear $O(n^\rho)$; GPU-friendly binary ops | Often faster in practice ($O(\log n)$ via navigation) |
| Build Time | Fast, linear in $n$; easy inserts and deletes | Slower graph construction; dynamic updates costly |
| Memory | Low (binary sketches, ~32-128 bits per point) | Higher (graph edges, ~10-100 neighbors per node) |
| Recall | Tunable 90-99% via amplification | Often superior at 99%+ precision |
| Theoretical Guarantees | Formal provable bounds | Empirical, no worst-case guarantees |
| Streaming / Updates | Excellent (rehash individual items) | Poor (graph restructuring needed) |

**When to use LSH**: Insert/deletion-heavy workloads (recommendation streams), memory-constrained environments, sparse or very high-dimensional data (text, molecular fingerprints), settings requiring provable approximation guarantees, or when binary quantization provides sufficient precision.

**When to use HNSW/graphs**: Query-heavy workloads with static or slowly changing data, dense vector search requiring >99% recall, low-latency requirements on moderate-scale datasets.
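The $k$/$L$ tuning behind these trade-offs can be made concrete: with per-function collision probabilities $p_1$ and $p_2$, a table of $k$ concatenated functions collides with probability $p^k$, and at least one of $L$ independent tables collides with probability $1 - (1 - p^k)^L$. A small numeric sketch (the probability values are illustrative, not from any benchmark):

```python
def collision_prob(p, k, L):
    """Probability of colliding with the query in at least one of
    L tables, each using k concatenated hash functions."""
    per_table = p ** k              # all k functions must agree
    return 1 - (1 - per_table) ** L

# Illustrative values: near points collide per-function with p1 = 0.9,
# far points with p2 = 0.5.
p1, p2 = 0.9, 0.5
k, L = 10, 30

near = collision_prob(p1, k, L)    # recall for near points
far = collision_prob(p2, k, L)     # false-positive rate for far points
print(f"near: {near:.3f}, far: {far:.3f}")  # near: 1.000, far: 0.029
```

A single hash function separates near from far only by 0.9 versus 0.5; after amplification, near points are recovered almost surely while under 3% of far points survive as candidates, which is exactly the false-positive/false-negative trade the $k$ and $L$ knobs control.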
===== Modern Implementations (2025) =====

  * **LIGS** (SIGIR 2025): LSH-simulated graphs that achieve 10x faster index maintenance while matching graph-based search quality, bridging LSH efficiency with graph accuracy
  * **GPU-LSH** (arXiv, 2025): Binary quantization via LSH for hard negative mining in contrastive learning, exploiting GPU parallelism on binary operations
  * **[[faiss|FAISS]]**: Includes an LSH index implementation (IndexLSH) for quick prototyping and baseline comparisons
  * **FastLSH**: Combines sampling with projections for $L_2$ distance with formal guarantees
  * **FalconLSH/pyliblocalsearch**: SimHash for embedding similarity, demonstrating 1000x speedup over exact k-NN

===== See Also =====

  * [[near_duplicate_detection|Near-Duplicate Detection and Deduplication]]
  * [[image_similarity_search|Image Similarity Search]]
  * [[hnsw_vs_ivfflat|HNSW vs IVFFlat]]

===== References =====