====== Semantic Search ======

Semantic search is a search technique that uses natural language processing and machine learning to understand the intent and contextual meaning behind a query, rather than relying on exact keyword matches. It works by converting queries and documents into vector embeddings and measuring their similarity in high-dimensional space. ((https://www.elastic.co/what-is/semantic-search|Elastic: What Is Semantic Search)) ((https://www.meilisearch.com/blog/semantic-search|Meilisearch: Semantic Search))

===== How Semantic Search Works =====

  - **Encoding**: Text (queries and documents) is transformed into numerical vectors (embeddings) by transformer models such as BERT or MPNet, or by dedicated embedding models. These vectors capture semantic meaning and the contextual relationships between words. ((https://www.wallstreetprep.com/knowledge/semantic-search/|Wall Street Prep: Semantic Search))
  - **Indexing**: Document embeddings are stored in a vector database or a search index optimized for similarity queries.
  - **Querying**: The user's query is encoded into an embedding using the same model.
  - **Retrieval**: The query embedding is compared to the stored document embeddings using a similarity metric, and the top-k most similar documents are returned.

===== Types of Embeddings =====

**Dense embeddings**: High-dimensional vectors (typically 768 to 3,072 dimensions) produced by transformer models, capturing rich semantic information across all dimensions. The most common choice for semantic search. ((https://www.wallstreetprep.com/knowledge/semantic-search/|Wall Street Prep: Semantic Search))

**Sparse embeddings**: High-dimensional but mostly-zero vectors that emphasize key terms (e.g., SPLADE and other learned sparse representations). Computationally lighter and more interpretable, bridging the gap between keyword and semantic search.
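The four steps above can be sketched end to end in plain Python. This is a minimal illustration, not a production setup: the ''embed()'' function here is a hypothetical bag-of-words stand-in for a real transformer encoder (so, unlike true dense embeddings, it cannot match synonyms), and the "index" is a plain Python list rather than a vector database.

```python
import math
from collections import Counter

# Toy stand-in for a real embedding model (e.g. a BERT- or MPNet-based
# sentence encoder). A bag-of-words count vector only illustrates the
# pipeline's shape; unlike true dense embeddings, it cannot match synonyms.
VOCAB = ["affordable", "cheap", "smartphone", "phone", "camera", "laptop", "student"]

def embed(text: str) -> list[float]:
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Steps 1-2, encoding + indexing: embed each document and store its vector.
documents = [
    "affordable smartphone with a good camera",
    "cheap laptop for a student",
    "smartphone camera accessories",
]
index = [(doc, embed(doc)) for doc in documents]

# Step 3, querying: encode the query with the same model.
query_vector = embed("cheap smartphone camera")

# Step 4, retrieval: rank documents by similarity and keep the top-k.
ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
for doc, vec in ranked[:2]:
    print(f"{cosine(query_vector, vec):.3f}  {doc}")
```

Swapping ''embed()'' for a real sentence encoder and the list for a vector database turns this sketch into the standard encode/index/query/retrieve loop, with the ranking driven by genuine semantic similarity rather than shared tokens.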
===== Vector Similarity Metrics =====

^ Metric ^ Description ^ Typical Use ^
| **Cosine similarity** | Measures the angle between vectors (range -1 to 1), ignoring magnitude | Most common for text; insensitive to document-length differences |
| **Dot product** | Scalar product of two vectors; faster than cosine but sensitive to magnitude unless vectors are normalized | Optimized vector databases with normalized embeddings |
| **Euclidean distance** | Straight-line distance in vector space; penalizes magnitude differences | Less common for text; used in some ANN configurations |

Cosine similarity is the default choice for semantic search because it ignores magnitude and therefore compensates for length differences between documents. ((https://www.merge.dev/blog/semantic-search|Merge: Semantic Search))

===== Semantic Search vs Keyword Search =====

^ Aspect ^ Semantic Search ^ Keyword Search ^
| Matching | Intent, synonyms, and context via vector similarity | Exact words and phrases |
| Strengths | Handles varied phrasing, captures meaning | Fast, simple, precise for exact terms |
| Weaknesses | Compute-intensive; depends on embedding quality | Misses synonyms; no understanding of meaning |
| Example | "affordable smartphones with good cameras" finds relevant products | Requires exact terms such as "cheap phone camera" |

((https://www.meilisearch.com/blog/semantic-search|Meilisearch: Semantic Search)) ((https://www.couchbase.com/blog/what-is-semantic-search/|Couchbase: Semantic Search))

===== Vector Databases =====

Vector databases store and query embeddings efficiently at scale: ((https://www.merge.dev/blog/semantic-search|Merge: Semantic Search))

  * **Pinecone**: Fully managed and serverless, optimized for production scale
  * **Weaviate**: Open source, with hybrid search, a GraphQL API, and a modular architecture
  * **Qdrant**: Open source, with advanced filtering, sparse and dense vector support, and Rust-based performance
  * **Milvus**: Open source and highly scalable, with sharding and replication for billion-scale datasets
  * **Chroma**: Lightweight and embeddable, suited to local development and rapid prototyping
  * **pgvector**: PostgreSQL extension enabling hybrid SQL and vector search in existing infrastructure

===== Approximate Nearest Neighbor Algorithms =====

Exact k-nearest-neighbor search is too slow for large datasets (millions to billions of vectors). ANN algorithms trade small accuracy losses for dramatically faster queries: ((https://www.elastic.co/what-is/semantic-search|Elastic: Semantic Search))

  * **HNSW (Hierarchical Navigable Small World)**: Graph-based algorithm excelling at high-recall, low-latency queries; the most widely used ANN algorithm in production vector databases.
  * **IVF (Inverted File Index)**: Clusters vectors into cells and searches only the clusters closest to the query, balancing speed and accuracy. Well suited to very large datasets.
  * **ScaNN (Scalable Nearest Neighbors)**: Google's approach, using anisotropic quantization for very fast search over dense embeddings.

===== Applications =====

  * **E-commerce**: Product recommendations driven by purchase-intent understanding rather than exact keyword matching ((https://www.meilisearch.com/blog/semantic-search|Meilisearch: Semantic Search))
  * **Enterprise search**: Finding contextually relevant documents across large corporate knowledge bases ((https://www.merge.dev/blog/semantic-search|Merge: Semantic Search))
  * **RAG**: Fetching semantically relevant chunks for LLMs to generate accurate, grounded responses
  * **Customer support**: Matching support tickets to relevant knowledge-base articles regardless of phrasing
  * **Legal and compliance**: Finding relevant precedents and regulations based on conceptual similarity

===== Limitations =====

  * **Computational cost**: Embedding generation and large-scale similarity search require significant compute resources ((https://www.merge.dev/blog/semantic-search|Merge: Semantic Search))
  * **Embedding quality**: Performance depends on the model; bias or poor generalization to niche domains can degrade results
  * **Semantic drift**: Vectors may be similar in embedding space without being truly relevant (false positives)
  * **Scalability**: Massive datasets require ANN optimization and careful index management
  * **Exact-match weakness**: Pure semantic search may miss specific identifiers, codes, or proper nouns

===== Best Practices =====

  * Use hybrid search (semantic + keyword) for the best balance of precision and recall ((https://www.meilisearch.com/blog/semantic-search|Meilisearch: Semantic Search))
  * Normalize embeddings and use cosine similarity for text
  * Deploy ANN algorithms in vector databases (HNSW is the default choice)
  * Fine-tune embedding models on domain-specific data for improved relevance
  * Monitor for drift with A/B testing and regular evaluation
  * Scale with sharding and replication for production workloads
  * Add a reranking stage for precision-critical applications

===== See Also =====

  * [[hybrid_search|Hybrid Search]]
  * [[embedding_models_comparison|Embedding Models Comparison]]
  * [[retrieval_strategies|Retrieval Strategies]]
  * [[reranking|Reranking]]

===== References =====
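Two of the practices above, normalizing embeddings and choosing a similarity metric, are connected by a simple identity: on unit-length vectors the dot product equals cosine similarity, and squared Euclidean distance reduces to 2 - 2 * cosine, so all three metrics from the table produce the same nearest-neighbor ranking. A short pure-Python check of this claim (random vectors, no vector database assumed):

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

random.seed(42)
a = normalize([random.gauss(0, 1) for _ in range(8)])
b = normalize([random.gauss(0, 1) for _ in range(8)])

# On unit vectors, the dot product *is* the cosine similarity...
assert abs(dot(a, b) - cosine(a, b)) < 1e-9
# ...and squared Euclidean distance is 2 - 2*cosine, a monotone transform,
# so all three metrics yield the same nearest-neighbor ranking.
assert abs(euclidean(a, b) ** 2 - (2 - 2 * cosine(a, b))) < 1e-9
print("metrics agree on normalized vectors")
```

This is why optimized vector databases often store normalized embeddings and use the cheaper dot product internally while still reporting cosine-equivalent scores.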