Semantic Search
Semantic search is a search technique that uses natural language processing and machine learning to understand the intent and contextual meaning behind queries, rather than relying on exact keyword matches. It works by converting queries and documents into vector embeddings and measuring their similarity in high-dimensional space.
How Semantic Search Works
Encoding: Text (queries and documents) is transformed into numerical vectors (embeddings) using transformer models like BERT, MPNet, or dedicated embedding models. These vectors capture semantic meaning and contextual relationships between words.
Indexing: Document embeddings are stored in a vector database or search index optimized for similarity queries.
Querying: The user's query is encoded into an embedding using the same model.
Retrieval: The query embedding is compared to stored document embeddings using a similarity metric, and the top-k most similar documents are returned.
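The encode/index/query/retrieve pipeline above can be sketched end to end. This is a toy illustration: the hard-coded 3-dimensional vectors stand in for what a real embedding model (e.g. a sentence-transformer producing 768+ dimensions) would generate, and the texts and values are invented for the example.

```python
import math

# Toy stand-in for a real embedding model; in practice these vectors
# would be model-generated and have hundreds of dimensions.
TOY_EMBEDDINGS = {
    "how do I reset my password": [0.9, 0.1, 0.0],
    "steps to recover account access": [0.8, 0.2, 0.1],
    "pricing for the enterprise plan": [0.0, 0.1, 0.9],
}

def encode(text):
    # Encoding step: map text to its vector (a lookup here, a model in practice).
    return TOY_EMBEDDINGS[text]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query, documents, top_k=2):
    # Querying + retrieval: encode the query, rank documents by similarity.
    q = encode(query)
    ranked = sorted(documents, key=lambda d: cosine(q, encode(d)), reverse=True)
    return ranked[:top_k]

docs = ["steps to recover account access", "pricing for the enterprise plan"]
print(search("how do I reset my password", docs, top_k=1))
# The account-recovery document ranks first despite sharing no keywords.
```

Note that the query and the best-matching document share no words; the match comes entirely from vector similarity, which is the point of the technique.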
Types of Embeddings
Dense embeddings: High-dimensional vectors (typically 768 to 3,072 dimensions) from transformer models, capturing rich semantic information across all dimensions. Most common for semantic search.
Sparse embeddings: High-dimensional but mostly zero vectors emphasizing key terms (e.g., SPLADE, learned sparse representations). Computationally lighter and interpretable, bridging the gap between keyword and semantic search.
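The storage difference between the two embedding types can be sketched as follows. The terms and weights are invented for illustration; real learned sparse models such as SPLADE produce weights over a full vocabulary.

```python
# Dense: every dimension carries signal, stored as a full array.
dense_doc = [0.12, -0.40, 0.33, 0.08]          # typically 768+ dims in practice

# Sparse: mostly zeros, so only nonzero (term -> weight) entries are stored.
sparse_doc = {"semantic": 1.4, "search": 1.1}   # illustrative weights
sparse_query = {"search": 0.9, "engine": 0.7}

def sparse_dot(a, b):
    # Only overlapping nonzero terms contribute, which keeps scoring cheap
    # and makes matches interpretable: you can see which terms fired.
    return sum(w * b[t] for t, w in a.items() if t in b)

print(sparse_dot(sparse_query, sparse_doc))  # 0.9 * 1.1, only "search" overlaps
```

The interpretability comes from the keys: unlike a dense score, a sparse score decomposes into per-term contributions.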
Vector Similarity Metrics
| Metric | Description | Typical Use |
| Cosine similarity | Measures the angle between vectors (range -1 to 1), ignoring magnitude | Most common for text; insensitive to document length differences |
| Dot product | Scalar product of vectors; faster than cosine but sensitive to magnitude unless normalized | Optimized vector databases with normalized embeddings |
| Euclidean distance | Straight-line distance in vector space; penalizes magnitude differences | Less common for text; used in some ANN configurations |
Cosine similarity is the default choice for semantic search because it ignores vector magnitude, so embeddings of documents of varying lengths can be compared fairly.
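The three metrics in the table are straightforward to implement, and doing so makes the relationship in the dot-product row concrete: once vectors are unit-normalized, dot product and cosine similarity give the same score. A minimal sketch with made-up 2-dimensional vectors:

```python
import math

def cosine(a, b):
    dot_ab = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_ab / (norm_a * norm_b)   # range -1 to 1, magnitude-free

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))  # fast, magnitude-sensitive

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a, b = [3.0, 4.0], [4.0, 3.0]
# After normalization, dot product equals cosine similarity, which is why
# vector databases often pre-normalize and use the cheaper dot product.
assert abs(cosine(a, b) - dot(normalize(a), normalize(b))) < 1e-9
print(round(cosine(a, b), 2))  # 0.96
```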
Semantic Search vs Keyword Search
| Aspect | Semantic Search | Keyword Search |
| Matching | Intent, synonyms, context via vector similarity | Exact words and phrases |
| Strengths | Handles varied phrasing, captures meaning | Fast, simple, precise for exact terms |
| Weaknesses | Compute-intensive, embedding quality dependent | Misses synonyms, no understanding of meaning |
| Example | “affordable smartphones with good cameras” finds relevant products | Requires exact terms like “cheap phone camera” |
Vector Databases
Vector databases store and query embeddings efficiently at scale:
Pinecone: Fully managed, serverless, optimized for production scale
Weaviate: Open-source with hybrid search, GraphQL API, and modular architecture
Qdrant: Open-source with advanced filtering, sparse+dense support, and Rust-based performance
Milvus: Open-source, highly scalable with sharding and replication for billion-scale datasets
Chroma: Lightweight, embeddable for local development and rapid prototyping
pgvector: PostgreSQL extension enabling hybrid SQL and vector search in existing infrastructure
Approximate Nearest Neighbor Algorithms
Exact k-nearest neighbor search is too slow for large datasets (millions to billions of vectors). ANN algorithms trade small accuracy losses for dramatically faster queries:
HNSW (Hierarchical Navigable Small World): Graph-based algorithm excelling in high-recall, low-latency queries. The most widely used ANN algorithm in production vector databases.
IVF (Inverted File Index): Clusters vectors into cells and searches only the few cells closest to the query, balancing speed and accuracy. Good for very large datasets.
ScaNN (Scalable Nearest Neighbors): Google's approach using anisotropic quantization for ultra-fast search on dense embeddings.
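The IVF idea is the simplest of the three to sketch: assign vectors to cells around centroids at index time, then probe only the cells nearest the query. This toy version uses fixed 2-dimensional centroids and invented vectors; real implementations learn centroids via k-means and probe multiple cells.

```python
import math

# Fixed centroids stand in for k-means cluster centers.
centroids = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
vectors = [[0.5, 0.2], [9.8, 9.9], [0.1, 9.7], [10.2, 10.4], [0.3, 0.9]]

def nearest_centroid(v):
    return min(range(len(centroids)), key=lambda i: math.dist(v, centroids[i]))

# Indexing: the "inverted file" maps each cell to the vectors assigned to it.
cells = {i: [] for i in range(len(centroids))}
for v in vectors:
    cells[nearest_centroid(v)].append(v)

def ivf_search(query, nprobe=1, k=1):
    # Probe only the nprobe nearest cells; raising nprobe trades speed for recall.
    probed = sorted(range(len(centroids)),
                    key=lambda i: math.dist(query, centroids[i]))[:nprobe]
    candidates = [v for i in probed for v in cells[i]]
    return sorted(candidates, key=lambda v: math.dist(query, v))[:k]

print(ivf_search([9.9, 10.0]))  # [[9.8, 9.9]]
```

With nprobe=1 the query above compares against only the two vectors in one cell rather than all five, which is where the speedup comes from at scale; the accuracy loss appears when the true nearest neighbor sits in an unprobed cell.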
Applications
E-commerce: Product recommendations via purchase intent understanding rather than exact keyword matching
Enterprise search: Finding contextually relevant documents across large corporate knowledge bases
RAG: Fetching semantically relevant chunks for LLMs to generate accurate, grounded responses
Customer support: Matching support tickets to relevant knowledge base articles regardless of phrasing
Legal and compliance: Finding relevant precedents and regulations based on conceptual similarity
Limitations
Computational cost: Embedding generation and large-scale similarity search require significant compute resources
Embedding quality: Performance depends on the model; biases or poor generalization to niche domains can degrade results
Semantic drift: Vectors may be similar in embedding space without being truly relevant (false positives)
Scalability: Massive datasets require ANN optimization and careful index management
Exact match weakness: Pure semantic search may miss specific identifiers, codes, or proper nouns
Best Practices
Use hybrid search (semantic + keyword) for the best precision and recall balance
Normalize embeddings and use cosine similarity for text
Use ANN indexes (HNSW is the common default) in vector databases
Fine-tune embedding models on domain-specific data for improved relevance
Monitor for drift with A/B testing and regular evaluation
Scale with sharding and replication for production workloads
Add a reranking stage for precision-critical applications
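The first practice above, hybrid search, needs a way to merge a keyword result list and a semantic result list whose scores are on different scales. Reciprocal rank fusion (RRF) is a common technique for this because it uses only ranks, not raw scores. A minimal sketch with hypothetical document IDs:

```python
def rrf(rankings, k=60):
    # Reciprocal rank fusion: each list contributes 1 / (k + rank) per document.
    # k=60 is the conventional default from the original RRF formulation.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_a", "doc_c", "doc_b"]   # e.g. from BM25
semantic_hits = ["doc_b", "doc_a", "doc_d"]  # e.g. from vector search

print(rrf([keyword_hits, semantic_hits]))
# ['doc_a', 'doc_b', 'doc_c', 'doc_d']
```

Documents that appear high in both lists (doc_a, doc_b) rise to the top, while documents found by only one retriever still survive into the fused list, which is the precision/recall balance hybrid search is after.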