Vector embeddings are dense numerical representations — typically lists of floating-point numbers in high-dimensional space (256–4096 dimensions) — that encode complex data like words, sentences, images, or audio while preserving semantic relationships and enabling mathematical operations. 1)
Mathematically, an embedding for input x is a function f(x) = v where v is a vector in d-dimensional real space, learned via neural networks to position similar items close together. Each dimension encodes latent features, allowing operations like vector arithmetic to capture analogies such as king - man + woman ≈ queen. 2)
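As a toy illustration of that arithmetic, the sketch below uses made-up 4-dimensional vectors (real embeddings are learned by a model and far larger) and cosine similarity to check that the analogy result lands near the expected word:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" for illustration only; real vectors are
# learned by a model and have hundreds or thousands of dimensions.
king  = np.array([0.9, 0.8, 0.1, 0.7])
man   = np.array([0.1, 0.9, 0.1, 0.6])
woman = np.array([0.1, 0.9, 0.9, 0.6])
queen = np.array([0.9, 0.8, 0.9, 0.7])

# Vector arithmetic: king - man + woman should land near queen.
analogy = king - man + woman
print(cosine_similarity(analogy, queen))  # ~1.0 for these toy vectors
```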
Introduced in the 2013 paper “Efficient Estimation of Word Representations in Vector Space” by Mikolov et al., Word2Vec trains shallow neural networks on word co-occurrences, predicting a word from its surrounding context (CBOW) or the context from a word (skip-gram). 3)
Word2Vec demonstrated that word relationships could be captured geometrically, establishing the foundation for all subsequent embedding work.
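A minimal training sketch using the gensim library (the toy corpus and hyperparameters below are illustrative, not taken from the original paper):

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus: each sentence is a list of tokens.
corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "in", "the", "city"],
    ["the", "woman", "walks", "in", "the", "city"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,  # embedding dimensionality
    window=2,        # context window for co-occurrence
    min_count=1,     # keep every word in this tiny corpus
    sg=1,            # 1 = skip-gram, 0 = CBOW
    epochs=100,
)

# Each word now has a dense vector; words seen in similar contexts end up nearby.
print(model.wv["king"].shape)                 # (50,)
print(model.wv.most_similar("king", topn=3))  # nearest neighbours in the space
```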
GloVe (Global Vectors), introduced by Pennington et al. in 2014, builds on co-occurrence statistics via matrix factorization. It optimizes log-bilinear models on global word-word co-occurrence matrices, achieving faster training and strong analogy performance compared to prediction-based methods. 4)
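For reference, the weighted least-squares objective minimized by GloVe over the co-occurrence matrix X (as given in the GloVe paper) is:

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}
$$

where $w_i$ and $\tilde{w}_j$ are word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, $V$ is the vocabulary size, and $f$ down-weights rare co-occurrences.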
Sentence-BERT (SBERT), introduced by Reimers and Gurevych in 2019, fine-tunes BERT with siamese and triplet networks using cosine similarity on sentence pairs. This enables efficient sentence-level embeddings (typically 768 dimensions) through pooling strategies like mean pooling or CLS token extraction. 5)
SBERT dramatically reduced the cost of finding similar sentences, from tens of hours of pairwise BERT inference over a large collection to a few seconds of vector comparisons, by producing fixed-size vectors that can be compared directly.
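A usage sketch with the sentence-transformers library; the model name all-MiniLM-L6-v2 is just one example checkpoint (it outputs 384-dimensional vectors rather than 768):

```python
from sentence_transformers import SentenceTransformer, util

# Example checkpoint; any SBERT-style model from the Hugging Face Hub works here.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "A man is playing a guitar.",
    "Someone is strumming an instrument.",
    "The stock market fell sharply today.",
]

# One fixed-size vector per sentence (mean pooling over token embeddings).
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities computed directly on the vectors.
scores = util.cos_sim(embeddings, embeddings)
print(scores)  # the first two sentences should score far higher than the third
```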
Similarity between embeddings is measured using distance metrics in the embedding space; the most common choices are cosine similarity, dot product, and Euclidean (L2) distance.
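A hand-rolled sketch of the three metrics using NumPy only, assuming a and b are 1-D embedding vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    # Angle-based: ignores vector length; range [-1, 1], higher = more similar.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dot_product(a, b):
    # Length-sensitive: equals cosine similarity when both vectors are L2-normalized.
    return np.dot(a, b)

def euclidean_distance(a, b):
    # Straight-line (L2) distance: lower = more similar.
    return np.linalg.norm(a - b)

a = np.array([0.2, 0.8, 0.5])
b = np.array([0.1, 0.9, 0.4])
print(cosine_similarity(a, b), dot_product(a, b), euclidean_distance(a, b))
```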
Vector databases store and query embeddings at scale using approximate nearest neighbor (ANN) indexes such as HNSW (Hierarchical Navigable Small World) and IVF (Inverted File Index). These systems can handle billions of vectors and typically support metadata filtering and scalar quantization for storage compression. 9)
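A small ANN sketch with FAISS using an HNSW index; the dimensionality, dataset size, and graph parameter below are illustrative choices rather than recommendations:

```python
import faiss
import numpy as np

d = 384                                                # embedding dimensionality
corpus = np.random.rand(10_000, d).astype("float32")   # stand-in for document embeddings
query = np.random.rand(1, d).astype("float32")

index = faiss.IndexHNSWFlat(d, 32)  # 32 = neighbours per node in the HNSW graph
index.add(corpus)                   # build the graph over all corpus vectors

distances, ids = index.search(query, 5)  # top-5 approximate nearest neighbours
print(ids[0], distances[0])
```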
In Retrieval-Augmented Generation (RAG), embeddings encode both queries and documents. Similarity search retrieves the top-k most relevant chunks from vector databases to ground LLM responses, reducing hallucinations. 10)
The typical RAG pipeline follows: embed documents → index in vector database → embed query → retrieve similar chunks → rerank results → generate response with retrieved context.
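A minimal retrieval sketch of that pipeline; reranking and the LLM call are omitted, and the model name, chunks, and prompt template are illustrative assumptions:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

# 1. Embed documents (already split into chunks) and keep the vectors as the index.
chunks = [
    "Word2Vec was introduced by Mikolov et al. in 2013.",
    "HNSW is a graph-based approximate nearest neighbour index.",
    "GloVe is trained on global word co-occurrence statistics.",
]
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

# 2. Embed the query with the same model.
query = "Which index structure is used for approximate nearest neighbour search?"
query_vector = model.encode([query], normalize_embeddings=True)[0]

# 3. Retrieve the top-k chunks by cosine similarity (dot product on normalized vectors).
scores = chunk_vectors @ query_vector
top_k = np.argsort(-scores)[:2]
context = "\n".join(chunks[i] for i in top_k)

# 4. The retrieved context is passed to an LLM along with the question.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)
```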
Techniques like PCA (Principal Component Analysis) or t-SNE project high-dimensional embeddings down to lower dimensions, with PCA commonly used for storage and search efficiency and t-SNE mainly for 2D/3D visualization. Dimensionality reduction can provide 10–100x storage savings and faster search while approximately preserving relative distances between points. 11)
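A reduction sketch with scikit-learn's PCA; the random placeholder embeddings and the 768-to-64 reduction are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.rand(1000, 768)  # placeholder for real embedding vectors

pca = PCA(n_components=64)
reduced = pca.fit_transform(embeddings)  # shape (1000, 64): 12x smaller

# Fraction of the original variance retained by the 64 components.
print(reduced.shape, pca.explained_variance_ratio_.sum())
```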
General-purpose embedding models can be fine-tuned on domain-specific data using contrastive loss on pairs of similar and dissimilar examples. This adapts models like SBERT for specialized tasks in domains such as legal or medical search, improving performance on domain-specific benchmarks.
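A fine-tuning sketch using the classic sentence-transformers training loop with a contrastive loss; the model name, example pairs, and hyperparameters are assumptions for illustration (newer library versions also provide a Trainer-based API):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # example base model to adapt

# Pairs labelled 1 (similar) or 0 (dissimilar); real fine-tuning needs far more data.
train_examples = [
    InputExample(texts=["statute of limitations", "deadline for filing a claim"], label=1),
    InputExample(texts=["statute of limitations", "corporate merger agreement"], label=0),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.ContrastiveLoss(model)  # pulls similar pairs together, pushes dissimilar apart

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
```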
The MTEB (Massive Text Embedding Benchmark) evaluates embedding models across dozens of datasets grouped into task types including retrieval, clustering, semantic textual similarity, and classification. Current leaderboards are hosted at HuggingFace. 12)
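A sketch of running a single MTEB task with the mteb package; the exact API has evolved across releases, so the task name and call signature below are assumptions based on the earlier interface:

```python
from sentence_transformers import SentenceTransformer
from mteb import MTEB

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model to evaluate

# Evaluate on a single classification task (name assumed from the MTEB task list).
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results")
print(results)
```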