Vector Embeddings

Vector embeddings are dense numerical representations — typically lists of floating-point numbers in high-dimensional space (256–4096 dimensions) — that encode complex data like words, sentences, images, or audio while preserving semantic relationships and enabling mathematical operations. 1)

Mathematically, an embedding for input x is a function f(x) = v where v is a vector in d-dimensional real space, learned via neural networks to position similar items close together. Each dimension encodes latent features, allowing operations like vector arithmetic to capture analogies such as king - man + woman ≈ queen. 2)
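The analogy above can be reproduced with plain vector arithmetic. This is a minimal sketch using hand-picked 3-dimensional toy vectors (real models learn 256–4096 dimensions); the vectors are illustrative, not learned:

```python
import numpy as np

# Toy hand-picked "embeddings"; real ones are learned by a neural network.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vector arithmetic: king - man + woman should land nearest to queen.
target = emb["king"] - emb["man"] + emb["woman"]
nearest = max(emb, key=lambda w: cosine(emb[w], target))
print(nearest)  # queen
```

With these toy vectors the nearest neighbor of the arithmetic result is indeed "queen"; with real learned embeddings the input words are usually excluded from the candidate set before taking the nearest neighbor.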

Word2Vec

Introduced in the 2013 paper “Efficient Estimation of Word Representations in Vector Space” by Mikolov et al., Word2Vec trains shallow neural networks on word co-occurrences. 3)

Word2Vec demonstrated that word relationships could be captured geometrically, establishing the foundation for subsequent embedding work.
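The shallow-network training described above can be sketched in a few lines. This is a toy skip-gram variant with a full softmax (real Word2Vec uses hierarchical softmax or negative sampling for speed); the corpus, dimensions, and learning rate are arbitrary illustrative choices:

```python
import numpy as np

# Toy skip-gram: predict each context word from the center word.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D, window, lr = len(vocab), 8, 2, 0.1

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))   # center-word vectors (the embeddings)
W_out = rng.normal(scale=0.1, size=(V, D))  # context-word vectors

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for epoch in range(50):
    for pos, word in enumerate(corpus):
        c = idx[word]
        for off in range(-window, window + 1):
            if off == 0 or not 0 <= pos + off < len(corpus):
                continue
            o = idx[corpus[pos + off]]
            h = W_in[c]
            p = softmax(W_out @ h)            # predicted context distribution
            grad = p.copy(); grad[o] -= 1.0   # cross-entropy gradient
            grad_h = W_out.T @ grad
            W_out -= lr * np.outer(grad, h)
            W_in[c] -= lr * grad_h

# After training, each row of W_in is that word's embedding.
```

Words that appear in similar contexts (here, "mat" and "rug") end up with similar rows in `W_in`, which is the geometric structure the section describes.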

GloVe

GloVe (Global Vectors), introduced by Pennington et al. in 2014, builds on co-occurrence statistics via matrix factorization. It optimizes log-bilinear models on global word-word co-occurrence matrices, achieving faster training and strong analogy performance compared to prediction-based methods. 4)
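The log-bilinear objective mentioned above is a weighted least-squares loss, J = Σᵢⱼ f(Xᵢⱼ)(wᵢ·w̃ⱼ + bᵢ + b̃ⱼ − log Xᵢⱼ)², with weighting f(x) = (x/x_max)^α below a cap and 1 above it. A sketch of evaluating it on a toy co-occurrence matrix (all sizes and vectors here are random placeholders, not trained values):

```python
import numpy as np

# Toy co-occurrence counts and randomly initialized parameters.
rng = np.random.default_rng(1)
V, D = 4, 5
X = rng.integers(1, 20, size=(V, V)).astype(float)
w, w_tilde = rng.normal(size=(V, D)), rng.normal(size=(V, D))
b, b_tilde = rng.normal(size=V), rng.normal(size=V)

def weight(x, x_max=100.0, alpha=0.75):
    # Down-weights rare co-occurrences, caps frequent ones.
    return (x / x_max) ** alpha if x < x_max else 1.0

# J = sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
J = sum(
    weight(X[i, j]) * (w[i] @ w_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])) ** 2
    for i in range(V) for j in range(V)
)
print(J)
```

Training minimizes J, so each dot product wᵢ·w̃ⱼ is pushed toward the log of the global co-occurrence count, which is what ties the vectors to corpus-wide statistics rather than local prediction.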

Sentence Transformers

Sentence-BERT (SBERT), introduced by Reimers and Gurevych in 2019, fine-tunes BERT with siamese and triplet networks using cosine similarity on sentence pairs. This enables efficient sentence-level embeddings (typically 768 dimensions) through pooling strategies like mean pooling or CLS token extraction. 5)

SBERT dramatically reduced the computational cost of finding similar sentences (in the original paper, from roughly 65 hours to a few seconds for finding the most similar pair among 10,000 sentences) by producing fixed-size vectors that can be compared directly.
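The mean-pooling step mentioned above is simple to show directly: average the per-token embeddings, skipping padding positions. The token vectors and attention mask below are toy values standing in for a real BERT output:

```python
import numpy as np

# Toy BERT-style output: 6 token embeddings of 768 dimensions each.
token_embs = np.random.default_rng(2).normal(size=(6, 768))
attention_mask = np.array([1, 1, 1, 1, 0, 0])  # last two positions are padding

# Mean pooling: average only over real (non-padding) tokens.
mask = attention_mask[:, None].astype(float)
sentence_emb = (token_embs * mask).sum(axis=0) / mask.sum()

print(sentence_emb.shape)  # (768,)
```

The result is one fixed-size vector per sentence, which is what makes direct cosine comparison between sentences cheap.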

Modern Embedding Models

Similarity between embeddings is measured using distance metrics in the embedding space; the most common are cosine similarity, Euclidean distance, and dot product.
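The three common measures can be computed in a few lines; the two toy vectors below are chosen so one is a scaled copy of the other, which makes the cosine result easy to verify:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # b = 2a, so the angle between them is 0

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)
dot = a @ b

print(cosine)     # 1.0
print(euclidean)  # ~3.742 (sqrt(14))
print(dot)        # 28.0
```

Note that cosine similarity ignores magnitude while dot product does not; for unit-normalized vectors the two rankings coincide.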

Vector Databases

Vector databases store and query embeddings at scale using approximate nearest neighbor (ANN) indexes such as HNSW (Hierarchical Navigable Small World) and IVF (Inverted File Index).

These databases handle billions of vectors with metadata filtering and scalar quantization for storage compression. 9)
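What a vector database does can be shown in miniature with exact (brute-force) top-k search plus a metadata filter; real systems replace the linear scan with ANN indexes like HNSW or IVF, and all data here is randomly generated for illustration:

```python
import numpy as np

# Toy "index": 1000 unit-normalized 64-dim vectors with metadata.
rng = np.random.default_rng(3)
vectors = rng.normal(size=(1000, 64))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
metadata = [{"lang": "en" if i % 2 == 0 else "de"} for i in range(1000)]

def search(query, k=5, lang=None):
    q = query / np.linalg.norm(query)
    scores = vectors @ q                      # cosine similarity (unit vectors)
    if lang is not None:                      # metadata filter
        keep = np.array([m["lang"] == lang for m in metadata])
        scores = np.where(keep, scores, -np.inf)
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]

hits = search(rng.normal(size=64), k=3, lang="en")
```

The brute-force scan is O(n) per query; ANN indexes trade a small amount of recall for sub-linear query time, which is what makes billion-vector collections practical.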

Use in RAG Pipelines

In Retrieval-Augmented Generation (RAG), embeddings encode both queries and documents. Similarity search retrieves the top-k most relevant chunks from vector databases to ground LLM responses, reducing hallucinations. 10)

The typical RAG pipeline follows: embed documents → index in vector database → embed query → retrieve similar chunks → rerank results → generate response with retrieved context.
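The retrieval steps of that pipeline can be sketched end to end. The `embed` function below is a toy stand-in (a fixed random projection of character counts) for a real embedding model, and the document set is invented for illustration:

```python
import numpy as np

def embed(text, dim=32):
    # Toy embedding: character counts pushed through a fixed random projection.
    counts = np.zeros(256)
    for ch in text.lower():
        counts[ord(ch) % 256] += 1
    rng = np.random.default_rng(42)           # same projection for every text
    proj = rng.normal(size=(256, dim))
    v = counts @ proj
    return v / np.linalg.norm(v)

# 1. Embed documents and 2. index them (a plain matrix stands in for the DB).
docs = ["embeddings map text to vectors",
        "paris is the capital of france",
        "cosine similarity compares vector angles"]
index = np.stack([embed(d) for d in docs])

# 3. Embed the query and 4. retrieve the top-k most similar chunks.
query = "how do embeddings turn text into vectors?"
scores = index @ embed(query)
top_k = np.argsort(-scores)[:2]
context = [docs[i] for i in top_k]
# 5./6. A reranker would reorder `context` before it is placed into the
# LLM prompt; both stages are out of scope for this sketch.
```

With a real embedding model, the retrieved chunks would be the semantically closest documents rather than the ones sharing character statistics.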

Dimensionality Reduction

Techniques like PCA (Principal Component Analysis) project high-dimensional embeddings down to lower dimensions for storage efficiency, while t-SNE is used mainly for 2D or 3D visualization. Reducing stored dimensions can provide 10–100x storage savings and faster search while approximately preserving relative distances between points. 11)
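PCA reduces to matrix operations: center the data, take the SVD, and project onto the top principal components. A sketch on random toy "embeddings" (200 vectors of 64 dimensions, reduced to 2 for plotting):

```python
import numpy as np

# Toy embedding matrix: 200 vectors, 64 dimensions each.
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 64))

X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_2d = X_centered @ Vt[:2].T   # project onto the top-2 principal components

print(X_2d.shape)  # (200, 2)
```

Because singular values come back in descending order, the first projected coordinate captures the most variance, the second the next most, and so on; keeping more components trades storage for fidelity.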

Fine-Tuning Embeddings

General-purpose embedding models can be fine-tuned on domain-specific data using contrastive loss on pairs of similar and dissimilar examples. This adapts models like SBERT for specialized tasks in domains such as legal or medical search, improving performance on domain-specific benchmarks.
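One common form of the contrastive loss mentioned above penalizes distance between similar pairs and closeness (inside a margin) between dissimilar ones. The vectors and margin below are toy choices for illustration:

```python
import numpy as np

def contrastive_loss(a, b, similar, margin=1.0):
    # Pull similar pairs together, push dissimilar pairs apart up to `margin`.
    d = np.linalg.norm(a - b)
    if similar:
        return d ** 2                 # similar pair: penalize any distance
    return max(0.0, margin - d) ** 2  # dissimilar pair: penalize closeness

anchor = np.array([1.0, 0.0])
close = np.array([0.9, 0.1])
far = np.array([-1.0, 0.0])

print(contrastive_loss(anchor, close, similar=True))   # small positive value
print(contrastive_loss(anchor, far, similar=False))    # 0.0, already past margin
```

During fine-tuning this loss is minimized over labeled domain pairs (e.g. matching legal queries and passages), which reshapes the embedding space around the distinctions that matter in that domain.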

Benchmarks

The MTEB (Massive Text Embedding Benchmark) evaluates embedding models across approximately 56 tasks including retrieval, clustering, semantic textual similarity, and classification. Current leaderboards are hosted at HuggingFace. 12)

Applications

See Also

References