Vector embeddings are dense numerical representations — typically lists of floating-point numbers in high-dimensional space (256–4096 dimensions) — that encode complex data like words, sentences, images, or audio while preserving semantic relationships and enabling mathematical operations. 1)
Mathematically, an embedding for input x is a function f(x) = v where v is a vector in d-dimensional real space, learned via neural networks to position similar items close together. Each dimension encodes latent features, allowing operations like vector arithmetic to capture analogies such as king - man + woman ≈ queen. 2)
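As a toy illustration of that arithmetic, the sketch below uses made-up 4-dimensional vectors (real embeddings are learned by a model and far larger) and cosine similarity to check that the analogy result lands near the expected word:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" for illustration only; real vectors are
# learned by a model and have hundreds or thousands of dimensions.
king  = np.array([0.9, 0.8, 0.1, 0.7])
man   = np.array([0.1, 0.9, 0.1, 0.6])
woman = np.array([0.1, 0.9, 0.9, 0.6])
queen = np.array([0.9, 0.8, 0.9, 0.7])

# Vector arithmetic: king - man + woman should land near queen.
analogy = king - man + woman
print(cosine_similarity(analogy, queen))  # ~1.0 for these toy vectors
```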
Introduced in the 2013 paper “Efficient Estimation of Word Representations in Vector Space” by Mikolov et al., Word2Vec trains shallow neural networks on word co-occurrences, predicting a word from its surrounding context (CBOW) or the context from a word (skip-gram). 3)
Word2Vec demonstrated that word relationships could be captured geometrically, establishing the foundation for all subsequent embedding work.
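A minimal training sketch using the gensim library (the toy corpus and hyperparameters below are illustrative, not taken from the original paper):

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus: each sentence is a list of tokens.
corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks", "in", "the", "city"],
    ["the", "woman", "walks", "in", "the", "city"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,  # embedding dimensionality
    window=2,        # context window for co-occurrence
    min_count=1,     # keep every word in this tiny corpus
    sg=1,            # 1 = skip-gram, 0 = CBOW
    epochs=100,
)

# Each word now has a dense vector; words seen in similar contexts end up nearby.
print(model.wv["king"].shape)                 # (50,)
print(model.wv.most_similar("king", topn=3))  # nearest neighbours in the space
```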
GloVe (Global Vectors), introduced by Pennington et al. in 2014, builds on co-occurrence statistics via matrix factorization. It optimizes log-bilinear models on global word-word co-occurrence matrices, achieving faster training and strong analogy performance compared to prediction-based methods. 4)
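For reference, the weighted least-squares objective minimized by GloVe over the co-occurrence matrix X (as given in the GloVe paper) is:

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}
$$

where $w_i$ and $\tilde{w}_j$ are word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, $V$ is the vocabulary size, and $f$ down-weights rare co-occurrences.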
Sentence-BERT (SBERT), introduced by Reimers and Gurevych in 2019, fine-tunes BERT with siamese and triplet networks using cosine similarity on sentence pairs. This enables efficient sentence-level embeddings (typically 768 dimensions) through pooling strategies like mean pooling or CLS token extraction. 5)
SBERT dramatically reduced the cost of finding similar sentences, from tens of hours of pairwise BERT inference over a large collection to a few seconds of vector comparisons, by producing fixed-size vectors that can be compared directly.
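A usage sketch with the sentence-transformers library; the model name all-MiniLM-L6-v2 is just one example checkpoint (it outputs 384-dimensional vectors rather than 768):

```python
from sentence_transformers import SentenceTransformer, util

# Example checkpoint; any SBERT-style model from the Hugging Face Hub works here.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "A man is playing a guitar.",
    "Someone is strumming an instrument.",
    "The stock market fell sharply today.",
]

# One fixed-size vector per sentence (mean pooling over token embeddings).
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities computed directly on the vectors.
scores = util.cos_sim(embeddings, embeddings)
print(scores)  # the first two sentences should score far higher than the third
```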
Similarity between embeddings is measured using distance metrics in the embedding space; the most common choices are cosine similarity, dot product, and Euclidean (L2) distance.
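A hand-rolled sketch of the three metrics using NumPy only, assuming a and b are 1-D embedding vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    # Angle-based: ignores vector length; range [-1, 1], higher = more similar.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dot_product(a, b):
    # Length-sensitive: equals cosine similarity when both vectors are L2-normalized.
    return np.dot(a, b)

def euclidean_distance(a, b):
    # Straight-line (L2) distance: lower = more similar.
    return np.linalg.norm(a - b)

a = np.array([0.2, 0.8, 0.5])
b = np.array([0.1, 0.9, 0.4])
print(cosine_similarity(a, b), dot_product(a, b), euclidean_distance(a, b))
```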
Vector databases store and query embeddings at scale using approximate nearest neighbor (ANN) indexes such as HNSW (Hierarchical Navigable Small World) and IVF (Inverted File Index). These systems can handle billions of vectors and typically support metadata filtering and scalar quantization for storage compression. 9)
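A small ANN sketch with FAISS using an HNSW index; the dimensionality, dataset size, and graph parameter below are illustrative choices rather than recommendations:

```python
import faiss
import numpy as np

d = 384                                                # embedding dimensionality
corpus = np.random.rand(10_000, d).astype("float32")   # stand-in for document embeddings
query = np.random.rand(1, d).astype("float32")

index = faiss.IndexHNSWFlat(d, 32)  # 32 = neighbours per node in the HNSW graph
index.add(corpus)                   # build the graph over all corpus vectors

distances, ids = index.search(query, 5)  # top-5 approximate nearest neighbours
print(ids[0], distances[0])
```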
In Retrieval-Augmented Generation (RAG), embeddings encode both queries and documents. Similarity search retrieves the top-k most relevant chunks from vector databases to ground LLM responses, reducing hallucinations. 10)
The typical RAG pipeline follows: embed documents → index in vector database → embed query → retrieve similar chunks → rerank results → generate response with retrieved context.
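A minimal retrieval sketch of that pipeline; reranking and the LLM call are omitted, and the model name, chunks, and prompt template are illustrative assumptions:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

# 1. Embed documents (already split into chunks) and keep the vectors as the index.
chunks = [
    "Word2Vec was introduced by Mikolov et al. in 2013.",
    "HNSW is a graph-based approximate nearest neighbour index.",
    "GloVe is trained on global word co-occurrence statistics.",
]
chunk_vectors = model.encode(chunks, normalize_embeddings=True)

# 2. Embed the query with the same model.
query = "Which index structure is used for approximate nearest neighbour search?"
query_vector = model.encode([query], normalize_embeddings=True)[0]

# 3. Retrieve the top-k chunks by cosine similarity (dot product on normalized vectors).
scores = chunk_vectors @ query_vector
top_k = np.argsort(-scores)[:2]
context = "\n".join(chunks[i] for i in top_k)

# 4. The retrieved context is passed to an LLM along with the question.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)
```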
Techniques like PCA (Principal Component Analysis) or t-SNE project high-dimensional embeddings down to lower dimensions, with PCA commonly used for storage and search efficiency and t-SNE mainly for 2D/3D visualization. Dimensionality reduction can provide 10–100x storage savings and faster search while approximately preserving relative distances between points. 11)
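A reduction sketch with scikit-learn's PCA; the random placeholder embeddings and the 768-to-64 reduction are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.rand(1000, 768)  # placeholder for real embedding vectors

pca = PCA(n_components=64)
reduced = pca.fit_transform(embeddings)  # shape (1000, 64): 12x smaller

# Fraction of the original variance retained by the 64 components.
print(reduced.shape, pca.explained_variance_ratio_.sum())
```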
General-purpose embedding models can be fine-tuned on domain-specific data using contrastive loss on pairs of similar and dissimilar examples. This adapts models like SBERT for specialized tasks in domains such as legal or medical search, improving performance on domain-specific benchmarks.
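A fine-tuning sketch using the classic sentence-transformers training loop with a contrastive loss; the model name, example pairs, and hyperparameters are assumptions for illustration (newer library versions also provide a Trainer-based API):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # example base model to adapt

# Pairs labelled 1 (similar) or 0 (dissimilar); real fine-tuning needs far more data.
train_examples = [
    InputExample(texts=["statute of limitations", "deadline for filing a claim"], label=1),
    InputExample(texts=["statute of limitations", "corporate merger agreement"], label=0),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.ContrastiveLoss(model)  # pulls similar pairs together, pushes dissimilar apart

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)
```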
The MTEB (Massive Text Embedding Benchmark) evaluates embedding models across dozens of datasets grouped into task types including retrieval, clustering, semantic textual similarity, and classification. Current leaderboards are hosted at HuggingFace. 12)
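A sketch of running a single MTEB task with the mteb package; the exact API has evolved across releases, so the task name and call signature below are assumptions based on the earlier interface:

```python
from sentence_transformers import SentenceTransformer
from mteb import MTEB

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model to evaluate

# Evaluate on a single classification task (name assumed from the MTEB task list).
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results")
print(results)
```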