Embeddings

Embeddings are dense vector representations that capture the semantic meaning of text, images, and other data in a continuous high-dimensional space. They are the foundation of semantic search, retrieval-augmented generation, clustering, and classification in AI agent systems. Choosing the right embedding model directly impacts retrieval quality, agent accuracy, and operational costs.

How Embeddings Work

Embedding models transform input data into fixed-size numerical vectors $\mathbf{x} \in \mathbb{R}^d$ where semantically similar items are positioned close together in vector space. The similarity between two embeddings is typically measured using cosine similarity:

$$\text{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{||\mathbf{a}|| \cdot ||\mathbf{b}||}$$

This enables semantic operations to be expressed as vector arithmetic: nearest-neighbor search, clustering, deduplication, and classification all reduce to similarity computations over embeddings.
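
As a concrete illustration, here is the cosine formula above computed directly in NumPy on two toy 4-dimensional vectors (the values are arbitrary and chosen only for the example):

import numpy as np

# Two toy "embeddings" with arbitrary example values
a = np.array([0.2, 0.8, 0.1, 0.5])
b = np.array([0.3, 0.7, 0.0, 0.6])

# Cosine similarity: dot product normalized by the vectors' magnitudes
sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {sim:.3f}")  # ~0.98, i.e. nearly parallel vectors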

Text Embedding Models

| Model | Provider | Dimensions | MTEB Score | Best For |
|---|---|---|---|---|
| text-embedding-3-large | OpenAI | 3072 | ~70% | General-purpose agent retrieval |
| text-embedding-3-small | OpenAI | 1536 | ~65% | Cost-effective applications |
| Embed v4 | Cohere | 1024-4096 | ~68% | Multilingual enterprise agents |
| BGE-M3 | BAAI | 768-1024 | ~68% | Open-source RAG, cost-sensitive |
| nomic-embed-text | Nomic | 768 | ~66% | Low-latency, edge deployment |
| Voyage 3 | Voyage AI | 1024 | ~69% | Long-context retrieval |
| jina-embeddings-v3 | Jina AI | 1024 | ~67% | Multilingual, code embeddings |

Example: Embedding Pipeline for Agents

import numpy as np
from openai import OpenAI
 
client = OpenAI()
 
def embed_texts(texts: list[str], model: str = "text-embedding-3-large") -> np.ndarray:
    """Embed a batch of texts using OpenAI API."""
    response = client.embeddings.create(input=texts, model=model)
    return np.array([item.embedding for item in response.data])
 
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
 
# Embed documents and query
documents = [
    "RAG combines retrieval with generation for grounded responses",
    "Fine-tuning adapts model weights on domain-specific data",
    "Prompt engineering designs inputs to guide model behavior"
]
doc_embeddings = embed_texts(documents)
query_embedding = embed_texts(["How do I ground agent responses in facts?"])[0]
 
# Find most relevant document
similarities = [cosine_similarity(query_embedding, emb) for emb in doc_embeddings]
best_match = documents[np.argmax(similarities)]
print(f"Most relevant: {best_match}")  # RAG document

Dimensionality Considerations

The number of dimensions $d$ in an embedding determines the trade-off between semantic precision and computational cost: higher-dimensional vectors capture finer semantic distinctions but cost more to store, transfer, and search, while lower-dimensional vectors are cheaper and faster but may conflate meanings that should stay separate.

Practical guidance: Start with medium dimensions (768-1024). Only scale up if retrieval quality benchmarks show meaningful improvement. Use dimensionality reduction (PCA, Matryoshka embeddings) to test whether lower dimensions maintain acceptable recall.
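
One way to run that test is to request reduced dimensions at embedding time. The sketch below assumes OpenAI's text-embedding-3 models, which accept a dimensions parameter for Matryoshka-style truncation; the query string and the 256-dimension choice are placeholders for your own evaluation data:

from openai import OpenAI

client = OpenAI()

def embed_at_dim(texts: list[str], dim: int) -> list[list[float]]:
    """Embed texts at reduced dimensionality via the dimensions parameter."""
    response = client.embeddings.create(
        input=texts,
        model="text-embedding-3-large",
        dimensions=dim,  # Matryoshka-style truncation
    )
    return [item.embedding for item in response.data]

# Re-run your retrieval benchmark at 256 vs. the full 3072 dimensions
# and keep the smaller size only if recall stays acceptable
reduced = embed_at_dim(["How do I ground agent responses in facts?"], dim=256)
print(len(reduced[0]))  # 256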

Multi-Modal Embeddings

Multi-modal embedding models project different data types (text, images, audio) into a shared vector space $\mathbb{R}^d$, so that semantically related items land near each other regardless of modality.

Cross-modal search enables agents to find relevant images from text queries or match text descriptions to visual content.
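
As a sketch, the snippet below uses the CLIP checkpoint exposed through sentence-transformers; the model name clip-ViT-B-32 and the image path are illustrative choices, and any CLIP-style model with a shared text/image space works the same way:

from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP encodes text and images into the same vector space
model = SentenceTransformer("clip-ViT-B-32")

image_emb = model.encode(Image.open("dog.jpg"))  # hypothetical image file
text_embs = model.encode([
    "a photo of a dog",
    "a diagram of a database schema",
])

# Cosine similarity is meaningful across modalities in the shared space
scores = util.cos_sim(image_emb, text_embs)
print(scores)  # the dog caption should score noticeably higher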

MTEB Benchmark

The Massive Text Embedding Benchmark (MTEB) evaluates embedding models across 56+ tasks spanning retrieval, classification, clustering, reranking, and semantic similarity. Its public leaderboard is the standard starting point for comparing models.
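
Running a model against a benchmark task is straightforward with the mteb package; this is a minimal sketch, where the task name SciFact is just one retrieval task from the suite (the exact API varies somewhat across mteb versions):

from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Evaluate an open-source embedding model on a single retrieval task
model = SentenceTransformer("BAAI/bge-m3")
evaluation = MTEB(tasks=["SciFact"])
results = evaluation.run(model, output_folder="mteb_results")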

Embedding Selection Criteria

Model quality is only part of the selection decision; how embeddings will be stored and queried matters too. Embeddings are stored and queried in vector databases using approximate nearest neighbor (ANN) algorithms. The most common distance metrics are:

- Cosine similarity, which compares vector direction and ignores magnitude
- Dot (inner) product, which is equivalent to cosine similarity for unit-normalized vectors
- Euclidean (L2) distance, which compares absolute positions in the space

Key ANN implementations include (a minimal FAISS example follows this list):

- HNSW (hierarchical navigable small world graphs), the default in many vector databases
- IVF (inverted file) indexes, often combined with product quantization (PQ), as in FAISS
- DiskANN and ScaNN for billion-scale or memory-constrained workloads
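
The sketch below builds an HNSW index with FAISS over random stand-in vectors; the dimensionality, neighbor count, and efConstruction values are illustrative starting points, not tuned recommendations:

import faiss
import numpy as np

d = 1024  # embedding dimensionality
xb = np.random.rand(10_000, d).astype("float32")  # stand-in for real embeddings

# HNSW graph index; 32 neighbors per node is a common starting point
index = faiss.IndexHNSWFlat(d, 32)
index.hnsw.efConstruction = 200  # build-time quality/speed trade-off
index.add(xb)

# Approximate 5 nearest neighbors for one query vector (L2 distance by default)
xq = np.random.rand(1, d).astype("float32")
distances, ids = index.search(xq, 5)
print(ids)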
