Embeddings

Embeddings (also called vector embeddings) are dense vector representations that capture the semantic meaning of text, images, and other data in a continuous high-dimensional space. They are the foundation of semantic search, retrieval-augmented generation, clustering, and classification in AI agent systems. Choosing the right embedding model directly impacts retrieval quality, agent accuracy, and operational costs.

Vector embeddings are dense numerical representations — typically lists of floating-point numbers in high-dimensional space (256–4096 dimensions) — that encode complex data like words, sentences, images, or audio while preserving semantic relationships and enabling mathematical operations. 1)

Definition and Fundamentals

A vector embedding maps data from its original representation into a vector space of fixed dimensionality, typically ranging from 50 to several thousand dimensions depending on the model and use case. Each dimension captures learned features or patterns from the training data. The key property of well-constructed embeddings is that semantically similar items are positioned closer together in the vector space, while dissimilar items are farther apart.

Embedding models transform input data into fixed-size numerical vectors where semantically similar items are positioned close together in vector space. Mathematically, an embedding is a function $f(x) = \mathbf{v}$ that maps an input $x$ to a vector $\mathbf{v} \in \mathbb{R}^d$, learned via neural networks so that similar inputs map to nearby vectors. Each dimension encodes latent features, allowing operations like vector arithmetic to capture analogies such as king - man + woman ≈ queen. 2)
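As a hedged sketch of this vector arithmetic, the following uses gensim with a pretrained word2vec model; the file name GoogleNews-vectors-negative300.bin is an assumption standing in for whatever pretrained vectors are available locally:

```python
from gensim.models import KeyedVectors

# Load pretrained word vectors. The path and binary format are assumptions
# for this sketch (the classic GoogleNews word2vec release uses this format).
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# Vector arithmetic: king - man + woman should land near "queen".
result = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # e.g. [('queen', 0.71...)]
```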

The dimensionality of embeddings represents a tradeoff between representational capacity and computational efficiency. Modern large language models typically use embeddings with 768 to 4096 dimensions, though this varies based on model architecture and specific applications 3).
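The efficiency side of this tradeoff shows up directly in storage and search cost. A back-of-the-envelope calculation (the helper function below is illustrative, not from any library):

```python
def index_size_gb(num_vectors: int, dims: int, bytes_per_float: int = 4) -> float:
    """Raw storage for a flat float32 vector index, ignoring metadata and index overhead."""
    return num_vectors * dims * bytes_per_float / 1e9

# One million stored embeddings: a 768-d index needs ~3 GB of raw float32 data,
# while a 4096-d index needs ~16 GB, before any index structure overhead.
for d in (768, 1536, 4096):
    print(f"{d:>5} dims -> {index_size_gb(1_000_000, d):.1f} GB")
```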

How Embeddings Work

The similarity between two embeddings is typically measured using distance metrics such as cosine similarity, Euclidean distance, or dot product. Cosine similarity is most commonly used:

$$\text{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{||\mathbf{a}|| \cdot ||\mathbf{b}||}$$
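A direct implementation of this formula with NumPy (the toy 3-dimensional vectors are for illustration only; real embeddings have hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.2, 0.8, 0.1])
b = np.array([0.25, 0.75, 0.05])
print(cosine_similarity(a, b))  # close to 1.0 for these near-parallel vectors
```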

This enables:

* Semantic search: ranking documents by embedding similarity to a query (sketched below)
* Retrieval-augmented generation: fetching relevant context to ground a language model's answers
* Clustering: grouping items whose embeddings lie near each other in the vector space
* Classification: using embeddings as input features for downstream classifiers
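A minimal brute-force sketch of the semantic-search case, assuming all embeddings come from the same model (the random vectors here stand in for real model output):

```python
import numpy as np

def search(query_vec: np.ndarray, doc_vecs: np.ndarray, top_k: int = 3) -> list[int]:
    """Return indices of the top_k documents most similar to the query.

    Assumes doc_vecs is an (N, d) matrix of document embeddings and query_vec
    is a single d-dimensional query embedding from the same model.
    """
    # Normalize rows so the dot product equals cosine similarity.
    doc_norm = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_norm = query_vec / np.linalg.norm(query_vec)
    scores = doc_norm @ query_norm
    return np.argsort(scores)[::-1][:top_k].tolist()

# Toy corpus of random "embeddings" standing in for real model output.
rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 64))
query = docs[42] + rng.normal(scale=0.01, size=64)  # near-duplicate of doc 42
print(search(query, docs))  # doc 42 should rank first
```

At production scale, this exhaustive scan is replaced by approximate nearest-neighbor indexes, but the similarity computation remains the same.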

Generation and Training Methods

Word and Document Embeddings

Word2Vec, introduced in the 2013 paper “Efficient Estimation of Word Representations in Vector Space” by Mikolov et al., trains shallow neural networks on word co-occurrences. 4)

Word2Vec demonstrated that word relationships could be captured geometrically, laying the groundwork for subsequent embedding methods.
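A minimal training sketch using gensim's Word2Vec implementation; the toy corpus is far too small to learn meaningful vectors and is only there to show the API shape:

```python
from gensim.models import Word2Vec

# Toy corpus: in practice Word2Vec needs millions of tokens to learn useful vectors.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

# vector_size sets the embedding dimensionality; window is the context size;
# sg=1 selects the skip-gram objective (sg=0 would be CBOW).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(model.wv["king"].shape)               # (50,)
print(model.wv.similarity("king", "queen"))  # cosine similarity between word vectors
```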

GloVe (Global Vectors), introduced by Pennington et al. in 2014, combines global matrix factorization with local context window methods to create embeddings that capture both global and local co-occurrence statistics.

Transformer-based Embeddings

Modern approaches use transformer models to generate contextual embeddings, where the same word's representation varies based on its surrounding context. Models such as BERT and GPT, along with specialized embedding models, produce high-quality representations suitable for semantic search and similarity tasks 6).
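A hedged example using the sentence-transformers library; the checkpoint name all-MiniLM-L6-v2 is one common public choice, not a requirement:

```python
from sentence_transformers import SentenceTransformer, util

# Any sentence-embedding checkpoint works here; this one produces 384-d vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "How do I reset my password?",
    "Steps to recover a forgotten login credential",
    "The weather in Paris is lovely in spring",
]
embeddings = model.encode(sentences)  # shape: (3, 384) for this model

# The two password-related sentences should score much closer to each other
# than either does to the weather sentence.
print(util.cos_sim(embeddings, embeddings))
```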

Multimodal Embeddings

Systems like CLIP generate embeddings that unify multiple data modalities (such as text and images) in a shared vector space, enabling cross-modal similarity search and comparison.
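A sketch of cross-modal comparison using the Hugging Face transformers CLIP classes; the checkpoint name is a commonly used public release, and the image path photo.jpg is a placeholder assumption:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# "openai/clip-vit-base-patch32" is a commonly used public CLIP checkpoint.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # placeholder path for this sketch
texts = ["a photo of a cat", "a photo of a dog"]

image_inputs = processor(images=image, return_tensors="pt")
text_inputs = processor(text=texts, return_tensors="pt", padding=True)

with torch.no_grad():
    img_emb = model.get_image_features(**image_inputs)  # shape (1, 512)
    txt_emb = model.get_text_features(**text_inputs)    # shape (2, 512)

# Both modalities land in the same 512-d space, so after normalization
# a plain dot product gives cross-modal cosine similarity.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
print(img_emb @ txt_emb.T)  # similarity of the image to each caption
```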

See Also

References