====== Embeddings ======

Embeddings are dense vector representations that capture the semantic meaning of text, images, and other data in a continuous high-dimensional space. They are the foundation of semantic search, retrieval-augmented generation, clustering, and classification in AI agent systems. Choosing the right embedding model directly impacts retrieval quality, agent accuracy, and operational cost.

===== How Embeddings Work =====

Embedding models transform input data into fixed-size numerical vectors $\mathbf{x} \in \mathbb{R}^d$ in which semantically similar items are positioned close together. The similarity between two embeddings is typically measured using cosine similarity:

$$\text{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{||\mathbf{a}|| \cdot ||\mathbf{b}||}$$

This enables:

  * **Semantic search** — Find relevant documents by meaning rather than keyword matching
  * **RAG retrieval** — Power the retrieval stage of [[retrieval_augmented_generation|retrieval-augmented generation]]
  * **Clustering** — Group similar items for analysis and organization
  * **Classification** — Use vector proximity for categorization tasks
  * **Anomaly detection** — Identify outliers in embedding space

===== Text Embedding Models =====

^ Model ^ Provider ^ Dimensions ^ MTEB Score ^ Best For ^
| text-embedding-3-large | OpenAI | 3072 | ~70% | General-purpose agent retrieval |
| text-embedding-3-small | OpenAI | 1536 | ~65% | Cost-effective applications |
| Embed v4 | Cohere | 1024-4096 | ~68% | Multilingual enterprise agents |
| BGE-M3 | BAAI | 768-1024 | ~68% | Open-source RAG, cost-sensitive |
| nomic-embed-text | Nomic | 768 | ~66% | Low-latency, edge deployment |
| Voyage 3 | Voyage AI | 1024 | ~69% | Long-context retrieval |
| jina-embeddings-v3 | Jina AI | 1024 | ~67% | Multilingual, code embeddings |

===== Example: Embedding Pipeline for Agents =====

<code python>
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed_texts(texts: list[str], model="text-embedding-3-large") -> np.ndarray:
    """Embed a batch of texts using the OpenAI API."""
    response = client.embeddings.create(input=texts, model=model)
    return np.array([item.embedding for item in response.data])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Embed documents and query
documents = [
    "RAG combines retrieval with generation for grounded responses",
    "Fine-tuning adapts model weights on domain-specific data",
    "Prompt engineering designs inputs to guide model behavior",
]
doc_embeddings = embed_texts(documents)
query_embedding = embed_texts(["How do I ground agent responses in facts?"])[0]

# Find the most relevant document
similarities = [cosine_similarity(query_embedding, doc) for doc in doc_embeddings]
best_match = documents[int(np.argmax(similarities))]
print(f"Most relevant: {best_match}")  # the RAG document
</code>

===== Dimensionality Considerations =====

The number of dimensions $d$ in an embedding governs the trade-off between semantic precision and computational cost:

  * **Higher dimensions ($d = 2048$-$3072$)** — Capture more nuanced semantic distinctions but require more storage, memory, and compute for similarity search
  * **Medium dimensions ($d = 768$-$1024$)** — The sweet spot for most agent applications, balancing quality and efficiency
  * **Lower dimensions ($d = 256$-$512$)** — Suitable for large-scale applications where speed and cost are prioritized over precision

**Practical guidance:** Start with medium dimensions (768-1024). Only scale up if retrieval-quality benchmarks show a meaningful improvement. Use dimensionality reduction (PCA, Matryoshka embeddings) to test whether lower dimensions maintain acceptable recall.
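One concrete way to run that recall test is to truncate embeddings to their first $d$ dimensions, re-normalize, and compare the truncated top-k ranking against the full-dimension ranking. The sketch below uses synthetic random vectors purely to demonstrate the measurement; random vectors lack the Matryoshka training structure that concentrates signal in the leading dimensions, so for a real decision you would run this on embeddings from your actual model and corpus. The `top_k` helper is illustrative, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(m: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot product equals cosine similarity."""
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

# Stand-in corpus and query embeddings (replace with real model output)
docs = normalize(rng.normal(size=(1000, 1024)))
query = normalize(rng.normal(size=(1, 1024)))[0]

def top_k(query: np.ndarray, docs: np.ndarray, d: int, k: int = 10) -> set[int]:
    """Rank documents using only the first d dimensions, re-normalized."""
    q = query[:d] / np.linalg.norm(query[:d])
    m = normalize(docs[:, :d])
    scores = m @ q  # cosine similarity via dot product on unit vectors
    return set(np.argsort(-scores)[:k])

full = top_k(query, docs, d=1024)
for d in (256, 512, 768):
    overlap = len(full & top_k(query, docs, d)) / 10
    print(f"d={d}: top-10 overlap with full dimension = {overlap:.0%}")
```

If the overlap (a cheap proxy for recall) stays high at a smaller $d$ on your own data, the reduced dimension is likely safe to deploy.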
===== Multi-Modal Embeddings =====

Multi-modal embedding models project different data types (text, images, audio) into a shared vector space $\mathbb{R}^d$, enabling cross-modal search:

  * **CLIP and variants** — Align image and text embeddings for visual search
  * **ImageBind (Meta)** — Unified embeddings across text, image, audio, video, depth, and thermal data
  * **Jina CLIP v2** — Open-source text-image embeddings with multilingual support
  * **Cohere Embed v4** — Handles text, images, and mixed documents in a single model

Cross-modal search lets agents find relevant images from text queries or match text descriptions to visual content.

===== MTEB Benchmark =====

The [[https://huggingface.co/spaces/mteb/leaderboard|Massive Text Embedding Benchmark (MTEB)]] evaluates embedding models across 56+ tasks spanning retrieval, classification, clustering, reranking, and semantic similarity. Key findings for 2026:

  * **Top proprietary:** OpenAI text-embedding-3-large (~70%), Voyage 3 (~69%)
  * **Top open-source:** BGE-M3 (~68%), Jina v3 (~67%), Nomic (~66%)
  * **Real-world gap:** Models that score well on MTEB may underperform on specific domains — always evaluate on your own data
  * **Non-English:** Cohere and BGE excel on multilingual benchmarks

===== Embedding Selection Criteria =====

  * **Task fit** — Retrieval-optimized models (BGE, Voyage) for RAG; general models (OpenAI) for broad applications
  * **Cost/latency** — Open-source models for high-volume agents; proprietary models for peak accuracy
  * **Domain/language** — Multilingual models (Cohere, BGE-M3) for global agents; domain-specific fine-tuned models for specialized retrieval
  * **Infrastructure** — Decide whether you need API-based (OpenAI, Cohere) or self-hosted (BGE, Nomic) models
  * **Dimensionality** — Higher dimensions for precision-critical applications; lower for scale and speed

===== Vector Similarity Search =====

Embeddings are stored and queried in vector databases using [[approximate_nearest_neighbors|approximate nearest neighbor]] (ANN) algorithms. The most common distance metrics are:

  * **Cosine similarity**: $\text{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{||\mathbf{a}|| \cdot ||\mathbf{b}||}$ — measures angular closeness, invariant to magnitude
  * **Euclidean distance (L2)**: $d(\mathbf{a}, \mathbf{b}) = ||\mathbf{a} - \mathbf{b}||_2 = \sqrt{\sum_{i=1}^{d}(a_i - b_i)^2}$ — measures absolute distance in vector space
  * **Dot product**: $\langle \mathbf{a}, \mathbf{b} \rangle = \sum_{i=1}^{d} a_i b_i$ — used for [[maximum_inner_product_search|MIPS]] when magnitudes carry meaning

Key ANN implementations include:

  * **[[hnsw_graphs|HNSW]]** (Hierarchical Navigable Small World) — Best recall/speed trade-off, used by most vector databases
  * **IVF** (Inverted File Index) — Good for very large collections with acceptable recall trade-offs
  * **[[faiss|FAISS]]** — Meta's library for efficient similarity search at scale, with GPU acceleration support

===== References =====

  * [[https://huggingface.co/spaces/mteb/leaderboard|MTEB Leaderboard]]
  * [[https://platform.openai.com/docs/guides/embeddings|OpenAI Embeddings Guide]]
  * [[https://www.stack-ai.com/insights/best-embedding-models-for-rag-in-2026-a-comparison-guide|Best Embedding Models for RAG 2026]]

===== See Also =====

  * [[retrieval_augmented_generation]] — RAG systems powered by embeddings
  * [[knowledge_graphs]] — Graph representations complementing vector search
  * [[agent_memory_frameworks]] — Memory systems using embedding-based retrieval
  * [[fine_tuning_agents]] — Fine-tuning embedding models for domain-specific tasks
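As a closing note on the distance metrics listed under Vector Similarity Search: on unit-normalized vectors the three metrics are interchangeable for ranking, because $||\mathbf{a} - \mathbf{b}||^2 = 2 - 2\,\mathbf{a} \cdot \mathbf{b}$ when $||\mathbf{a}|| = ||\mathbf{b}|| = 1$, and cosine similarity equals the dot product. A minimal numerical check of this identity, not tied to any particular vector database:

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize(m: np.ndarray) -> np.ndarray:
    """Scale each row to unit length."""
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

docs = normalize(rng.normal(size=(100, 64)))  # toy unit-vector corpus
q = normalize(rng.normal(size=(1, 64)))[0]    # toy unit-vector query

cos = docs @ q                         # cosine == dot product on unit vectors
l2 = np.linalg.norm(docs - q, axis=1)  # Euclidean (L2) distance

# Since ||a - b||^2 = 2 - 2 a.b on unit vectors, descending-similarity order
# must match ascending-distance order:
assert np.array_equal(np.argsort(-cos), np.argsort(l2))
print("cosine and L2 produce identical rankings on unit vectors")
```

Practically, this is why many vector databases normalize embeddings at ingest: the cheapest metric the index supports (often dot product) can then be used without changing retrieval results.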