===== How Embeddings Work =====
Embedding models transform input data into fixed-size numerical vectors where semantically similar items are positioned close together in vector space, as measured by cosine similarity:

$$\text{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$$

This enables:
* **Semantic search** — Find relevant documents by meaning rather than keyword matching
def embed_texts(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts, returning one row vector per input text."""
    # "text-embedding-3-small" is an example model name; substitute your own.
    response = client.embeddings.create(input=texts, model="text-embedding-3-small")
    return np.array([item.embedding for item in response.data])
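Once documents are embedded, retrieval reduces to a cosine-similarity ranking over the returned vectors. A minimal numpy sketch, with made-up 4-dimensional toy vectors standing in for real model output:

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k documents most cosine-similar to the query."""
    # Normalize rows so a plain dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = d @ q
    return np.argsort(sims)[::-1][:k]

# Toy 4-dimensional embeddings (illustrative only).
docs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0]])
query = np.array([1.0, 0.05, 0.0, 0.0])
print(top_k(query, docs, k=2))  # → [0 1]
```

Normalizing once up front means the ranking step is a single matrix-vector product, which is also how most vector databases implement cosine scoring internally.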
===== Dimensionality Considerations =====
The number of dimensions $d$ in an embedding affects the trade-off between semantic precision and computational cost:

* **Higher dimensions ($d = 2048$–$3072$)** — Capture more nuanced semantic distinctions but require more storage, memory, and compute for similarity search
* **Medium dimensions ($d = 768$–$1024$)** — The sweet spot for most agent applications, balancing retrieval quality against storage and compute cost
* **Lower dimensions ($d = 256$–$512$)** — Suitable for large-scale applications where speed and cost are prioritized over precision
**Practical guidance:** Start with medium dimensions (768-1024). Only scale up if retrieval quality benchmarks show meaningful improvement. Use dimensionality reduction (PCA, Matryoshka embeddings) to test whether lower dimensions maintain acceptable recall.
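For Matryoshka-style models, testing a lower dimension can be as simple as truncating each vector and re-normalizing. A sketch of that check, using a random vector in place of a real embedding (the 1024 and 256 sizes are illustrative):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, d: int) -> np.ndarray:
    """Keep the first d dimensions and re-normalize to unit length.

    Only valid for Matryoshka-style models, whose leading dimensions are
    trained to carry the coarsest semantic information.
    """
    head = vec[:d]
    return head / np.linalg.norm(head)

# Random stand-in for a 1024-dimensional embedding.
full = np.random.default_rng(0).normal(size=1024)
full /= np.linalg.norm(full)

short = truncate_embedding(full, 256)
print(short.shape)  # → (256,)
```

Run your retrieval benchmark against the truncated vectors; if recall holds up, the smaller dimension can be adopted without re-embedding the corpus.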
===== Multi-Modal Embeddings =====
Multi-modal embedding models project different data types (text, images, audio) into a shared vector space $\mathbb{R}^d$, enabling cross-modal search:
* **CLIP and variants** — Align image and text embeddings for visual search
===== Vector Similarity Search =====
Embeddings are stored and queried in vector databases using [[approximate_nearest_neighbors|approximate nearest neighbor]] (ANN) algorithms. The most common distance metrics are:
* **Cosine similarity**: $\text{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$ — compares direction only, the usual default for text embeddings
* **Euclidean distance (L2)**: $d(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^{d} (a_i - b_i)^2}$ — sensitive to vector magnitude as well as direction
* **Dot product**: $\langle \mathbf{a}, \mathbf{b} \rangle = \sum_{i=1}^{d} a_i b_i$ — used for [[maximum_inner_product_search|MIPS]] when magnitudes carry meaning
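The difference between the three metrics shows up clearly on a toy pair of vectors that point in the same direction but differ in magnitude (values invented for illustration):

```python
import numpy as np

a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])  # same direction as a, twice the magnitude

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
l2 = np.linalg.norm(a - b)
dot = a @ b

print(cosine)  # → 1.0  (direction identical; magnitude ignored)
print(l2)      # → 5.0  (penalizes the magnitude gap)
print(dot)     # → 50.0 (rewards the larger magnitude)
```

If your embeddings are unit-normalized, cosine, L2, and dot product all produce the same ranking; the choice only matters when magnitudes are preserved.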
| + | |||
| + | Key ANN implementations include: | ||
| + | |||
| + | * **[[hnsw_graphs|HNSW]]** (Hierarchical Navigable Small World) — Best recall/ | ||
* **IVF** (Inverted File Index) — Good for very large collections with acceptable recall trade-offs
* **[[faiss|FAISS]]** — Meta's library for efficient similarity search at scale, supports GPU acceleration
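Exact brute-force search is what these indexes approximate, and it doubles as a recall baseline when tuning an ANN index. A numpy sketch over random unit vectors (corpus size, dimension, and data are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# 10,000 random 64-dimensional unit vectors standing in for a real corpus.
docs = rng.normal(size=(10_000, 64)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# A query that is a lightly perturbed copy of document 123.
query = docs[123] + 0.01 * rng.normal(size=64).astype(np.float32)
query /= np.linalg.norm(query)

# Exact search: cosine similarity against every vector, O(n * d).
sims = docs @ query
exact_top10 = np.argsort(sims)[::-1][:10]
print(123 in exact_top10)  # → True
```

An ANN index tuned against this baseline can then be scored by recall@10: the fraction of these exact neighbors it returns.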
===== References =====
* [[agent_memory_frameworks]] — Memory systems using embedding-based retrieval
* [[fine_tuning_agents]] — Fine-tuning embedding models for domain-specific tasks
| - | |||