====== Embeddings ======

Embeddings (also called **vector embeddings**) are dense vector representations that capture the semantic meaning of text, images, and other data in a continuous high-dimensional space. They are the foundation of [[semantic_search|semantic search]], retrieval-augmented generation, clustering, and classification in AI agent systems. Choosing the right embedding model directly impacts retrieval quality, agent accuracy, and operational costs.

Vector embeddings are dense numerical representations, typically lists of floating-point numbers in a high-dimensional space (256–4096 dimensions), that encode complex data like words, sentences, images, or audio while preserving semantic relationships and enabling mathematical operations. ((source [[https://www.yugabyte.com/key-concepts/what-is-vector-embedding/|Yugabyte: What is Vector Embedding]]))

===== Definition and Fundamentals =====

A vector embedding maps data from its original representation into a vector space of fixed dimensionality, typically ranging from 50 to several thousand dimensions depending on the model and use case. Each dimension captures learned features or patterns from the training data. The key property of well-constructed embeddings is that semantically similar items are positioned closer together in the vector space, while dissimilar items are farther apart.

Formally, an embedding model is a learned function $f(x) = \mathbf{v}$ that maps an input $x$ to a fixed-size vector $\mathbf{v} \in \mathbb{R}^d$, trained (typically via neural networks) so that semantically similar inputs are positioned close together in the vector space. Each dimension encodes latent features, allowing operations like vector arithmetic to capture analogies such as king - man + woman ≈ queen. ((source [[https://www.augustschools.com/blog/august-intelligence-embeddings/|August Intelligence: Embeddings]]))

The dimensionality of embeddings represents a tradeoff between representational capacity and computational efficiency. Modern large language models typically use embeddings with 768 to 4096 dimensions, though this varies based on model architecture and specific applications. (([[https://arxiv.org/abs/1810.04805|Devlin et al. - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2019)]]))

===== How Embeddings Work =====

The similarity between two embeddings is typically measured using distance metrics such as cosine similarity, Euclidean distance, or dot product. Cosine similarity is the most commonly used (a minimal computation of these metrics is sketched after the list below):

$$\text{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{||\mathbf{a}|| \cdot ||\mathbf{b}||}$$

This enables:

  * **[[semantic_search|Semantic search]]**: Find relevant documents by meaning rather than exact keyword matching, using similarity-based retrieval (see the retrieval sketch below)
  * **RAG retrieval**: Power the retrieval stage of [[retrieval_augmented_generation|retrieval-augmented generation]]
  * **Clustering**: Group similar items for analysis and organization
  * **Classification**: Use vector proximity for categorization tasks
  * **Anomaly detection**: Identify outliers in embedding space
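As an illustration of the metrics above, the following is a minimal sketch using NumPy and made-up toy vectors (not output from a real embedding model) of cosine similarity alongside dot product and Euclidean distance:

<code python>
# Minimal sketch: similarity metrics between two embedding vectors.
# The vectors below are made-up toy values; real embeddings come from a model
# and typically have hundreds to thousands of dimensions.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """sim(a, b) = (a . b) / (||a|| * ||b||), as in the formula above."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.2, 0.8, -0.1, 0.4])  # toy 4-dimensional "embedding"
b = np.array([0.3, 0.7, -0.2, 0.5])

print("cosine similarity: ", cosine_similarity(a, b))        # near 1.0 => similar
print("dot product:       ", float(np.dot(a, b)))
print("euclidean distance:", float(np.linalg.norm(a - b)))   # small => similar
</code>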
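Building on that, here is a hedged sketch of the similarity-based retrieval mentioned in the list above. The document and query vectors are fabricated for illustration; in practice they would be produced by an embedding model and usually stored in a vector database rather than a Python dictionary.

<code python>
# Toy semantic-search ranking over pre-computed embeddings.
# All vectors and names are fabricated for illustration only.
import numpy as np

documents = {
    "doc_cats":    np.array([0.9, 0.1, 0.0]),
    "doc_dogs":    np.array([0.8, 0.2, 0.1]),
    "doc_finance": np.array([0.0, 0.1, 0.9]),
}
query = np.array([0.85, 0.15, 0.05])  # toy embedding of the user's query

def cosine_similarity(a, b):
    # Same cosine similarity as in the previous sketch.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank documents from most to least similar to the query.
ranked = sorted(documents.items(),
                key=lambda item: cosine_similarity(query, item[1]),
                reverse=True)

for name, vec in ranked:
    print(f"{name}: {cosine_similarity(query, vec):.3f}")
</code>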
===== Generation and Training Methods =====

==== Word and Document Embeddings ====

**Word2Vec**, introduced in the 2013 paper "Efficient Estimation of Word Representations in Vector Space" by Mikolov et al., trains shallow neural networks on word co-occurrences. ((source [[https://www.yugabyte.com/key-concepts/what-is-vector-embedding/|Yugabyte: What is Vector Embedding]])) (([[https://arxiv.org/abs/1301.3781|Mikolov et al. - Efficient Estimation of Word Representations in Vector Space (2013)]]))

  * **CBOW (Continuous Bag-of-Words)**: Predicts a target word from its context words by averaging the context vectors as input.
  * **Skip-gram**: Reverses the CBOW approach, predicting context words from the target word. It handles rare words better and uses negative sampling for efficiency. (([[https://arxiv.org/abs/1506.06726|Kiros et al. - Skip-Thought Vectors (2015)]]))

Word2Vec demonstrated that word relationships could be captured geometrically, establishing the foundation for subsequent embedding work.

**GloVe (Global Vectors)**, introduced by Pennington et al. in 2014, combines global matrix factorization with local context window methods to create embeddings that capture both global and local co-occurrence statistics.

==== Transformer-based Embeddings ====

Modern approaches use transformer models to generate contextual embeddings, where the same word's representation varies based on its surrounding context. Models such as BERT, GPT, and specialized embedding models produce high-quality representations suitable for semantic search and similarity tasks. (([[https://arxiv.org/abs/1810.04805|Devlin et al. - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2019)]]))

==== Multimodal Embeddings ====

Systems like CLIP generate embeddings that unify multiple data modalities (text and images), enabling cross-modal similarity search and comparison operations.

===== See Also =====

  * [[embedding_models_comparison|Embedding Models Comparison]]
  * [[embedding_layers|Embedding Layers]]
  * [[per_layer_embeddings|Per-Layer Embeddings (PLE)]]
  * [[vector_database_rag|Role of a Vector Database in AI RAG Architecture]]
  * [[text_rendering_in_images|Text Rendering in Images]]

===== References =====