====== ChromaDB ======

**ChromaDB** is an open-source, AI-native embedding database designed to make it easy to build LLM applications with embeddings. With over **27,000 GitHub stars**, it provides a simple API for storing, querying, and managing vector embeddings with metadata filtering, making it a popular choice for rapid prototyping and production RAG applications.

| **Repository** | [[https://github.com/chroma-core/chroma|github.com/chroma-core/chroma]] |
| **License** | Apache 2.0 |
| **Language** | Python, Rust |
| **Stars** | 27K+ |
| **Category** | Embedding Database |

===== Key Features =====

  * **Simple API** -- NumPy-like simplicity for adding, querying, and managing embeddings
  * **AI-Native Design** -- Purpose-built for embeddings and RAG workflows
  * **Pluggable Embedding Functions** -- Built-in support for OpenAI, Sentence Transformers, Cohere, Gemini, Jina AI, Ollama, and custom functions
  * **Hybrid Search** -- Vector similarity combined with metadata filtering, regex, and full-text search
  * **Multiple Deployment Modes** -- In-memory, persistent (embedded), and client-server
  * **Framework Integration** -- Native support for LangChain, LlamaIndex, and other RAG frameworks
  * **Metadata Arrays** -- Support for string, number, and boolean arrays in metadata filtering (added 2026)

===== Architecture =====

ChromaDB uses a collection-based structure separating vector storage, metadata storage, and embedding generation:

  * **Collection Layer** -- Logical groupings of embeddings, documents, IDs, and metadata
  * **Embedding Functions** -- Modular, pluggable components that generate vectors from text or images
  * **Vector Store** -- HNSW indexing for approximate nearest neighbor search with configurable distance metrics
  * **Metadata Store** -- Separate storage for structured metadata enabling efficient filtering
  * **Storage Backends** -- DuckDB+Parquet (default persistent), SQLite, or PostgreSQL

<code>
graph TB
    subgraph Client["Client Layer"]
        PyClient[Python Client]
        JSClient[JavaScript Client]
        HTTP[HTTP Client]
    end
    subgraph Core["ChromaDB Core"]
        Collections[Collection Manager]
        EF[Embedding Functions]
        QE[Query Engine]
    end
    subgraph Search["Search Layer"]
        HNSW[HNSW Vector Index]
        MetaFilter[Metadata Filter]
        FTS[Full-Text Search]
        Hybrid[Hybrid Ranker]
    end
    subgraph Storage["Storage Backends"]
        Memory[In-Memory]
        DuckDB[DuckDB + Parquet]
        SQLite[SQLite]
        PG[PostgreSQL]
    end
    subgraph Embeddings["Embedding Providers"]
        OpenAI[OpenAI]
        ST[Sentence Transformers]
        Cohere[Cohere]
        Custom[Custom Functions]
    end
    Client --> Core
    EF --> Embeddings
    Core --> Search
    Search --> Storage
</code>

===== Deployment Modes =====

ChromaDB supports three deployment modes for different use cases:

^ Mode ^ Description ^ Use Case ^
| **In-Memory** | Fully ephemeral, embedded in app | Prototyping, testing, MVPs |
| **Persistent (Embedded)** | Disk-based via DuckDB+Parquet | Local apps, development |
| **Client-Server** | HTTP API, multi-tenant, scalable | Production, distributed systems |

===== Metadata Filtering =====

ChromaDB combines vector similarity with exact metadata filters using ''where'' clauses:

  * Equality matching: ''{ "source": "wiki" }''
  * Range queries: ''{ "score": { "$gt": 0.8 } }''
  * Set membership: ''{ "topic": { "$in": ["ai", "ml"] } }''
  * Array metadata (2026): Complex multi-value filters on array fields

===== Code Example =====

<code python>
import chromadb
from chromadb.utils import embedding_functions

# Initialize persistent client
client = chromadb.PersistentClient(path="./chroma_db")

# Configure embedding function
ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-api-key",
    model_name="text-embedding-3-small"
)

# Create or get collection
collection = client.get_or_create_collection(
    name="documents",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"}
)

# Add documents with metadata
collection.add(
    documents=[
        "RAG combines retrieval with generation for grounded answers",
        "Vector databases store high-dimensional embeddings",
        "Knowledge graphs capture entity relationships"
    ],
    metadatas=[
        {"source": "tutorial", "topic": "rag"},
        {"source": "docs", "topic": "database"},
        {"source": "paper", "topic": "knowledge_graph"}
    ],
    ids=["doc1", "doc2", "doc3"]
)

# Query with metadata filter
results = collection.query(
    query_texts=["How does retrieval augmented generation work?"],
    n_results=3,
    where={"topic": {"$in": ["rag", "database"]}},
    include=["documents", "metadatas", "distances"]
)

for doc, meta, dist in zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0]
):
    print(f"[{dist:.4f}] ({meta['source']}) {doc[:60]}...")
</code>

===== References =====

  * [[https://github.com/chroma-core/chroma|ChromaDB GitHub Repository]]
  * [[https://www.trychroma.com|ChromaDB Official Website]]
  * [[https://docs.trychroma.com|ChromaDB Documentation]]

===== See Also =====

  * [[qdrant|Qdrant]] -- High-performance Rust vector database
  * [[milvus|Milvus]] -- Cloud-native vector database at scale
  * [[mem0|Mem0]] -- Memory layer supporting ChromaDB as backend
  * [[ragflow|RAGFlow]] -- RAG engine for document understanding
  * [[lightrag|LightRAG]] -- Knowledge graph RAG framework
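The ''where'' operators described in the Metadata Filtering section act as simple predicates evaluated over each record's metadata. A minimal pure-Python sketch of the ''$gt'' / ''$in'' / equality semantics shown above (illustrative only; the ''matches'' function and sample records are hypothetical, not ChromaDB internals):

```python
# Illustrative sketch of ChromaDB-style ``where`` filter semantics.
# Not Chroma's actual implementation -- just the predicate logic.

def matches(metadata: dict, where: dict) -> bool:
    """Return True if a metadata record satisfies a where clause."""
    for field, condition in where.items():
        value = metadata.get(field)
        if isinstance(condition, dict):
            # Operator form, e.g. {"$gt": 0.8} or {"$in": [...]}
            for op, operand in condition.items():
                if op == "$gt" and not (value is not None and value > operand):
                    return False
                if op == "$in" and value not in operand:
                    return False
        elif value != condition:
            # Bare value means exact equality matching
            return False
    return True

records = [
    {"source": "wiki", "score": 0.9, "topic": "ai"},
    {"source": "blog", "score": 0.5, "topic": "ml"},
    {"source": "wiki", "score": 0.95, "topic": "db"},
]

# Range + set-membership filter, mirroring the wiki examples above
hits = [r for r in records
        if matches(r, {"score": {"$gt": 0.8}, "topic": {"$in": ["ai", "ml"]}})]
print(hits)  # only the first record satisfies both conditions
```

In ChromaDB itself this filtering runs in the metadata store before or alongside the vector search, which is why ''where'' clauses reduce the candidate set rather than re-rank it.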
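The ''metadata={"hnsw:space": "cosine"}'' option in the code example selects the distance function used by the HNSW index. Cosine distance is defined as 1 minus cosine similarity, so vectors pointing the same direction score 0 and orthogonal vectors score 1. A quick self-contained illustration (''cosine_distance'' is a hypothetical helper, not part of the ChromaDB API):

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity, the metric selected by
    the "hnsw:space": "cosine" collection setting."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # same direction -> 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 1.0
```

This is why the query loop in the example prints smaller distances for better matches: with the cosine space, lower values mean more similar embeddings.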
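The "custom functions" mentioned under Pluggable Embedding Functions follow a simple callable interface: given a list of documents, return one vector per document. A toy sketch of that shape, using a hashing trick as a stand-in for a real model (''HashEmbeddingFunction'' is hypothetical and not suitable for production; the exact protocol class to subclass is documented in ChromaDB's embedding-functions guide):

```python
import hashlib

class HashEmbeddingFunction:
    """Toy deterministic embedder: hashes tokens into a fixed-size vector.
    Illustrates the duck-typed shape of a custom embedding function --
    a callable mapping a list of texts to a list of float vectors."""

    def __init__(self, dim: int = 8):
        self.dim = dim

    def __call__(self, input: list) -> list:
        embeddings = []
        for text in input:
            vec = [0.0] * self.dim
            for token in text.lower().split():
                # Bucket each token into one of `dim` slots
                idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % self.dim
                vec[idx] += 1.0
            embeddings.append(vec)
        return embeddings

ef = HashEmbeddingFunction(dim=8)
vectors = ef(["vector databases store embeddings", "knowledge graphs"])
print(len(vectors), len(vectors[0]))  # 2 8
```

An object of this shape could then be passed as the ''embedding_function'' argument when creating a collection, in place of the OpenAI function used in the main example.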