ChromaDB

ChromaDB is an open-source, AI-native embedding database designed to make it easy to build LLM applications with embeddings. With over 27,000 GitHub stars, it provides a simple API for storing, querying, and managing vector embeddings with metadata filtering, making it a popular choice both for rapid prototyping and for production RAG applications.

Repository: github.com/chroma-core/chroma
License: Apache 2.0
Language: Python, Rust
Stars: 27K+
Category: Embedding Database

Architecture

ChromaDB uses a collection-based structure separating vector storage, metadata storage, and embedding generation:

graph TB
  subgraph Client["Client Layer"]
    PyClient[Python Client]
    JSClient[JavaScript Client]
    HTTP[HTTP Client]
  end
  subgraph Core["ChromaDB Core"]
    Collections[Collection Manager]
    EF[Embedding Functions]
    QE[Query Engine]
  end
  subgraph Search["Search Layer"]
    HNSW[HNSW Vector Index]
    MetaFilter[Metadata Filter]
    FTS[Full-Text Search]
    Hybrid[Hybrid Ranker]
  end
  subgraph Storage["Storage Backends"]
    Memory[In-Memory]
    DuckDB[DuckDB + Parquet]
    SQLite[SQLite]
    PG[PostgreSQL]
  end
  subgraph Embeddings["Embedding Providers"]
    OpenAI[OpenAI]
    ST[Sentence Transformers]
    Cohere[Cohere]
    Custom[Custom Functions]
  end
  Client --> Core
  EF --> Embeddings
  Core --> Search
  Search --> Storage

Deployment Modes

ChromaDB supports three deployment modes for different use cases:

Mode                  | Description                        | Use Case
In-Memory             | Fully ephemeral, embedded in app   | Prototyping, testing, MVPs
Persistent (Embedded) | Disk-based via DuckDB+Parquet      | Local apps, development
Client-Server         | HTTP API, multi-tenant, scalable   | Production, distributed systems

Metadata Filtering

ChromaDB combines vector similarity with exact metadata filters using where clauses:

Code Example

import chromadb
from chromadb.utils import embedding_functions
 
# Initialize persistent client
client = chromadb.PersistentClient(path="./chroma_db")
 
# Configure embedding function
ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-api-key",
    model_name="text-embedding-3-small"
)
 
# Create or get collection
collection = client.get_or_create_collection(
    name="documents",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"}
)
 
# Add documents with metadata
collection.add(
    documents=[
        "RAG combines retrieval with generation for grounded answers",
        "Vector databases store high-dimensional embeddings",
        "Knowledge graphs capture entity relationships"
    ],
    metadatas=[
        {"source": "tutorial", "topic": "rag"},
        {"source": "docs", "topic": "database"},
        {"source": "paper", "topic": "knowledge_graph"}
    ],
    ids=["doc1", "doc2", "doc3"]
)
 
# Query with metadata filter
results = collection.query(
    query_texts=["How does retrieval augmented generation work?"],
    n_results=3,
    where={"topic": {"$in": ["rag", "database"]}},
    include=["documents", "metadatas", "distances"]
)
 
for doc, meta, dist in zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0]
):
    print(f"[{dist:.4f}] ({meta['source']}) {doc[:60]}...")
