====== ChromaDB ======
**ChromaDB** is an open-source, AI-native embedding database designed to make it easy to build LLM applications with embeddings. With over **27,000 GitHub stars**, it provides a simple API for storing, querying, and managing vector embeddings with metadata filtering, making it the go-to choice for rapid prototyping and production RAG applications.
| **Repository** | [[https://github.com/chroma-core/chroma|github.com/chroma-core/chroma]] |
| **License** | Apache 2.0 |
| **Language** | Python, Rust |
| **Stars** | 27K+ |
| **Category** | Embedding Database |
===== Key Features =====
* **Simple API** -- NumPy-like simplicity for adding, querying, and managing embeddings
* **AI-Native Design** -- Purpose-built for embeddings and RAG workflows
* **Pluggable Embedding Functions** -- Built-in support for OpenAI, Sentence Transformers, Cohere, Gemini, Jina AI, Ollama, and custom functions
* **Hybrid Search** -- Vector similarity combined with metadata filtering, regex, and full-text search
* **Multiple Deployment Modes** -- In-memory, persistent (embedded), and client-server
* **Framework Integration** -- Native support for LangChain, LlamaIndex, and other RAG frameworks
* **Metadata Arrays** -- Support for string, number, and boolean arrays in metadata filtering (added 2026)
===== Architecture =====
ChromaDB uses a collection-based structure separating vector storage, metadata storage, and embedding generation:
* **Collection Layer** -- Logical groupings of embeddings, documents, IDs, and metadata
* **Embedding Functions** -- Modular, pluggable components that generate vectors from text or images
* **Vector Store** -- HNSW indexing for approximate nearest neighbor search with configurable distance metrics
* **Metadata Store** -- Separate storage for structured metadata enabling efficient filtering
* **Storage Backends** -- SQLite (the default persistent backend since v0.4), DuckDB+Parquet (used by pre-0.4 releases), or PostgreSQL
<code mermaid>
graph TB
    subgraph Client["Client Layer"]
        PyClient[Python Client]
        JSClient[JavaScript Client]
        HTTP[HTTP Client]
    end
    subgraph Core["ChromaDB Core"]
        Collections[Collection Manager]
        EF[Embedding Functions]
        QE[Query Engine]
    end
    subgraph Search["Search Layer"]
        HNSW[HNSW Vector Index]
        MetaFilter[Metadata Filter]
        FTS[Full-Text Search]
        Hybrid[Hybrid Ranker]
    end
    subgraph Storage["Storage Backends"]
        Memory[In-Memory]
        DuckDB[DuckDB + Parquet]
        SQLite[SQLite]
        PG[PostgreSQL]
    end
    subgraph Embeddings["Embedding Providers"]
        OpenAI[OpenAI]
        ST[Sentence Transformers]
        Cohere[Cohere]
        Custom[Custom Functions]
    end
    Client --> Core
    EF --> Embeddings
    Core --> Search
    Search --> Storage
</code>
===== Deployment Modes =====
ChromaDB supports three deployment modes for different use cases:
^ Mode ^ Description ^ Use Case ^
| **In-Memory** | Fully ephemeral, embedded in app | Prototyping, testing, MVPs |
| **Persistent (Embedded)** | Disk-based via SQLite | Local apps, development |
| **Client-Server** | HTTP API, multi-tenant, scalable | Production, distributed systems |
===== Metadata Filtering =====
ChromaDB combines vector similarity with exact metadata filters using ''where'' clauses:
* Equality matching: ''{ "source": "wiki" }''
* Range queries: ''{ "score": { "$gt": 0.8 } }''
* Set membership: ''{ "topic": { "$in": ["ai", "ml"] } }''
* Array metadata (2026): Complex multi-value filters on array fields
===== Code Example =====
<code python>
import chromadb
from chromadb.utils import embedding_functions

# Initialize persistent client
client = chromadb.PersistentClient(path="./chroma_db")

# Configure embedding function
ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-api-key",
    model_name="text-embedding-3-small"
)

# Create or get collection
collection = client.get_or_create_collection(
    name="documents",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"}
)

# Add documents with metadata
collection.add(
    documents=[
        "RAG combines retrieval with generation for grounded answers",
        "Vector databases store high-dimensional embeddings",
        "Knowledge graphs capture entity relationships"
    ],
    metadatas=[
        {"source": "tutorial", "topic": "rag"},
        {"source": "docs", "topic": "database"},
        {"source": "paper", "topic": "knowledge_graph"}
    ],
    ids=["doc1", "doc2", "doc3"]
)

# Query with metadata filter
results = collection.query(
    query_texts=["How does retrieval augmented generation work?"],
    n_results=3,
    where={"topic": {"$in": ["rag", "database"]}},
    include=["documents", "metadatas", "distances"]
)

# Print distance, source, and a snippet of each result
for doc, meta, dist in zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0]
):
    print(f"[{dist:.4f}] ({meta['source']}) {doc[:60]}...")
</code>
===== References =====
* [[https://github.com/chroma-core/chroma|ChromaDB GitHub Repository]]
* [[https://www.trychroma.com|ChromaDB Official Website]]
* [[https://docs.trychroma.com|ChromaDB Documentation]]
===== See Also =====
* [[qdrant|Qdrant]] -- High-performance Rust vector database
* [[milvus|Milvus]] -- Cloud-native vector database at scale
* [[mem0|Mem0]] -- Memory layer supporting ChromaDB as backend
* [[ragflow|RAGFlow]] -- RAG engine for document understanding
* [[lightrag|LightRAG]] -- Knowledge graph RAG framework