====== ChromaDB ======

**ChromaDB** is an open-source, AI-native embedding database designed to make it easy to build LLM applications with embeddings. With over **27,000 GitHub stars**, it provides a simple API for storing, querying, and managing vector embeddings with metadata filtering, making it a popular choice for rapid prototyping and production RAG applications.

| **Repository** | [[https://github.com/chroma-core/chroma|github.com/chroma-core/chroma]] |
| **License** | Apache 2.0 |
| **Language** | Python, Rust |
| **Stars** | 27K+ |
| **Category** | Embedding Database |

===== Key Features =====

  * **Simple API** -- NumPy-like simplicity for adding, querying, and managing embeddings
  * **AI-Native Design** -- Purpose-built for embeddings and RAG workflows
  * **Pluggable Embedding Functions** -- Built-in support for OpenAI, Sentence Transformers, Cohere, Gemini, Jina AI, Ollama, and custom functions
  * **Hybrid Search** -- Vector similarity combined with metadata filtering, regex, and full-text search
  * **Multiple Deployment Modes** -- In-memory, persistent (embedded), and client-server
  * **Framework Integration** -- Native support for LangChain, LlamaIndex, and other RAG frameworks
  * **Metadata Arrays** -- Support for string, number, and boolean arrays in metadata filtering (added 2026)

===== Architecture =====

ChromaDB uses a collection-based structure separating vector storage, metadata storage, and embedding generation:

  * **Collection Layer** -- Logical groupings of embeddings, documents, IDs, and metadata
  * **Embedding Functions** -- Modular, pluggable components that generate vectors from text or images
  * **Vector Store** -- HNSW indexing for approximate nearest neighbor search with configurable distance metrics
  * **Metadata Store** -- Separate storage for structured metadata enabling efficient filtering
  * **Storage Backends** -- DuckDB+Parquet (default persistent), SQLite, or PostgreSQL

<code>
graph TB
    subgraph Client["Client Layer"]
        PyClient[Python Client]
        JSClient[JavaScript Client]
        HTTP[HTTP Client]
    end
    subgraph Core["ChromaDB Core"]
        Collections[Collection Manager]
        EF[Embedding Functions]
        QE[Query Engine]
    end
    subgraph Search["Search Layer"]
        HNSW[HNSW Vector Index]
        MetaFilter[Metadata Filter]
        FTS[Full-Text Search]
        Hybrid[Hybrid Ranker]
    end
    subgraph Storage["Storage Backends"]
        Memory[In-Memory]
        DuckDB[DuckDB + Parquet]
        SQLite[SQLite]
        PG[PostgreSQL]
    end
    subgraph Embeddings["Embedding Providers"]
        OpenAI[OpenAI]
        ST[Sentence Transformers]
        Cohere[Cohere]
        Custom[Custom Functions]
    end
    Client --> Core
    EF --> Embeddings
    Core --> Search
    Search --> Storage
</code>

===== Deployment Modes =====

ChromaDB supports three deployment modes for different use cases:

^ Mode ^ Description ^ Use Case ^
| **In-Memory** | Fully ephemeral, embedded in app | Prototyping, testing, MVPs |
| **Persistent (Embedded)** | Disk-based via DuckDB+Parquet | Local apps, development |
| **Client-Server** | HTTP API, multi-tenant, scalable | Production, distributed systems |

===== Metadata Filtering =====

ChromaDB combines vector similarity with exact metadata filters using ''where'' clauses:

  * Equality matching: ''{ "source": "wiki" }''
  * Range queries: ''{ "score": { "$gt": 0.8 } }''
  * Set membership: ''{ "topic": { "$in": ["ai", "ml"] } }''
  * Array metadata (2026): Complex multi-value filters on array fields

===== Code Example =====

<code python>
import chromadb
from chromadb.utils import embedding_functions

# Initialize persistent client
client = chromadb.PersistentClient(path="./chroma_db")

# Configure embedding function
ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-api-key",
    model_name="text-embedding-3-small"
)

# Create or get collection
collection = client.get_or_create_collection(
    name="documents",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"}
)

# Add documents with metadata
collection.add(
    documents=[
        "RAG combines retrieval with generation for grounded answers",
        "Vector databases store high-dimensional embeddings",
        "Knowledge graphs capture entity relationships"
    ],
    metadatas=[
        {"source": "tutorial", "topic": "rag"},
        {"source": "docs", "topic": "database"},
        {"source": "paper", "topic": "knowledge_graph"}
    ],
    ids=["doc1", "doc2", "doc3"]
)

# Query with metadata filter
results = collection.query(
    query_texts=["How does retrieval augmented generation work?"],
    n_results=3,
    where={"topic": {"$in": ["rag", "database"]}},
    include=["documents", "metadatas", "distances"]
)

for doc, meta, dist in zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0]
):
    print(f"[{dist:.4f}] ({meta['source']}) {doc[:60]}...")
</code>

===== References =====

  * [[https://github.com/chroma-core/chroma|ChromaDB GitHub Repository]]
  * [[https://www.trychroma.com|ChromaDB Official Website]]
  * [[https://docs.trychroma.com|ChromaDB Documentation]]

===== See Also =====

  * [[qdrant|Qdrant]] -- High-performance Rust vector database
  * [[milvus|Milvus]] -- Cloud-native vector database at scale
  * [[mem0|Mem0]] -- Memory layer supporting ChromaDB as backend
  * [[ragflow|RAGFlow]] -- RAG engine for document understanding
  * [[lightrag|LightRAG]] -- Knowledge graph RAG framework
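The ''where'' operators described in the Metadata Filtering section act as simple predicates evaluated over each record's metadata. A minimal pure-Python sketch of the ''$gt'' / ''$in'' / equality semantics shown above (illustrative only; the ''matches'' function and sample records are hypothetical, not ChromaDB internals):

```python
# Illustrative sketch of ChromaDB-style ``where`` filter semantics.
# Not Chroma's actual implementation -- just the predicate logic.

def matches(metadata: dict, where: dict) -> bool:
    """Return True if a metadata record satisfies a where clause."""
    for field, condition in where.items():
        value = metadata.get(field)
        if isinstance(condition, dict):
            # Operator form, e.g. {"$gt": 0.8} or {"$in": [...]}
            for op, operand in condition.items():
                if op == "$gt" and not (value is not None and value > operand):
                    return False
                if op == "$in" and value not in operand:
                    return False
        elif value != condition:
            # Bare value means exact equality matching
            return False
    return True

records = [
    {"source": "wiki", "score": 0.9, "topic": "ai"},
    {"source": "blog", "score": 0.5, "topic": "ml"},
    {"source": "wiki", "score": 0.95, "topic": "db"},
]

# Range + set-membership filter, mirroring the wiki examples above
hits = [r for r in records
        if matches(r, {"score": {"$gt": 0.8}, "topic": {"$in": ["ai", "ml"]}})]
print(hits)  # only the first record satisfies both conditions
```

In ChromaDB itself this filtering runs in the metadata store before or alongside the vector search, which is why ''where'' clauses reduce the candidate set rather than re-rank it.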
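The ''metadata={"hnsw:space": "cosine"}'' option in the code example selects the distance function used by the HNSW index. Cosine distance is defined as 1 minus cosine similarity, so vectors pointing the same direction score 0 and orthogonal vectors score 1. A quick self-contained illustration (''cosine_distance'' is a hypothetical helper, not part of the ChromaDB API):

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity, the metric selected by
    the "hnsw:space": "cosine" collection setting."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # same direction -> 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 1.0
```

This is why the query loop in the example prints smaller distances for better matches: with the cosine space, lower values mean more similar embeddings.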
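The "custom functions" mentioned under Pluggable Embedding Functions follow a simple callable interface: given a list of documents, return one vector per document. A toy sketch of that shape, using a hashing trick as a stand-in for a real model (''HashEmbeddingFunction'' is hypothetical and not suitable for production; the exact protocol class to subclass is documented in ChromaDB's embedding-functions guide):

```python
import hashlib

class HashEmbeddingFunction:
    """Toy deterministic embedder: hashes tokens into a fixed-size vector.
    Illustrates the duck-typed shape of a custom embedding function --
    a callable mapping a list of texts to a list of float vectors."""

    def __init__(self, dim: int = 8):
        self.dim = dim

    def __call__(self, input: list) -> list:
        embeddings = []
        for text in input:
            vec = [0.0] * self.dim
            for token in text.lower().split():
                # Bucket each token into one of `dim` slots
                idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % self.dim
                vec[idx] += 1.0
            embeddings.append(vec)
        return embeddings

ef = HashEmbeddingFunction(dim=8)
vectors = ef(["vector databases store embeddings", "knowledge graphs"])
print(len(vectors), len(vectors[0]))  # 2 8
```

An object of this shape could then be passed as the ''embedding_function'' argument when creating a collection, in place of the OpenAI function used in the main example.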