====== RAGFlow ======
**RAGFlow** is an open-source RAG (Retrieval-Augmented Generation) engine developed by InfiniFlow. It specializes in deep document understanding through advanced parsing: OCR, table structure recognition, and document layout analysis. The project has over **76,000 GitHub stars** and targets complex documents that many other RAG systems struggle with.
| **Repository** | [[https://github.com/infiniflow/ragflow|github.com/infiniflow/ragflow]] |
| **License** | Apache 2.0 |
| **Language** | Python |
| **Stars** | 76K+ |
| **Category** | RAG Engine |
===== Key Features =====
* **Deep Document Understanding** -- Advanced parsing of complex PDFs with OCR, table extraction, and layout recognition
* **DeepDoc Engine** -- In-house document analysis engine (shipped in the open-source repository) handling layout analysis, figure extraction, and rotation correction for scanned PDFs
* **Table Structure Recognition (TSR)** -- Fine-tuned YOLOv8-based models, reported to outperform AWS Textract on complex tables
* **Document Layout Recognition (DLR)** -- Identifies titles, paragraphs, figures, and multi-column layouts
* **Visual Model Flexibility** -- Independent selection of visual models per task (OCR/TSR/DLR) to balance speed and accuracy
* **Table of Contents Extraction** -- Uses LLMs during indexing to enable long-context RAG with structural context
* **Multi-Format Support** -- PDF, DOCX, XLSX, CSV, images, emails, and plain text
===== Architecture =====
RAGFlow decouples data extraction from chunking (since v0.17.0), allowing independent selection of visual models for each processing task. The pipeline flows through ingestion, parsing, embedding, retrieval, and generation stages.
* **Parser Layer** -- DeepDoc for advanced PDF handling, PlainParser for text, VisionParser for image-heavy documents
* **Extraction Layer** -- Separate OCR, TSR, and DLR models that can be independently configured
* **Chunking Layer** -- Structure-preserving strategies that maintain table rows with headers as self-contained units
* **Retrieval Layer** -- Hybrid search combining vector similarity with structural context from table of contents
* **Generation Layer** -- LLM-based answer generation with source attribution
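
The Chunking Layer idea of keeping table rows together with their headers can be illustrated with a minimal sketch. This is not RAGFlow's implementation; the function name ''chunk_table_rows'' and the ''rows_per_chunk'' parameter are hypothetical and exist only for illustration. Each chunk repeats the header row so that it stays interpretable when retrieved on its own.

```python
from typing import List, Sequence


def chunk_table_rows(header: Sequence[str],
                     rows: Sequence[Sequence[str]],
                     rows_per_chunk: int = 2) -> List[str]:
    """Split a table into text chunks that each repeat the header row."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    header_line = " | ".join(header)
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start:start + rows_per_chunk]
        body = "\n".join(" | ".join(str(cell) for cell in row) for row in group)
        # Prepend the header so every chunk is self-contained.
        chunks.append(f"{header_line}\n{body}")
    return chunks


# Illustrative data only.
table_header = ["Quarter", "Revenue"]
table_rows = [["Q1", "10"], ["Q2", "12"], ["Q3", "15"]]
table_chunks = chunk_table_rows(table_header, table_rows, rows_per_chunk=2)
for chunk in table_chunks:
    print(chunk)
    print("---")
```

With three rows and two rows per chunk, the sketch yields two chunks, and both start with the ''Quarter | Revenue'' header line.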
  graph TB
      subgraph Input["Document Input"]
          PDF[PDF Documents]
          DOCX[Word / Excel]
          IMG[Images]
          TXT[Text / Email]
      end
      subgraph Parsing["Deep Document Parsing"]
          DeepDoc[DeepDoc Engine]
          OCR[OCR Module]
          TSR[Table Structure Recognition]
          DLR[Layout Recognition]
      end
      subgraph Processing["Processing Pipeline"]
          Chunk[Chunking Engine]
          TOC[TOC Extraction]
          Embed[Embedding Generator]
      end
      subgraph Storage["Storage Layer"]
          VDB[(Vector Database)]
          Meta[(Metadata Store)]
      end
      subgraph Query["Query Pipeline"]
          Retrieve[Hybrid Retrieval]
          Rerank[Reranking]
          Generate[LLM Generation]
      end
      Input --> Parsing
      DeepDoc --> OCR
      DeepDoc --> TSR
      DeepDoc --> DLR
      Parsing --> Processing
      Processing --> Storage
      Storage --> Query
      TOC --> Retrieve
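
As a rough illustration of how vector similarity might be combined with structural context from a table of contents, the sketch below blends a cosine score with a heading-overlap score. The weighting scheme, the ''alpha'' parameter, and the chunk fields (''id'', ''vector'', ''toc_path'') are assumptions made for this example and are not taken from RAGFlow's code.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def heading_overlap(query: str, toc_path: Sequence[str]) -> float:
    """Fraction of query terms that appear in the chunk's TOC headings."""
    query_terms = set(query.lower().split())
    heading_terms = set(" ".join(toc_path).lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & heading_terms) / len(query_terms)


def hybrid_rank(query: str, query_vec: Sequence[float],
                chunks: List[Dict], alpha: float = 0.7) -> List[str]:
    """Rank chunk ids by a weighted sum of vector and structural scores."""
    scored = []
    for chunk in chunks:
        score = (alpha * cosine(query_vec, chunk["vector"])
                 + (1 - alpha) * heading_overlap(query, chunk["toc_path"]))
        scored.append((score, chunk["id"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk_id for _, chunk_id in scored]


# Illustrative data only: two chunks with toy 2-d embeddings.
example_chunks = [
    {"id": "a", "vector": [1.0, 0.0], "toc_path": ["Introduction"]},
    {"id": "b", "vector": [0.8, 0.6], "toc_path": ["Financial Results", "Q3 Revenue"]},
]
ranking = hybrid_rank("q3 revenue", [1.0, 0.0], example_chunks)
print(ranking)
```

In this toy data, chunk ''a'' has the closer embedding, but chunk ''b'' sits under a heading that matches the query, so the structural term lifts it to the top of the ranking.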
===== Document Parsing Details =====
RAGFlow's parsing capabilities are its core differentiator:
* **PDF Parsing** -- The most robust capability; DeepDoc handles layout analysis, figure extraction, and auto-detects rotation (90/180/270 degrees) via OCR
* **Table Extraction** -- Identifies complex tables including single-column and borderless layouts, outputting HTML/Markdown to preserve structure and relationships
* **Spreadsheets** -- RAGFlowExcelParser with openpyxl/pandas extracts cell values while retaining structure as HTML
* **DOCX** -- Uses python-docx for text and table extraction
* **Images** -- OCR and optional Vision Language Model processing
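
One plausible way to detect page rotation via OCR is to try each candidate angle and keep the one that gives the highest recognition confidence. The sketch below shows that selection logic only. The ''rotate'' and ''ocr_confidence'' callables are hypothetical stand-ins for a real image rotation routine and OCR scorer, and nothing here should be read as DeepDoc's actual code.

```python
from typing import Callable, TypeVar

Page = TypeVar("Page")
CANDIDATE_ANGLES = (0, 90, 180, 270)


def detect_correction_angle(page: Page,
                            rotate: Callable[[Page, int], Page],
                            ocr_confidence: Callable[[Page], float]) -> int:
    """Return the rotation (in degrees) whose OCR confidence is highest."""
    scores = {angle: ocr_confidence(rotate(page, angle))
              for angle in CANDIDATE_ANGLES}
    return max(CANDIDATE_ANGLES, key=lambda angle: scores[angle])


# Toy stand-ins: a "page" is represented only by its orientation in degrees.
def toy_rotate(orientation: int, angle: int) -> int:
    return (orientation + angle) % 360


def toy_confidence(orientation: int) -> float:
    # Pretend OCR only reads the page well when it is upright.
    return 0.95 if orientation == 0 else 0.20


# A page scanned with a 90-degree rotation needs a 270-degree correction.
correction = detect_correction_angle(90, toy_rotate, toy_confidence)
print(correction)
```

Passing the rotation and scoring functions as parameters keeps the selection logic independent of any particular imaging or OCR library.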
===== Code Example =====
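  # Endpoint paths and payloads follow the RAGFlow HTTP API reference;
  # verify them against the version you have installed.
  import requests
  RAGFLOW_API = "http://localhost:9380/api/v1"
  API_KEY = "ragflow-your-api-key"
  HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
  # Create a knowledge base (dataset)
  dataset = requests.post(
      f"{RAGFLOW_API}/datasets",
      headers=HEADERS,
      json={"name": "technical_docs", "chunk_method": "naive"},
  ).json()
  dataset_id = dataset["data"]["id"]
  # Upload a document (multipart form, so no JSON Content-Type header here)
  with open("complex_report.pdf", "rb") as f:
      upload = requests.post(
          f"{RAGFLOW_API}/datasets/{dataset_id}/documents",
          headers={"Authorization": f"Bearer {API_KEY}"},
          files={"file": f},
      ).json()
  document_id = upload["data"][0]["id"]
  # Start parsing; this runs asynchronously, so wait for it to finish before querying
  requests.post(
      f"{RAGFLOW_API}/datasets/{dataset_id}/chunks",
      headers=HEADERS,
      json={"document_ids": [document_id]},
  )
  # Create a chat assistant bound to the dataset (requires parsed chunks)
  chat = requests.post(
      f"{RAGFLOW_API}/chats",
      headers=HEADERS,
      json={"name": "technical_docs_chat", "dataset_ids": [dataset_id]},
  ).json()
  chat_id = chat["data"]["id"]
  # Query the knowledge base with RAG through the chat assistant
  answer = requests.post(
      f"{RAGFLOW_API}/chats/{chat_id}/completions",
      headers=HEADERS,
      json={"question": "What were the Q3 revenue figures?", "stream": False},
  ).json()
  print(answer["data"]["answer"])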
===== References =====
* [[https://github.com/infiniflow/ragflow|RAGFlow GitHub Repository]]
* [[https://ragflow.io|RAGFlow Official Website]]
* [[https://ragflow.io/docs/select_pdf_parser|RAGFlow PDF Parser Documentation]]
* [[https://github.com/infiniflow/ragflow/blob/main/deepdoc/README.md|DeepDoc README]]
===== See Also =====
* [[dify|Dify]] -- Agentic workflow platform with RAG capabilities
* [[lightrag|LightRAG]] -- Knowledge graph-enhanced RAG
* [[milvus|Milvus]] -- Vector database for RAG storage
* [[chromadb|ChromaDB]] -- AI-native embedding database