====== RAGFlow ======

**RAGFlow** is an open-source RAG (Retrieval-Augmented Generation) engine developed by Infiniflow. It specializes in deep document understanding through OCR, table structure recognition, and document layout analysis. The project has over **76,000 GitHub stars** and targets complex documents that many RAG systems struggle with.

| **Repository** | [[https://github.com/infiniflow/ragflow|github.com/infiniflow/ragflow]] |
| **License** | Apache 2.0 |
| **Language** | Python |
| **Stars** | 76K+ |
| **Category** | RAG Engine |

===== Key Features =====

  * **Deep Document Understanding** -- Advanced parsing of complex PDFs with OCR, table extraction, and layout recognition
  * **DeepDoc Engine** -- RAGFlow's own document analysis engine, shipped in the same repository, handling layout analysis, figure extraction, and rotation correction for scanned PDFs
  * **Table Structure Recognition (TSR)** -- Fine-tuned YOLOv8-based models that reportedly outperform AWS Textract on complex tables
  * **Document Layout Recognition (DLR)** -- Identifies titles, paragraphs, figures, and multi-column layouts
  * **Visual Model Flexibility** -- Autonomous selection of visual models per task (OCR/TSR/DLR) to balance speed and accuracy
  * **Table of Contents Extraction** -- Uses LLMs during indexing to enable long-context RAG with structural context
  * **Multi-Format Support** -- PDF, DOCX, XLSX, CSV, images, emails, and plain text

===== Architecture =====

Since v0.17.0, RAGFlow decouples data extraction from chunking, so a visual model can be selected independently for each processing task. The pipeline flows through ingestion, parsing, embedding, retrieval, and generation stages.
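
As a rough illustration of this decoupling, the sketch below separates an extraction stage, where the OCR and layout models are chosen per task, from a chunking stage that only sees the extracted blocks. All names here (''Block'', ''MODELS'', ''extract'', ''chunk'') are hypothetical stand-ins and not part of RAGFlow's code base, and the "models" are trivial string functions used only to show the shape of the idea.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Block:
    kind: str  # e.g. "title" or "paragraph"
    text: str


# Hypothetical registry of stand-in "models", keyed by task and then by model name.
MODELS: Dict[str, Dict[str, Callable[[str], str]]] = {
    "ocr": {
        "fast": str.strip,                          # only trims the edges
        "accurate": lambda s: " ".join(s.split()),  # also collapses inner whitespace
    },
    "dlr": {
        "rule_based": lambda s: "title" if s.isupper() else "paragraph",
    },
}


def extract(raw_regions: List[str], ocr: str, dlr: str) -> List[Block]:
    """Extraction stage: run the chosen OCR and layout models on each region."""
    read = MODELS["ocr"][ocr]
    classify = MODELS["dlr"][dlr]
    blocks = []
    for region in raw_regions:
        text = read(region)
        blocks.append(Block(kind=classify(text), text=text))
    return blocks


def chunk(blocks: List[Block]) -> List[str]:
    """Chunking stage: start a new chunk at every title; it never sees the models."""
    chunks: List[List[str]] = []
    for block in blocks:
        if block.kind == "title" or not chunks:
            chunks.append([])
        chunks[-1].append(block.text)
    return ["\n".join(parts) for parts in chunks]


regions = ["  INTRODUCTION ", "RAGFlow  parses   documents. ", " METHODS", "OCR runs first."]
fast = chunk(extract(regions, ocr="fast", dlr="rule_based"))
accurate = chunk(extract(regions, ocr="accurate", dlr="rule_based"))
```

Swapping ''ocr="fast"'' for ''ocr="accurate"'' changes the extracted text but leaves ''chunk'' untouched, which is the property the decoupled design is after.
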
  * **Parser Layer** -- DeepDoc for advanced PDF handling, PlainParser for text, VisionParser for image-heavy documents
  * **Extraction Layer** -- Separate OCR, TSR, and DLR models that can be independently configured
  * **Chunking Layer** -- Structure-preserving strategies that maintain table rows with headers as self-contained units
  * **Retrieval Layer** -- Hybrid search combining vector similarity with structural context from the table of contents
  * **Generation Layer** -- LLM-based answer generation with source attribution

<mermaid>
graph TB
    subgraph Input["Document Input"]
        PDF[PDF Documents]
        DOCX[Word / Excel]
        IMG[Images]
        TXT[Text / Email]
    end
    subgraph Parsing["Deep Document Parsing"]
        DeepDoc[DeepDoc Engine]
        OCR[OCR Module]
        TSR[Table Structure Recognition]
        DLR[Layout Recognition]
    end
    subgraph Processing["Processing Pipeline"]
        Chunk[Chunking Engine]
        TOC[TOC Extraction]
        Embed[Embedding Generator]
    end
    subgraph Storage["Storage Layer"]
        VDB[(Vector Database)]
        Meta[(Metadata Store)]
    end
    subgraph Query["Query Pipeline"]
        Retrieve[Hybrid Retrieval]
        Rerank[Reranking]
        Generate[LLM Generation]
    end
    Input --> Parsing
    DeepDoc --> OCR
    DeepDoc --> TSR
    DeepDoc --> DLR
    Parsing --> Processing
    Processing --> Storage
    Storage --> Query
    TOC --> Retrieve
</mermaid>

===== Document Parsing Details =====

RAGFlow's parsing capabilities are its core differentiator:

  * **PDF Parsing** -- The most robust capability; DeepDoc handles layout analysis and figure extraction, and auto-detects page rotation (90/180/270 degrees) via OCR
  * **Table Extraction** -- Identifies complex tables, including single-column and borderless layouts, and outputs HTML/Markdown to preserve structure and relationships
  * **Spreadsheets** -- RAGFlowExcelParser uses openpyxl/pandas to extract cell values while retaining structure as HTML
  * **DOCX** -- Uses python-docx for text and table extraction
  * **Images** -- OCR and optional Vision Language Model processing

===== Code Example =====

<code python>
import time

import requests

RAGFLOW_API = "http://localhost:9380/api/v1"
API_KEY = "ragflow-your-api-key"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

# Endpoints and payloads follow the RAGFlow HTTP API reference; check them
# against the version you deploy.

# Create a knowledge base (dataset)
dataset = requests.post(f"{RAGFLOW_API}/datasets", headers=HEADERS,
    json={"name": "technical_docs", "chunk_method": "naive"}
).json()
dataset_id = dataset["data"]["id"]

# Upload a document (multipart upload, so the JSON Content-Type header is omitted)
with open("complex_report.pdf", "rb") as f:
    upload = requests.post(
        f"{RAGFLOW_API}/datasets/{dataset_id}/documents",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": f}
    ).json()
document_id = upload["data"][0]["id"]

# Start parsing, then poll until it finishes; parsing runs asynchronously
requests.post(f"{RAGFLOW_API}/datasets/{dataset_id}/chunks", headers=HEADERS,
    json={"document_ids": [document_id]})
while True:
    docs = requests.get(f"{RAGFLOW_API}/datasets/{dataset_id}/documents",
        headers=HEADERS, params={"id": document_id}).json()
    if docs["data"]["docs"][0]["run"] in ("DONE", "FAIL"):
        break
    time.sleep(2)

# Create a chat assistant bound to the dataset
chat = requests.post(f"{RAGFLOW_API}/chats", headers=HEADERS,
    json={"name": "technical_docs_chat", "dataset_ids": [dataset_id]}
).json()
chat_id = chat["data"]["id"]

# Query the knowledge base with RAG
answer = requests.post(f"{RAGFLOW_API}/chats/{chat_id}/completions", headers=HEADERS,
    json={"question": "What were the Q3 revenue figures?", "stream": False}
).json()
print(answer["data"]["answer"])
</code>

===== References =====

  * [[https://github.com/infiniflow/ragflow|RAGFlow GitHub Repository]]
  * [[https://ragflow.io|RAGFlow Official Website]]
  * [[https://ragflow.io/docs/select_pdf_parser|RAGFlow PDF Parser Documentation]]
  * [[https://github.com/infiniflow/ragflow/blob/main/deepdoc/README.md|DeepDoc README]]

===== See Also =====

  * [[dify|Dify]] -- Agentic workflow platform with RAG capabilities
  * [[lightrag|LightRAG]] -- Knowledge graph-enhanced RAG
  * [[milvus|Milvus]] -- Vector database for RAG storage
  * [[chromadb|ChromaDB]] -- AI-native embedding database