====== RAGFlow ======
**RAGFlow** is an open-source RAG (Retrieval-Augmented Generation) engine developed by Infiniflow. It specializes in deep document understanding, with parsing capabilities that include OCR, table structure recognition, and document layout analysis.((https://github.com/infiniflow/ragflow)) With over **76,000 [[github|GitHub]] stars**, it is aimed at complex documents that other RAG systems struggle with.

| **Repository** | [[https://github.com/infiniflow/ragflow|github.com/infiniflow/ragflow]] |
| **License** | Apache 2.0 |
| **Language** | Python |
| **Stars** | 76K+ |
| **Category** | RAG Engine(([[https://alphasignalai.substack.com/p/mineru-diffusion-ocr-has-been-reading|AlphaSignal (2026)]])) |

===== Key Features =====
  * **Deep Document Understanding**: Advanced parsing of complex PDFs with OCR, table extraction, and layout recognition((https://ragflow.io/docs/select_pdf_parser))
  * **DeepDoc Engine**: RAGFlow's own document analysis engine, handling layout analysis, figure extraction, and rotation correction for scanned PDFs((https://github.com/infiniflow/ragflow/blob/main/deepdoc/README.md))
  * **Table Structure Recognition (TSR)**: Fine-tuned YOLOv8-based models that outperform AWS Textract on complex tables
  * **Document Layout Recognition (DLR)**: Identifies titles, paragraphs, figures, and multi-column layouts
  * **Visual Model Flexibility**: Independent selection of visual models per task (OCR/TSR/DLR) to balance speed and accuracy
  * **Table of Contents Extraction**: Uses LLMs during indexing to enable long-context RAG with structural context
  * **Multi-Format Support**: PDF, DOCX, XLSX, CSV, images, emails, and plain text

===== Architecture =====
RAGFlow decouples data extraction from chunking (since v0.17.0), allowing independent selection of visual models for each processing task. The pipeline flows through ingestion, parsing, embedding, retrieval, and generation stages.((https://ragflow.io))

  * **Parser Layer**: DeepDoc for advanced PDF handling, PlainParser for text, VisionParser for image-heavy documents
  * **Extraction Layer**: Separate OCR, TSR, and DLR models that can be independently configured
  * **Chunking Layer**: Structure-preserving strategies that maintain table rows with headers as self-contained units
  * **Retrieval Layer**: [[hybrid_search|Hybrid search]] combining vector similarity with structural context from the table of contents
  * **Generation Layer**: LLM-based answer generation with source attribution
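
The structure-preserving idea behind the chunking layer can be illustrated with a minimal sketch. It assumes a table has already been extracted into a header and a list of rows. The function name ''chunk_table'' and the ''rows_per_chunk'' parameter are hypothetical and are not part of RAGFlow's API. Each chunk repeats the header, so a retrieved chunk stays readable on its own.

<code python>
from typing import List


def chunk_table(header: List[str], rows: List[List[str]],
                rows_per_chunk: int = 2) -> List[str]:
    """Split a table into text chunks, repeating the header in every chunk."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    header_line = " | ".join(header)
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        # Each chunk carries the header plus a slice of the rows.
        body = [" | ".join(row) for row in rows[start:start + rows_per_chunk]]
        chunks.append("\n".join([header_line, *body]))
    return chunks


# Illustrative data only
header = ["Quarter", "Revenue"]
rows = [["Q1", "10"], ["Q2", "12"], ["Q3", "15"]]
chunks = chunk_table(header, rows)
for chunk in chunks:
    print(chunk)
    print("---")
</code>

Repeating the header costs a few tokens per chunk, but it means a row such as ''Q3 | 15'' is never retrieved without its column names.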

<mermaid>
graph TB
    subgraph Input["Document Input"]
        PDF[PDF Documents]
        DOCX[Word / Excel]
        IMG[Images]
        TXT[Text / Email]
    end
    subgraph Parsing["Deep Document Parsing"]
        DeepDoc[DeepDoc Engine]
        OCR[OCR Module]
        TSR[Table Structure Recognition]
        DLR[Layout Recognition]
    end
    subgraph Processing["Processing Pipeline"]
        Chunk[Chunking Engine]
        TOC[TOC Extraction]
        Embed[Embedding Generator]
    end
    subgraph Storage["Storage Layer"]
        VDB[(Vector Database)]
        Meta[(Metadata Store)]
    end
    subgraph Query["Query Pipeline"]
        Retrieve[Hybrid Retrieval]
        Rerank[Reranking]
        Generate[LLM Generation]
    end
    Input --> Parsing
    DeepDoc --> OCR
    DeepDoc --> TSR
    DeepDoc --> DLR
    Parsing --> Processing
    Processing --> Storage
    Storage --> Query
    TOC --> Retrieve
</mermaid>
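
To make the retrieval step more concrete, here is a small sketch of one way to blend vector similarity with structural context. It is an illustration under stated assumptions, not RAGFlow's actual scoring. Each chunk is assumed to carry an embedding and the table-of-contents heading it belongs to. The ''heading_overlap'' measure and the ''vector_weight'' parameter are hypothetical.

<code python>
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def heading_overlap(query: str, heading: str) -> float:
    """Fraction of query terms that appear in the chunk's TOC heading."""
    q_terms = set(query.lower().split())
    h_terms = set(heading.lower().split())
    return len(q_terms & h_terms) / len(q_terms) if q_terms else 0.0


def hybrid_rank(query: str, query_vec: Sequence[float],
                chunks: List[Dict], vector_weight: float = 0.7) -> List[str]:
    """Return chunk ids ordered by a weighted blend of both signals."""
    scored = []
    for chunk in chunks:
        score = (vector_weight * cosine(query_vec, chunk["vector"])
                 + (1 - vector_weight) * heading_overlap(query, chunk["heading"]))
        scored.append((score, chunk["id"]))
    return [chunk_id for _, chunk_id in sorted(scored, reverse=True)]


# Toy data: two chunks with 2-D embeddings
chunks = [
    {"id": "a", "vector": [1.0, 0.0], "heading": "Introduction"},
    {"id": "b", "vector": [0.8, 0.6], "heading": "Q3 revenue summary"},
]
ranking = hybrid_rank("q3 revenue", [1.0, 0.0], chunks)
print(ranking)
</code>

In this toy data, chunk ''a'' is the closer vector match, but chunk ''b'' ranks first because its heading matches the query. Setting ''vector_weight'' to 1.0 reduces the ranking to pure vector search.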

===== Document Parsing Details =====
RAGFlow's parsing capabilities are the core differentiator:((https://ragflow.io/docs/select_pdf_parser))

  * **PDF Parsing**: The most robust capability. DeepDoc handles layout analysis and figure extraction, and auto-detects rotation (90/180/270 degrees) via OCR
  * **Table Extraction**: Identifies complex tables, including single-column and borderless layouts, and outputs HTML/Markdown to preserve structure and relationships
  * **Spreadsheets**: RAGFlowExcelParser, using openpyxl/pandas, extracts cell values while retaining structure as HTML
  * **DOCX**: Uses python-docx for text and table extraction
  * **Images**: OCR and optional Vision Language Model processing
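
OCR-based rotation detection can be sketched as follows: try each candidate orientation and keep the one the OCR engine is most confident about. This is a simplified illustration, not DeepDoc's implementation. The page is modeled as a small grid of characters, and the ''confidence'' callback stands in for a real OCR confidence score. Both are assumptions made for this example.

<code python>
from typing import Callable, List

Grid = List[List[str]]


def rotate_cw(page: Grid) -> Grid:
    """Rotate a 2-D grid 90 degrees clockwise."""
    return [list(row) for row in zip(*page[::-1])]


def detect_rotation(page: Grid, confidence: Callable[[Grid], float]) -> int:
    """Return the clockwise correction (0/90/180/270) with the best score."""
    best_angle, best_score = 0, confidence(page)
    candidate = page
    for angle in (90, 180, 270):
        candidate = rotate_cw(candidate)
        score = confidence(candidate)
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# Toy stand-in for OCR: full confidence only when the text reads upright.
UPRIGHT = [["A", "B"], ["C", "D"]]


def toy_confidence(grid: Grid) -> float:
    return 1.0 if grid == UPRIGHT else 0.0


scanned = rotate_cw(UPRIGHT)  # simulate a page scanned sideways
correction = detect_rotation(scanned, toy_confidence)
print(correction)
</code>

A real pipeline would rotate the page image and compare recognition scores. The selection logic stays the same.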

===== OCR and Vision Models =====
RAGFlow integrates multiple OCR and vision-based approaches for robust document understanding. Beyond its built-in OCR capabilities, the system can leverage complementary open-source models. **MinerU-Diffusion** is a 2.5B-parameter open-source OCR model released by researchers from [[shanghai_ai_lab|Shanghai AI Lab]] and Peking University. It supports layout detection, plain text recognition, LaTeX formula output, and table recognition, with high throughput for document processing pipelines.(([[https://alphasignalai.substack.com/p/mineru-diffusion-ocr-has-been-reading|AlphaSignal AI, "MinerU-Diffusion: OCR Has Been Reading" (2026)]]))

RAGFlow's visual model flexibility allows users to configure which OCR and parsing models suit their specific document types and performance requirements, enabling integration with specialized open-source models where appropriate.
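
One way to think about this speed/accuracy trade-off is sketched below. It picks, for each task, the fastest model that meets an accuracy floor. It is a hypothetical illustration and does not reflect RAGFlow's configuration schema. The model names, accuracy scores, and timings are invented for the example.

<code python>
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VisualModel:
    name: str                # hypothetical model label
    accuracy: float          # assumed benchmark score in [0, 1]
    seconds_per_page: float  # assumed processing cost


def pick_model(candidates: List[VisualModel], min_accuracy: float) -> VisualModel:
    """Fastest candidate meeting the accuracy floor, else the most accurate."""
    good_enough = [m for m in candidates if m.accuracy >= min_accuracy]
    if good_enough:
        return min(good_enough, key=lambda m: m.seconds_per_page)
    return max(candidates, key=lambda m: m.accuracy)


# Invented candidates for the three tasks named in this article
CANDIDATES = {
    "ocr": [VisualModel("ocr-small", 0.90, 0.2), VisualModel("ocr-large", 0.97, 1.5)],
    "tsr": [VisualModel("tsr-small", 0.80, 0.3), VisualModel("tsr-large", 0.95, 2.0)],
    "dlr": [VisualModel("dlr-small", 0.92, 0.1)],
}
ACCURACY_FLOOR = {"ocr": 0.85, "tsr": 0.90, "dlr": 0.95}

selection = {task: pick_model(models, ACCURACY_FLOOR[task]).name
             for task, models in CANDIDATES.items()}
print(selection)
</code>

Here OCR gets the small model because it is fast and accurate enough. TSR needs the large model to reach its floor. DLR falls back to its only candidate because nothing meets the floor.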

===== Code Example =====
<code python>
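import requests

# Endpoint paths follow the RAGFlow HTTP API reference; verify them
# against the version you have deployed.
RAGFLOW_API = "http://localhost:9380/api/v1"
API_KEY = "ragflow-your-api-key"
AUTH = {"Authorization": f"Bearer {API_KEY}"}
HEADERS = {**AUTH, "Content-Type": "application/json"}

# Create a knowledge base (dataset)
dataset = requests.post(
    f"{RAGFLOW_API}/datasets",
    headers=HEADERS,
    json={"name": "technical_docs", "chunk_method": "naive"},
).json()
dataset_id = dataset["data"]["id"]

# Upload a document (multipart upload, so no JSON Content-Type header)
with open("complex_report.pdf", "rb") as f:
    upload = requests.post(
        f"{RAGFLOW_API}/datasets/{dataset_id}/documents",
        headers=AUTH,
        files={"file": f},
    ).json()
document_ids = [doc["id"] for doc in upload["data"]]

# Start parsing. Parsing runs asynchronously, so wait until it has
# finished (for example by polling the document list) before querying.
requests.post(
    f"{RAGFLOW_API}/datasets/{dataset_id}/chunks",
    headers=HEADERS,
    json={"document_ids": document_ids},
)

# Create a chat assistant bound to the dataset (dataset_ids is a list)
chat = requests.post(
    f"{RAGFLOW_API}/chats",
    headers=HEADERS,
    json={"name": "technical_docs_assistant", "dataset_ids": [dataset_id]},
).json()
chat_id = chat["data"]["id"]

# Open a session with the assistant
session = requests.post(
    f"{RAGFLOW_API}/chats/{chat_id}/sessions",
    headers=HEADERS,
    json={"name": "q3_questions"},
).json()

# Query the knowledge base with RAG
answer = requests.post(
    f"{RAGFLOW_API}/chats/{chat_id}/completions",
    headers=HEADERS,
    json={"question": "What were the Q3 revenue figures?",
          "session_id": session["data"]["id"], "stream": False},
).json()
print(answer["data"]["answer"])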
</code>

===== See Also =====
  * [[how_to_build_a_rag_pipeline|How to Build a RAG Pipeline]]
  * [[ragas|RAGAS: RAG Evaluation Framework]]
  * [[rag_phases|Phases of a RAG System]]
  * [[rag_in_ai|Retrieval-Augmented Generation (RAG) in AI]]
  * [[chunking_strategies|Chunking Strategies]]

===== References =====