RAGFlow

RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine developed by InfiniFlow. It specializes in deep document understanding, using advanced parsing capabilities that include OCR, table structure recognition, and document layout analysis. With over 76,000 GitHub stars, it targets complex documents that other RAG systems struggle to handle.

Repository	github.com/infiniflow/ragflow
License	Apache 2.0
Language	Python
Stars	76K+
Category	RAG Engine

Key Features

Deep Document Understanding – Advanced parsing of complex PDFs with OCR, table extraction, and layout recognition
DeepDoc Engine – In-house document analysis engine handling layout analysis, figure extraction, and rotation correction for scanned PDFs
Table Structure Recognition (TSR) – Fine-tuned YOLOv8-based models, reported to outperform AWS Textract on complex tables
Document Layout Recognition (DLR) – Identifies titles, paragraphs, figures, and multi-column layouts
Visual Model Flexibility – Autonomous selection of visual models per task (OCR/TSR/DLR) to balance speed and accuracy
Table of Contents Extraction – Uses LLMs during indexing to enable long-context RAG with structural context
Multi-Format Support – PDF, DOCX, XLSX, CSV, images, emails, and plain text

Architecture

RAGFlow decouples data extraction from chunking (since v0.17.0), allowing independent selection of visual models for each processing task. The pipeline flows through ingestion, parsing, embedding, retrieval, and generation stages.
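The decoupling described above can be pictured with a small configuration sketch. This is not RAGFlow's actual configuration schema: the class names, field names, and model identifiers below are all hypothetical, chosen only to show how the OCR, TSR, and DLR choices can change without touching the chunking settings.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class ExtractionConfig:
    """Hypothetical per-task visual model choices (placeholder names)."""
    ocr_model: str = "ocr-default"
    tsr_model: str = "tsr-default"
    dlr_model: str = "dlr-default"


@dataclass(frozen=True)
class ChunkingConfig:
    """Hypothetical chunking settings, kept separate from extraction."""
    chunk_method: str = "naive"


@dataclass(frozen=True)
class PipelineConfig:
    extraction: ExtractionConfig = field(default_factory=ExtractionConfig)
    chunking: ChunkingConfig = field(default_factory=ChunkingConfig)


def with_tsr_model(config: PipelineConfig, model_name: str) -> PipelineConfig:
    """Return a copy with a different TSR model; chunking is left unchanged."""
    return replace(config, extraction=replace(config.extraction, tsr_model=model_name))


base = PipelineConfig()
tuned = with_tsr_model(base, "tsr-high-accuracy")
print(tuned.extraction.tsr_model, tuned.chunking.chunk_method)
```

Because the two configuration objects are independent, swapping a visual model for accuracy or speed does not force a change to the chunking strategy.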

Parser Layer – DeepDoc for advanced PDF handling, PlainParser for text, VisionParser for image-heavy documents
Extraction Layer – Separate OCR, TSR, and DLR models that can be independently configured
Chunking Layer – Structure-preserving strategies that maintain table rows with headers as self-contained units
Retrieval Layer – Hybrid search combining vector similarity with structural context from table of contents
Generation Layer – LLM-based answer generation with source attribution
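As an illustration of the structure-preserving chunking mentioned in the Chunking Layer, the sketch below turns each table row into a self-contained chunk by pairing every cell with its column header. It is a simplified stand-in and not RAGFlow's implementation; the function name and the sample table values are made up for the example.

```python
def chunk_table_rows(headers: list[str], rows: list[list[str]]) -> list[str]:
    """Build one chunk per row, repeating the headers so each chunk stands alone."""
    chunks = []
    for row in rows:
        pairs = [f"{header}: {value}" for header, value in zip(headers, row)]
        chunks.append("; ".join(pairs))
    return chunks


# Made-up sample table, used only to show the output shape
headers = ["Quarter", "Revenue"]
rows = [["Q2", "1.1M"], ["Q3", "1.4M"]]
chunks = chunk_table_rows(headers, rows)
for chunk in chunks:
    print(chunk)
```

A retrieved chunk such as "Quarter: Q3; Revenue: 1.4M" remains interpretable without the rest of the table, which is the point of keeping headers attached to rows.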

graph TB
    subgraph Input["Document Input"]
        PDF[PDF Documents]
        DOCX[Word / Excel]
        IMG[Images]
        TXT[Text / Email]
    end
    subgraph Parsing["Deep Document Parsing"]
        DeepDoc[DeepDoc Engine]
        OCR[OCR Module]
        TSR[Table Structure Recognition]
        DLR[Layout Recognition]
    end
    subgraph Processing["Processing Pipeline"]
        Chunk[Chunking Engine]
        TOC[TOC Extraction]
        Embed[Embedding Generator]
    end
    subgraph Storage["Storage Layer"]
        VDB[(Vector Database)]
        Meta[(Metadata Store)]
    end
    subgraph Query["Query Pipeline"]
        Retrieve[Hybrid Retrieval]
        Rerank[Reranking]
        Generate[LLM Generation]
    end
    Input --> Parsing
    DeepDoc --> OCR
    DeepDoc --> TSR
    DeepDoc --> DLR
    Parsing --> Processing
    Processing --> Storage
    Storage --> Query
    TOC --> Retrieve
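One way to picture the hybrid retrieval step, where vector similarity is combined with structural context from the table of contents, is a weighted blend of two scores. The sketch below is an assumption for illustration only: the weighting scheme, the `alpha` parameter, and the term-overlap measure are not taken from RAGFlow, which may combine these signals quite differently.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toc_overlap(query_terms: set[str], section_path: list[str]) -> float:
    """Fraction of query terms that appear in the chunk's TOC heading path."""
    if not query_terms:
        return 0.0
    path_terms = {word.lower() for title in section_path for word in title.split()}
    return len(query_terms & path_terms) / len(query_terms)


def hybrid_score(query_vec, chunk_vec, query_terms, section_path, alpha=0.7):
    """Blend vector similarity with TOC context; alpha is a hypothetical weight."""
    return alpha * cosine(query_vec, chunk_vec) + (1 - alpha) * toc_overlap(query_terms, section_path)


query_vec = [1.0, 0.0]
query_terms = {"q3", "revenue"}
in_section = hybrid_score(query_vec, [1.0, 0.0], query_terms, ["Financial Results", "Q3 Revenue"])
off_section = hybrid_score(query_vec, [1.0, 0.0], query_terms, ["Appendix", "Glossary"])
print(in_section > off_section)
```

Two chunks with identical embeddings are separated here only by where they sit in the document outline, which is the kind of signal a table of contents can add to plain vector search.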

Document Parsing Details

RAGFlow's parsing capabilities are its core differentiator:

PDF Parsing – The most robust capability; DeepDoc handles layout analysis, figure extraction, and auto-detects rotation (90/180/270 degrees) via OCR
Table Extraction – Identifies complex tables including single-column and borderless layouts, outputting HTML/Markdown to preserve structure and relationships
Spreadsheets – RAGFlowExcelParser with openpyxl/pandas extracts cell values while retaining structure as HTML
DOCX – Uses python-docx for text and table extraction
Images – OCR and optional Vision Language Model processing
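The rotation handling described for PDF parsing can be sketched as trying each candidate orientation and keeping the one that scores best. This is a toy model and not DeepDoc's code: the page is a small grid of 0/1 pixels, and `top_row_ink` is a hypothetical stand-in for the OCR confidence a real system would use as its score.

```python
from typing import Callable

Grid = list[list[int]]


def rotate90(grid: Grid) -> Grid:
    """Rotate a grid clockwise by 90 degrees."""
    return [list(row) for row in zip(*grid[::-1])]


def detect_rotation(page: Grid, score: Callable[[Grid], float]) -> int:
    """Return the clockwise correction (0/90/180/270) that maximizes the score."""
    best_angle, best_score = 0, score(page)
    candidate = page
    for angle in (90, 180, 270):
        candidate = rotate90(candidate)
        current = score(candidate)
        if current > best_score:
            best_angle, best_score = angle, current
    return best_angle


def top_row_ink(grid: Grid) -> float:
    """Hypothetical stand-in for OCR confidence: amount of ink in the top row."""
    return float(sum(grid[0]))


upright = [[1, 1], [0, 0]]      # toy page with its "text" along the top edge
scanned = rotate90(upright)     # simulate a page scanned sideways
correction = detect_rotation(scanned, top_row_ink)
print(correction)
```

In a real pipeline the scoring function would run OCR on each orientation, which is more expensive, so the four-way comparison is typically done on a sample of the page rather than the full document.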

Code Example

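import requests

RAGFLOW_API = "http://localhost:9380/api/v1"
API_KEY = "ragflow-your-api-key"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

# Create a knowledge base (dataset)
dataset = requests.post(f"{RAGFLOW_API}/datasets",
    headers=HEADERS,
    json={"name": "technical_docs", "chunk_method": "naive"}
).json()

dataset_id = dataset["data"]["id"]

# Upload a document (multipart upload, so no JSON Content-Type header)
with open("complex_report.pdf", "rb") as f:
    upload = requests.post(
        f"{RAGFLOW_API}/datasets/{dataset_id}/documents",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": f}
    ).json()

# Start parsing the uploaded document. Parsing runs asynchronously and must
# finish before the document can be queried (status polling omitted here).
# The endpoints and payloads from this point on are assumptions based on the
# RAGFlow HTTP API reference; check them against your installed version.
document_ids = [doc["id"] for doc in upload["data"]]
requests.post(f"{RAGFLOW_API}/datasets/{dataset_id}/chunks",
    headers=HEADERS,
    json={"document_ids": document_ids})

# Create a chat assistant bound to the dataset
chat = requests.post(f"{RAGFLOW_API}/chats",
    headers=HEADERS,
    json={"name": "technical_docs_chat", "dataset_ids": [dataset_id]}
).json()
chat_id = chat["data"]["id"]

# Query the knowledge base with RAG through the chat assistant
answer = requests.post(f"{RAGFLOW_API}/chats/{chat_id}/completions",
    headers=HEADERS,
    json={"question": "What were the Q3 revenue figures?", "stream": False}
).json()
print(answer["data"]["answer"])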
