Firecrawl

Firecrawl is an API-first web scraping and crawling platform developed by Mendable that converts websites into clean, LLM-ready markdown or structured data. Unlike traditional scrapers that return raw HTML, Firecrawl handles JavaScript rendering, pagination, anti-bot measures, and content cleaning automatically. With over 97,000 GitHub stars, backing from Y Combinator, and a $14.5M Series A, it has become widely used infrastructure for RAG pipelines, AI agents, and data extraction workflows.

Architecture

Firecrawl operates as a cloud API service with multiple operational modes. Its core engine combines:

Headless Chromium Rendering: Full JavaScript execution for dynamic single-page applications
Fire-Engine Technology: Proprietary rendering pipeline that Firecrawl reports as 33% faster and 40% more successful than standard headless browsers
Anti-Bot Handling: Built-in stealth techniques including proxy rotation, fingerprint randomization, and CAPTCHA handling
Content Cleaning: Automatic removal of navigation, ads, and boilerplate, preserving only meaningful content
Output Formatting: Conversion of cleaned content to markdown, HTML, JSON, or structured data with metadata
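
To make the content-cleaning stage concrete, here is a minimal sketch, using only Python's standard library, of dropping boilerplate elements before text is handed on for formatting. It illustrates the general technique and is not Firecrawl's actual pipeline; the tag list `BOILERPLATE_TAGS`, the class `MainTextExtractor`, and the helper `clean_html` are names assumed for this example.

```python
from html.parser import HTMLParser

# Assumed list of elements treated as boilerplate (hypothetical, for illustration).
BOILERPLATE_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside a boilerplate element."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a boilerplate element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def clean_html(html: str) -> str:
    """Return the non-boilerplate text of an HTML document, one chunk per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<main><h1>Guide</h1><p>Real content.</p></main>"
    "<footer>Copyright</footer></body></html>"
)
cleaned = clean_html(sample)
print(cleaned)
```

A production cleaner would also need heuristics for ads and cookie banners, which rarely sit in semantically named tags; this sketch only covers the tag-based case.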

Operational Modes

Firecrawl provides four primary modes for different use cases:

Scrape Mode

Extract content from a single URL. Returns clean markdown with metadata (title, description, Open Graph tags, robots directives).
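
At the HTTP level, a scrape is a single authenticated POST. The sketch below builds such a request with the standard library without sending it. The endpoint URL and payload shape are assumptions based on the hosted v1 REST API and should be checked against the current API reference; `build_scrape_request` is a hypothetical helper.

```python
import json
import urllib.request

# Assumed endpoint for the hosted v1 REST API; verify against the current docs.
SCRAPE_ENDPOINT = "https://api.firecrawl.dev/v1/scrape"


def build_scrape_request(url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request asking for markdown output."""
    payload = {"url": url, "formats": ["markdown"]}
    return urllib.request.Request(
        SCRAPE_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_scrape_request("https://example.com", "fc-YOUR_API_KEY")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To send it: urllib.request.urlopen(request), then json.load() the response body.
```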

Crawl Mode

Recursively discover and process all accessible subpages from a starting URL. Supports limit, excludePaths, includePaths, and depth controls. No sitemap required.
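
The way these controls interact can be sketched with a breadth-first traversal over a small in-memory link graph. This is an illustration of the idea, not Firecrawl's crawler: the `LINKS` graph is made up, and treating the include and exclude patterns as regular expressions applied with `re.search` is an assumption of this example.

```python
import re
from collections import deque

# Hypothetical in-memory site: path -> paths linked from that page.
LINKS = {
    "/": ["/docs", "/blog", "/search?q=x"],
    "/docs": ["/docs/intro", "/docs/api"],
    "/docs/intro": ["/docs/api"],
    "/docs/api": [],
    "/blog": ["/blog/post-1"],
    "/blog/post-1": [],
}


def allowed(path, include_paths, exclude_paths):
    """Apply exclude patterns first, then include patterns (if any were given)."""
    if any(re.search(p, path) for p in exclude_paths):
        return False
    return not include_paths or any(re.search(p, path) for p in include_paths)


def crawl(start, limit, max_depth, include_paths=(), exclude_paths=()):
    """Breadth-first traversal bounded by page count and link depth."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue and len(visited) < limit:
        path, depth = queue.popleft()
        visited.append(path)
        if depth == max_depth:
            continue  # do not follow links beyond the depth cap
        for nxt in LINKS.get(path, []):
            if nxt not in seen and allowed(nxt, include_paths, exclude_paths):
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited


pages = crawl("/", limit=4, max_depth=2, exclude_paths=[r"^/search", r"^/blog"])
print(pages)
```

In this sketch the starting page is always visited, and the filters apply only to links discovered from it.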

Map Mode

Generate a complete site map of all discoverable URLs without extracting content. Useful for planning targeted scrapes.

Extract Mode

LLM-powered structured data extraction using schemas or natural language prompts. Define a Zod/Pydantic schema and Firecrawl returns typed JSON matching your specification.
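
As a rough illustration of schema-driven extraction, the sketch below writes a schema as plain JSON Schema (the kind of document a Pydantic or Zod model is typically converted to) and checks a response against its required fields. The schema, the sample response, and the `missing_required` helper are all hypothetical; a real project would more likely use a full validator.

```python
import json

# Hypothetical schema for a pricing page, written as plain JSON Schema.
PRICING_SCHEMA = {
    "type": "object",
    "properties": {
        "tiers": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "string"},
                    "features": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["name", "price"],
            },
        }
    },
    "required": ["tiers"],
}


def missing_required(schema, value, path="$"):
    """Return paths of required keys absent from value (objects and arrays only)."""
    problems = []
    if schema.get("type") == "object" and isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                problems.append(f"{path}.{key}")
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                problems += missing_required(sub, value[key], f"{path}.{key}")
    elif schema.get("type") == "array" and isinstance(value, list):
        for i, item in enumerate(value):
            problems += missing_required(schema.get("items", {}), item, f"{path}[{i}]")
    return problems


# Made-up response, shaped like what an extraction call might return.
response = json.loads('{"tiers": [{"name": "Free", "price": "$0"}, {"name": "Pro"}]}')
problems = missing_required(PRICING_SCHEMA, response)
print(problems)
```

Checking the returned JSON before using it is worthwhile because LLM-based extraction can omit fields that the page does not state clearly.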

LLM-Ready Output

Firecrawl's primary value proposition is producing data optimized for LLM consumption:

Clean markdown preserves document structure (headings, lists, tables, code blocks)
Metadata includes page title, description, language, and source URL
Noise removal strips navigation, footers, ads, and cookie banners
Batch processing handles thousands of URLs concurrently
Direct integration with RAG frameworks like LangChain and LlamaIndex
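
Because the markdown keeps its heading structure, it can be split into retrieval-sized chunks along those headings. The following is a minimal sketch of that idea for a RAG pipeline, assuming ATX-style (`#`) headings; `split_by_heading` and the chunk dictionary layout are hypothetical and not part of Firecrawl or any framework.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # marker for fenced code blocks


def split_by_heading(markdown, source_url):
    """Split markdown into one chunk per heading, tagging each with its source."""
    chunks, heading, lines, in_code = [], None, [], False

    def flush():
        text = "\n".join(lines).strip()
        if heading is not None or text:
            chunks.append({"heading": heading, "text": text, "source": source_url})

    for line in markdown.splitlines():
        if line.startswith(FENCE):
            in_code = not in_code  # ignore '#' lines inside fenced code
        match = None if in_code else HEADING.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


doc = "\n".join([
    "# Intro",
    "Hello.",
    "",
    "## Setup",
    FENCE,
    "# not a heading",
    FENCE,
    "Done.",
])
chunks = split_by_heading(doc, "https://example.com/guide")
for chunk in chunks:
    print(chunk["heading"], len(chunk["text"]))
```

Keeping the source URL on every chunk lets a retriever cite the page a passage came from.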

Code Example

from firecrawl import FirecrawlApp

# Note: the params={...} call style below matches older releases of the
# firecrawl-py SDK. Newer releases may take keyword arguments and return
# objects instead of dicts, so check the version you have installed.

# Initialize with API key
app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Scrape a single page to markdown
result = app.scrape_url(
    "https://docs.python.org/3/tutorial/",
    params={"formats": ["markdown"]}
)
print(result["markdown"][:500])  # first 500 characters only

# Crawl an entire site, capped at 50 pages
crawl = app.crawl_url(
    "https://docs.python.org/3/tutorial/",
    params={
        "limit": 50,
        "scrapeOptions": {"formats": ["markdown"]},
        "excludePaths": ["/genindex*", "/search*"]
    }
)
for page in crawl["data"]:
    metadata = page.get("metadata", {})
    print(f"URL: {metadata.get('sourceURL')}")
    print(f"Title: {metadata.get('title')}")  # some pages have no title
    print(f"Content length: {len(page.get('markdown', ''))} chars")
    print("---")

# Extract structured data with a natural language prompt
extracted = app.scrape_url(
    "https://example.com/pricing",
    params={
        "formats": ["extract"],
        "extract": {
            "prompt": "Extract all pricing tiers with name, price, and features list"
        }
    }
)
print(extracted["extract"])

Integration Ecosystem

Firecrawl integrates with major AI frameworks and tools:

LangChain: FireCrawlLoader, a document loader supporting scrape and crawl modes
LlamaIndex: Direct document ingestion for RAG pipelines
MCP (Model Context Protocol): firecrawl-mcp-server for Claude, Cursor, and other MCP clients
CLI: npx firecrawl-cli for command-line usage and agent configuration

Architecture Diagram

graph TD
    A["Your App / Agent"] --> B["Firecrawl API"]
    B --> C["Scrape Engine"]
    B --> D["Crawl Engine"]
    C --> E["Fire-Engine Renderer (Headless Chromium + JS)"]
    D --> E
    E --> F["Content Cleaning & Markdown Conversion"]
    F --> G["LLM-Ready Output (Markdown / JSON)"]

Pricing

Tier	Pages/Month	Price	Key Features
Free	500	$0	Scrape, crawl, basic extract
Starter	3,000	$16/mo	Priority support
Standard	100,000	$83/mo	Batch scrape, webhooks
Scale	500,000	$333/mo	Dedicated infrastructure
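
To compare tiers on a like-for-like basis, the monthly price can be divided by the page allowance. The short calculation below does this using only the figures from the table above; the `cost_per_thousand` helper is a name chosen for this example, and the result assumes the full monthly allowance is used.

```python
# Figures copied from the pricing table: (tier, pages per month, monthly price in USD).
TIERS = [
    ("Free", 500, 0),
    ("Starter", 3_000, 16),
    ("Standard", 100_000, 83),
    ("Scale", 500_000, 333),
]


def cost_per_thousand(pages: int, price: float) -> float:
    """Monthly price divided by the page allowance, scaled to 1,000 pages."""
    return round(price / pages * 1000, 2)


unit_costs = {name: cost_per_thousand(pages, price) for name, pages, price in TIERS}
for name, cost in unit_costs.items():
    print(f"{name}: ${cost:.2f} per 1,000 pages")
```

On these figures the effective cost falls from about $5.33 per 1,000 pages on Starter to about $0.83 on Standard and $0.67 on Scale.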
