====== Firecrawl ======

**Firecrawl** is an API-first web scraping and crawling platform developed by [[https://mendable.ai|Mendable]] that converts any website into clean, LLM-ready markdown or structured data. Unlike traditional scrapers that return raw HTML, Firecrawl handles JavaScript rendering, pagination, anti-bot bypasses, and content cleaning automatically. With over 97,000 GitHub stars and backing from Y Combinator ($14.5M Series A), it has become essential infrastructure for RAG pipelines, AI agents, and data extraction workflows.

===== Architecture =====

Firecrawl operates as a cloud API service with multiple operational modes. Its core engine combines:

  * **Headless Chromium Rendering** — Full JavaScript execution for dynamic single-page applications
  * **Fire-Engine Technology** — Proprietary rendering pipeline delivering 33% faster speeds and 40% higher success rates than standard headless browsers
  * **Anti-Bot Handling** — Built-in stealth techniques including proxy rotation, fingerprint randomization, and CAPTCHA handling
  * **Content Cleaning** — Automatic removal of navigation, ads, and boilerplate, preserving only meaningful content
  * **Output Formatting** — Converts cleaned content to markdown, HTML, JSON, or structured data with metadata

===== Operational Modes =====

Firecrawl provides four primary modes for different use cases:

== Scrape Mode ==
Extract content from a single URL. Returns clean markdown with metadata (title, description, Open Graph tags, robots directives).

== Crawl Mode ==
Recursively discover and process all accessible subpages from a starting URL. Supports ''limit'', ''excludePaths'', ''includePaths'', and depth controls. No sitemap required.

== Map Mode ==
Generate a complete site map of all discoverable URLs without extracting content. Useful for planning targeted scrapes.

== Extract Mode ==
LLM-powered structured data extraction using schemas or natural language prompts. Define a Zod/Pydantic schema and Firecrawl returns typed JSON matching your specification.

===== LLM-Ready Output =====

Firecrawl's primary value proposition is producing data optimized for LLM consumption:

  * Clean markdown preserves document structure (headings, lists, tables, code blocks)
  * Metadata includes page title, description, language, and source URL
  * Noise removal strips navigation, footers, ads, and cookie banners
  * Batch processing handles thousands of URLs concurrently
  * Direct integration with RAG frameworks like LangChain and LlamaIndex

===== Code Example =====

<code python>
from firecrawl import FirecrawlApp

# Initialize with API key
app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Scrape a single page to markdown
result = app.scrape_url(
    "https://docs.python.org/3/tutorial/",
    params={"formats": ["markdown"]}
)
print(result["markdown"][:500])

# Crawl an entire site
crawl = app.crawl_url(
    "https://docs.python.org/3/tutorial/",
    params={
        "limit": 50,
        "scrapeOptions": {"formats": ["markdown"]},
        "excludePaths": ["/genindex*", "/search*"]
    }
)
for page in crawl["data"]:
    print(f"URL: {page['metadata']['sourceURL']}")
    print(f"Title: {page['metadata']['title']}")
    print(f"Content length: {len(page['markdown'])} chars")
    print("---")

# Extract structured data with a prompt
extracted = app.scrape_url(
    "https://example.com/pricing",
    params={
        "formats": ["extract"],
        "extract": {
            "prompt": "Extract all pricing tiers with name, price, and features list"
        }
    }
)
print(extracted["extract"])
</code>

===== Integration Ecosystem =====

Firecrawl integrates with major AI frameworks and tools:

  * **LangChain** — ''FireCrawlLoader'' document loader for scrape and crawl modes
  * **LlamaIndex** — Direct document ingestion for RAG pipelines
  * **MCP (Model Context Protocol)** — ''firecrawl-mcp-server'' for Claude, Cursor, and other MCP clients
  * **CLI** — ''npx firecrawl-cli'' for command-line usage and agent configuration

===== Architecture Diagram =====

<mermaid>
graph TD
    A["Your App / Agent"] --> B["Firecrawl API"]
    B --> C["Scrape Engine"]
    B --> D["Crawl Engine"]
    C --> E["Fire-Engine Renderer (Headless Chromium + JS)"]
    D --> E
    E --> F["Content Cleaning & Markdown Conversion"]
    F --> G["LLM-Ready Output (Markdown / JSON)"]
</mermaid>

===== Pricing =====

^ Tier ^ Pages/Month ^ Price ^ Key Features ^
| Free | 500 | $0 | Scrape, crawl, basic extract |
| Starter | 3,000 | $16/mo | Priority support |
| Standard | 100,000 | $83/mo | Batch scrape, webhooks |
| Scale | 500,000 | $333/mo | Dedicated infrastructure |

===== References =====

  * [[https://github.com/mendableai/firecrawl|Firecrawl GitHub Repository]]
  * [[https://docs.firecrawl.dev/|Firecrawl Documentation]]
  * [[https://www.firecrawl.dev/|Firecrawl Website]]
  * [[https://docs.langchain.com/oss/python/integrations/document_loaders/web_loaders/firecrawl|LangChain FireCrawl Loader]]

===== See Also =====

  * [[browser_use|Browser-Use]] — AI agent browser automation
  * [[composio|Composio]] — Tool integration platform for agents
  * [[e2b|E2B]] — Sandboxed code execution for agents