====== Web Browsing Agents ======
Web browsing agents are AI systems that autonomously navigate websites, interact with page elements, extract information, and complete multi-step web-based tasks. They combine large language models with browser automation frameworks so that pages are interpreted semantically rather than through brittle, hand-written CSS selectors, which marks a significant shift in how web automation is built.

===== How Web Browsing Agents Work =====
Web browsing agents operate by combining visual or DOM understanding with LLM reasoning:

  - **Page observation** — The agent receives a screenshot, accessibility tree, or DOM representation of the current page
  - **Action planning** — The LLM reasons about what action to take next (click, type, scroll, navigate)
  - **Action execution** — Browser automation (Playwright/Puppeteer) executes the planned action
  - **Result evaluation** — The agent observes the new page state and decides the next step
  - **Task completion** — The loop continues until the objective is achieved or the agent determines it cannot proceed
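
The action-planning step usually requires turning free-form model output into a structured action before anything touches the browser. The following is a minimal sketch of that validation, assuming the model is prompted to reply with a JSON object. The schema, the field names, and the helper ''parse_action'' are hypothetical and chosen only for illustration; they are not part of any particular framework.

<code python>
import json

# Hypothetical action schema: the fields each action type must carry.
REQUIRED_FIELDS = {
    "click": {"selector"},
    "type": {"selector", "text"},
    "scroll": {"direction"},
    "navigate": {"url"},
    "done": {"result"},
}

def parse_action(raw: str) -> dict:
    """Parse a JSON action emitted by the model and check its shape."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc

    if not isinstance(action, dict):
        raise ValueError("action must be a JSON object")

    kind = action.get("type")
    if not isinstance(kind, str) or kind not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action type: {kind!r}")

    missing = REQUIRED_FIELDS[kind] - set(action)
    if missing:
        raise ValueError(f"{kind} action is missing fields: {sorted(missing)}")
    return action
</code>

Rejecting malformed output at this point lets the loop ask the model to retry instead of passing an invalid selector or URL to the browser.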

===== Browser Automation Frameworks =====
| **Framework** | **Type** | **Key Feature** |
| [[https://playwright.dev|Playwright]] | Library (Microsoft) | Cross-browser, auto-wait, CDP access |
| [[https://pptr.dev|Puppeteer]] | Library ([[google|Google]]) | Chrome DevTools Protocol native |
| [[https://www.browserbase.com|Browserbase]] | Cloud infra | Managed sessions, anti-bot, persistent state |
| [[https://www.firecrawl.dev|Firecrawl]] | Data extraction | Natural language extraction, markdown output |
| [[https://www.hyperbrowser.ai|Hyperbrowser]] | Cloud infra | CAPTCHA solving, proxy rotation |

===== Agent Frameworks and Research =====
**Browser Use** is an open-source Python framework that connects LLMs to browser automation, providing a high-level API for agents to interact with web pages using natural language instructions.(([[https://github.com/nicklashansen/browser-use|Browser Use - Open Source Framework]]))

**Stagehand** by Browserbase provides an AI-native browser automation SDK where developers describe actions in natural language instead of writing selectors.(([[https://www.browserbase.com|Browserbase - Cloud Browser Infrastructure]]))

**[[webvoyager|WebVoyager]]** is a research agent from academia that demonstrates end-to-end web task completion using vision-language models to understand screenshots and plan actions.

**Mind2Web** provides a benchmark dataset of over 2,000 web tasks across 137 real websites, used to evaluate how well agents generalize across diverse web interfaces.

**[[google|Google]] Chrome Skills** enables users to turn Gemini prompts into reusable browser workflows without coding. Skills allow saving prompts as one-click actions that interact with the current page or tabs, with a library of pre-made Skills available for immediate use.(([[https://www.latent.space/p/ainews-humanitys-last-gasp|Latent Space - AI News]])) This represents a form of lightweight end-user agentization that brings agentic capabilities directly into the browser environment.

===== Example: Browser Agent with Playwright =====
<code python>
from playwright.async_api import async_playwright

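async def browser_agent(task: str, llm_client, start_url: str = "https://example.com", max_steps: int = 10):
    """Run an observe-plan-act loop until the LLM reports completion.

    llm_client.decide_action is an illustrative interface, not a real library API;
    it is expected to return a dict such as {"type": "click", "selector": "..."}.
    """
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            await page.goto(start_url)

            for step in range(max_steps):
                # Capture page state for the LLM
                title = await page.title()
                content = await page.inner_text("body")
                screenshot = await page.screenshot()  # PNG bytes

                # Ask LLM to decide next action
                action = llm_client.decide_action(
                    task=task,
                    page_title=title,
                    page_content=content[:4000],  # truncate to limit prompt size
                    screenshot=screenshot,
                )

                if action["type"] == "click":
                    await page.click(action["selector"])
                elif action["type"] == "fill":
                    await page.fill(action["selector"], action["value"])
                elif action["type"] == "navigate":
                    await page.goto(action["url"])
                elif action["type"] == "done":
                    return action["result"]
                else:
                    raise ValueError(f"Unknown action type: {action['type']!r}")

            return None  # step budget exhausted without a "done" action
        finally:
            # Close the browser on every exit path, including the early return
            await browser.close()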
</code>
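
To exercise a loop like the one above without calling a real model, a scripted stand-in can replay a fixed list of actions. The sketch below assumes the same hypothetical ''decide_action'' interface used in the example; ''ScriptedLLMClient'' is an illustrative name, not part of Playwright or any agent framework.

<code python>
class ScriptedLLMClient:
    """Hypothetical stand-in that replays a fixed list of actions."""

    def __init__(self, actions):
        self._actions = list(actions)
        self.calls = []  # what the loop sent on each step, for inspection

    def decide_action(self, task, page_title, page_content, screenshot):
        self.calls.append({"task": task, "page_title": page_title})
        if not self._actions:
            # Script exhausted: tell the loop to stop.
            return {"type": "done", "result": None}
        return self._actions.pop(0)
</code>

With Playwright and a Chromium build installed, something like ''asyncio.run(browser_agent("open the docs", ScriptedLLMClient([{"type": "done", "result": "ok"}])))'' should run the loop end to end against the scripted actions.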

===== Consumer Agentic Browsers =====
Several full browsers with integrated AI agents launched in 2025-2026:

  * **[[perplexity_ai|Perplexity]] Comet** — Chromium-based browser with an autonomous assistant that navigates sites, fills forms, and manages tasks with voice control
  * **ChatGPT Atlas** — [[openai|OpenAI]]'s browser with Agent Mode for autonomous task completion and context-aware sidebar
  * **Genspark** — AI browser with a Super Agent for hands-free execution and deep search that crawls 10-15 pages per task(([[https://www.firecrawl.dev/blog/best-browser-agents|Firecrawl - Best Browser Agents]]))
  * **Sigma** — Privacy-first browser running its AI assistant locally without cloud dependency

===== Key Challenges =====
  * **Dynamic content** — Single-page apps and JavaScript-heavy sites require waiting and re-evaluation
  * **Anti-bot measures** — CAPTCHAs, rate limits, and fingerprinting block automated access
  * **Action grounding** — Translating LLM decisions into precise DOM interactions remains error-prone
  * **Safety** — Agents with web access can inadvertently submit forms, make purchases, or leak data, so browsing capability widens both the attack surface and the scope for unintended interactions.(([[https://tldr.tech/ai/2026-04-14|TLDR AI - Web Browsing Capability (2026)]]))
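
One common mitigation for the safety concern is to review each planned action before it runs. The following is a rough sketch under stated assumptions: the host allowlist, the keyword list, and the ''review_action'' helper are all hypothetical, and substring matching on selectors is a crude heuristic rather than a reliable classifier.

<code python>
from urllib.parse import urlparse

# Assumptions for illustration: hosts the agent may visit, and selector
# keywords that suggest an irreversible or costly action.
ALLOWED_HOSTS = {"example.com", "www.example.com"}
SENSITIVE_KEYWORDS = ("buy", "purchase", "checkout", "pay", "delete", "submit")

def review_action(action: dict, confirm=lambda action: False) -> bool:
    """Return True if the action may run, False if it should be blocked.

    confirm is a callback (for example a human prompt) consulted for
    sensitive clicks; by default it declines.
    """
    kind = action.get("type")

    if kind == "navigate":
        host = urlparse(action.get("url", "")).hostname or ""
        return host in ALLOWED_HOSTS

    if kind == "click":
        selector = str(action.get("selector", "")).lower()
        if any(word in selector for word in SENSITIVE_KEYWORDS):
            return bool(confirm(action))

    return True
</code>

A gate like this would sit between the planning and execution steps, so that a declined action is reported back to the model instead of being performed.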

===== See Also =====
  * [[browsing_agent|Browsing Agent]]
  * [[simple_browse_vs_research_agents|Simple Browse Agents vs Full-Stack Research Agents]]
  * [[computer_use_agents|Computer Use Agents]]
  * [[browsecomp_k2_6_swarm_vs_base|BrowseComp: Kimi K2.6 with Agent Swarm vs Base]]
  * [[web_skill_extraction|Web Skill Extraction]]

===== References =====