WebGPT: Browser-Assisted Question Answering with Human Feedback

WebGPT is a pioneering web-browsing QA system developed by Nakano et al. (2021) at OpenAI. It fine-tunes GPT-3 to answer long-form questions by interacting with a text-based web browser, and optimizes its responses through reinforcement learning from human feedback (RLHF). With 1,720 citations, it established foundational techniques for retrieval-augmented generation and web-browsing agents that influenced subsequent systems, including ChatGPT's browsing capabilities.

arXiv:2112.09332

Text-Based Web Browser Interface

WebGPT operates through a custom text-based browser environment that presents web pages as simplified text. The model issues structured commands:

  • Search [query]: Execute a Bing search query
  • Click [link]: Navigate to a specific link on the current page
  • Scroll [direction]: Move up or down the current page
  • Quote [text]: Extract and save a passage as a reference
  • End: Terminate browsing and compose the final answer

At each step, the model receives a summary of the current browser state (page content, URL, scroll position) and selects the next action. The browsing session continues until the model issues an End command, reaches a maximum number of actions, or exceeds the total reference length.
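The state summary described above can be sketched as a plain-text serialization. This is a hypothetical illustration of the idea, not the paper's exact prompt format; the field names, layout, and `render_state` function are all assumptions.

```python
# Hypothetical sketch of the plain-text browser state the policy observes at
# each step. Field names and layout are illustrative, not the paper's format.
def render_state(question, url, title, scroll_pos, page_text, quotes, actions_left):
    """Serialize the current browser state into the text the model reads."""
    header = (
        f"Question: {question}\n"
        f"Page: {title} ({url})\n"
        f"Scroll position: {scroll_pos}\n"
        f"Quotes collected: {len(quotes)} | Actions left: {actions_left}\n"
    )
    return header + "-" * 40 + "\n" + page_text

state = render_state(
    question="How do vaccines work?",
    url="https://en.wikipedia.org/wiki/Vaccine",
    title="Vaccine - Wikipedia",
    scroll_pos=2,
    page_text="A vaccine is a biological preparation that provides immunity ...",
    quotes=[],
    actions_left=15,
)
```

Presenting the state as flat text is what lets an unmodified language model act as the browsing policy: choosing the next command reduces to ordinary text generation.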

RLHF Training Pipeline

The training process combines two complementary approaches:

Imitation Learning (Behavior Cloning)

Human demonstrators use the same text-based browser to answer questions, generating supervised training data of (question, browsing trajectory, answer) triples.
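Behavior cloning then fits the policy by maximizing the likelihood of the demonstrated actions. A minimal sketch of that objective, assuming a toy tabular policy as a stand-in for GPT-3 fine-tuning on tokenized trajectories:

```python
import math

# Behavior-cloning sketch: minimize the negative log-likelihood of each
# demonstrated action. In WebGPT this is GPT-3 fine-tuning on tokenized
# trajectories; the dictionary "policy" here is a toy stand-in.
def bc_loss(policy_probs, demonstration):
    """Negative log-likelihood of one human trajectory.

    policy_probs: maps (state, action) -> probability under the policy.
    demonstration: list of (state, action) pairs from one human episode.
    """
    return -sum(math.log(policy_probs[(s, a)]) for s, a in demonstration)

# Toy two-step trajectory: search, then quote.
probs = {("s0", "search"): 0.5, ("s1", "quote"): 0.25}
loss = bc_loss(probs, [("s0", "search"), ("s1", "quote")])  # -(ln 0.5 + ln 0.25)
```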

Reward Modeling from Human Preferences

A reward model $R_\phi$ is trained on pairwise human comparisons:

$$\mathcal{L}(\phi) = -\mathbb{E}_{(y_w, y_l) \sim \mathcal{D}} \left[\log \sigma\left(R_\phi(y_w) - R_\phi(y_l)\right)\right]$$

where $y_w$ is the human-preferred answer and $y_l$ is the rejected one. The policy is then optimized via PPO to maximize this learned reward while maintaining proximity to the behavior-cloned policy through a KL penalty:

$$\max_\theta \mathbb{E}_{y \sim \pi_\theta} \left[R_\phi(y)\right] - \beta \, \text{KL}(\pi_\theta \| \pi_{\text{BC}})$$
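The KL-penalized objective above can be illustrated with a per-sample shaped reward, where the log-probability difference serves as a single-sample estimate of the KL term. The function name and the `beta` value are illustrative, not the paper's:

```python
# Sketch of the KL-shaped per-sample reward used during PPO: the reward-model
# score minus a penalty for drifting from the behavior-cloned policy. The
# log-prob difference is a one-sample estimate of KL(pi_theta || pi_BC).
def shaped_reward(r_phi, logp_theta, logp_bc, beta=0.02):
    """r_phi: reward-model score R_phi(y) for sampled answer y;
    logp_theta / logp_bc: log pi_theta(y) and log pi_BC(y);
    beta: KL penalty coefficient."""
    return r_phi - beta * (logp_theta - logp_bc)

# If the current policy assigns the sample much higher log-prob than the BC
# policy does, the penalty grows and pulls the effective reward down.
r = shaped_reward(r_phi=1.0, logp_theta=-10.0, logp_bc=-12.0, beta=0.5)
# 1.0 - 0.5 * ((-10.0) - (-12.0)) = 0.0
```

The penalty keeps the optimized policy from exploiting weaknesses in the learned reward by drifting far from behavior the reward model was trained to judge.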

System Architecture

graph TD
  A[User Question] --> B[GPT-3 Agent]
  B --> C{Select Action}
  C --> D[Search via Bing API]
  C --> E[Click Link]
  C --> F[Scroll Page]
  C --> G[Quote Extract]
  C --> H[End Browsing]
  D --> I[Browser State Update]
  E --> I
  F --> I
  G --> J[Reference Store]
  I --> B
  H --> K[Compose Answer with References]
  J --> K
  K --> L[Final Answer]
  M[Human Demonstrations] --> N[Behavior Cloning]
  O[Human Preference Pairs] --> P[Reward Model]
  N --> Q[Initial Policy]
  P --> R[PPO Optimization]
  Q --> R
  R --> B

Code Example

# Simplified WebGPT browsing loop
import torch

ACTIONS = ["search", "click", "scroll_up", "scroll_down", "quote", "end"]

def webgpt_answer(question, policy_model, browser, max_steps=20):
    """Run one browsing episode, then compose an answer from saved quotes."""
    browser.reset()
    references = []
    context = f"Question: {question}\n"

    for step in range(max_steps):
        browser_state = browser.get_state_summary()
        action, param = policy_model.predict(context + browser_state)

        if action == "search":
            browser.search(param)
        elif action == "click":
            browser.click(param)
        elif action in ("scroll_up", "scroll_down"):
            browser.scroll(action)
        elif action == "quote":
            ref = browser.extract_quote(param)
            references.append(ref)
        elif action == "end":
            break

        context += f"Action: {action} {param}\n"
    return policy_model.compose_answer(question, references)

def train_reward_model(reward_model, optimizer, preference_pairs):
    """One pass of the pairwise preference loss from the equation above."""
    for answer_w, answer_l in preference_pairs:
        reward_w = reward_model(answer_w)  # score of preferred answer
        reward_l = reward_model(answer_l)  # score of rejected answer
        loss = -torch.nn.functional.logsigmoid(reward_w - reward_l)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Key Results

  • Best model preferred over human demonstrator answers 56% of the time on ELI5
  • Preferred over highest-voted Reddit answers 69% of the time
  • Factually accurate 75% of the time; both true and informative 54%
  • Significantly reduces hallucinations compared to unaided GPT-3
  • Demonstrates that web browsing + RLHF is a viable path to factual QA
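The strongest results reported in the paper combine a fine-tuned model with rejection sampling (best-of-n) against the reward model. A minimal sketch of that inference-time technique, using toy stand-in functions:

```python
# Best-of-n (rejection) sampling sketch: draw several candidate answers and
# keep the one the reward model scores highest. The sampler and reward here
# are toy stand-ins for the fine-tuned policy and learned reward model.
def best_of_n(question, sample_answer, reward_model, n=4):
    candidates = [sample_answer(question) for _ in range(n)]
    return max(candidates, key=reward_model)

# Toy usage: a stub sampler cycling through canned answers, and answer
# length as a stand-in reward signal.
answers = iter(["short answer", "a much more detailed answer", "ok answer"])
best = best_of_n("Why is the sky blue?", lambda q: next(answers),
                 reward_model=len, n=3)
# best == "a much more detailed answer"
```

Unlike PPO, best-of-n needs no further training: it spends extra compute at inference time to let the reward model pick among samples.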
