====== WebGPT: Browser-Assisted Question Answering with Human Feedback ======
**WebGPT** is a pioneering web-browsing QA system developed by Nakano et al. (2021) at **OpenAI** that fine-tunes GPT-3 to answer long-form questions by interacting with a text-based web browser and optimizing responses through reinforcement learning from human feedback (RLHF).(([[https://arxiv.org/abs/2112.09332|Nakano et al. "WebGPT: Browser-assisted question-answering with human feedback" (2021)]])) With **1,720 citations**, it established foundational techniques for retrieval-augmented generation and web-browsing agents that influenced subsequent systems including ChatGPT's browsing capabilities.
[[https://arxiv.org/abs/2112.09332|arXiv:2112.09332]]
===== Text-Based Web Browser Interface =====
WebGPT operates through a custom text-based browser environment that presents web pages as simplified text. The model issues structured commands:
* **Search [query]**: Execute a Bing search query
* **Click [link]**: Navigate to a specific link on the current page
* **Scroll [direction]**: Move up or down the current page
* **Quote [text]**: Extract and save a passage as a reference
* **End**: Terminate browsing and compose the final answer
At each step, the model receives a summary of the current browser state (page content, URL, scroll position) and selects the next action. The browsing session continues until the model issues an End command, reaches the maximum number of actions, or the collected quotes reach their maximum total length.
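The command set above can be sketched as a small parser. The exact text format the model emits is not specified here, so the bracketed-argument syntax and the `parse_command` helper below are assumptions for illustration:

<code python>
import re

# Hypothetical grammar: an action name optionally followed by a
# bracketed argument, e.g. "Search [who wrote Hamlet]" or "End".
COMMAND_RE = re.compile(r"^(Search|Click|Scroll|Quote|End)(?:\s*\[(.*)\])?$", re.DOTALL)

def parse_command(text):
    """Split a raw model-emitted command into (action, argument)."""
    match = COMMAND_RE.match(text.strip())
    if match is None:
        raise ValueError(f"Unrecognized command: {text!r}")
    action, arg = match.groups()
    return action.lower(), arg
</code>

For example, `parse_command("Scroll [down]")` yields `("scroll", "down")`, and argument-free commands such as `End` return `None` for the argument.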
===== RLHF Training Pipeline =====
The training process combines two complementary approaches:
==== Imitation Learning (Behavior Cloning) ====
Human demonstrators use the same text-based browser to answer questions, generating supervised training data of (question, browsing trajectory, answer) triples.(([[https://arxiv.org/abs/2203.02155|Ouyang et al. "Training language models to follow instructions with human feedback" (2022)]]))
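Behavior cloning on these triples is ordinary supervised next-token training on the serialized (question, trajectory, answer) sequences. A minimal PyTorch-style sketch, where `behavior_cloning_step` and the batch layout are illustrative assumptions rather than the paper's code:

<code python>
import torch
import torch.nn.functional as F

def behavior_cloning_step(policy, optimizer, batch):
    """One supervised step on tokenized human demonstrations.

    `batch` holds token ids of shape (batch, seq_len); the policy is
    trained to predict each next token, as in standard LM fine-tuning.
    """
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits = policy(inputs)  # (batch, seq_len - 1, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
</code>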
==== Reward Modeling from Human Preferences ====
A reward model $R_\phi$ is trained on pairwise human comparisons:
$$\mathcal{L}(\phi) = -\mathbb{E}_{(y_w, y_l) \sim \mathcal{D}} \left[\log \sigma\left(R_\phi(y_w) - R_\phi(y_l)\right)\right]$$
where $y_w$ is the human-preferred answer and $y_l$ is the rejected one. The policy is then optimized via PPO to maximize this learned reward while maintaining proximity to the behavior-cloned policy through a KL penalty:
$$\max_\theta \mathbb{E}_{y \sim \pi_\theta} \left[R_\phi(y)\right] - \beta \, \text{KL}(\pi_\theta \| \pi_{\text{BC}})$$
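Both objectives can be written out directly. In the sketch below, the helper names and the `beta` value are hypothetical, and the KL term is replaced by its per-sample Monte Carlo estimate $\log \pi_\theta(y) - \log \pi_{\text{BC}}(y)$:

<code python>
import math

def preference_loss(reward_w, reward_l):
    """Pairwise reward-model loss -log sigma(R(y_w) - R(y_l))."""
    return -math.log(1.0 / (1.0 + math.exp(-(reward_w - reward_l))))

def kl_shaped_reward(reward, logp_policy, logp_bc, beta=0.01):
    """Per-sample PPO objective R(y) - beta * (log pi_theta(y) - log pi_BC(y))."""
    return reward - beta * (logp_policy - logp_bc)
</code>

A tied comparison costs $\log 2$, and the shaped reward penalizes answers the current policy finds far more likely than the behavior-cloned policy does.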
===== System Architecture =====
<code>
graph TD
    A[User Question] --> B[GPT-3 Agent]
    B --> C{Select Action}
    C --> D[Search via Bing API]
    C --> E[Click Link]
    C --> F[Scroll Page]
    C --> G[Quote Extract]
    C --> H[End Browsing]
    D --> I[Browser State Update]
    E --> I
    F --> I
    G --> J[Reference Store]
    I --> B
    H --> K[Compose Answer with References]
    J --> K
    K --> L[Final Answer]
    M[Human Demonstrations] --> N[Behavior Cloning]
    O[Human Preference Pairs] --> P[Reward Model]
    N --> Q[Initial Policy]
    P --> R[PPO Optimization]
    Q --> R
    R --> B
</code>
===== Code Example =====
<code python>
import torch

# Simplified WebGPT browsing loop (illustrative, not the original code)
ACTIONS = ["search", "click", "scroll_up", "scroll_down", "quote", "end"]

def webgpt_answer(question, policy_model, browser, max_steps=20):
    browser.reset()
    references = []
    context = f"Question: {question}\n"
    for step in range(max_steps):
        browser_state = browser.get_state_summary()
        action, param = policy_model.predict(context + browser_state)
        if action == "search":
            browser.search(param)
        elif action == "click":
            browser.click(param)
        elif action in ("scroll_up", "scroll_down"):
            browser.scroll(action.removeprefix("scroll_"))
        elif action == "quote":
            ref = browser.extract_quote(param)
            references.append(ref)
        elif action == "end":
            break
        context += f"Action: {action} {param}\n"
    return policy_model.compose_answer(question, references)

def train_reward_model(reward_model, optimizer, preference_pairs):
    for answer_w, answer_l in preference_pairs:
        reward_w = reward_model(answer_w)
        reward_l = reward_model(answer_l)
        # Pairwise loss: -log sigma(R(y_w) - R(y_l))
        loss = -torch.nn.functional.logsigmoid(reward_w - reward_l)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
</code>
===== Key Results =====
* Best model preferred over human demonstrator answers **56% of the time** on ELI5(([[https://arxiv.org/abs/1907.09190|Fan et al. "ELI5: Long Form Question Answering" (2019)]]))
* Preferred over highest-voted Reddit answers **69% of the time**
* On TruthfulQA, answers are truthful **75% of the time** and both truthful and informative **54%** of the time
* Significantly reduces hallucinations compared to unaided GPT-3(([[https://arxiv.org/abs/2005.14165|Brown et al. "Language Models are Few-Shot Learners" (GPT-3, 2020)]]))
* Demonstrates that web browsing + RLHF is a viable path to factual QA
===== See Also =====
* [[palm_e|PaLM-E: Embodied Multimodal Language Model]]
* [[reasoning_via_planning|RAP: Reasoning via Planning]]
* [[expel_experiential_learning|ExpeL: Experiential Learning Agents]]
===== References =====