====== WebGPT: Browser-Assisted Question Answering with Human Feedback ======
**WebGPT** is a pioneering web-browsing QA system developed by Nakano et al. (2021) at **OpenAI** that fine-tunes GPT-3 to answer long-form questions by interacting with a text-based web browser and optimizing responses through reinforcement learning from human feedback (RLHF).(([[https://arxiv.org/abs/2112.09332|Nakano et al. "WebGPT: Browser-assisted question-answering with human feedback" (2021)]])) With **1,720 citations**, it established foundational techniques for retrieval-augmented generation and web-browsing agents that influenced subsequent systems including ChatGPT's browsing capabilities.
[[https://arxiv.org/abs/2112.09332|arXiv:2112.09332]]
===== Text-Based Web Browser Interface =====
WebGPT operates through a custom text-based browser environment that presents web pages as simplified text. The model issues structured commands:
* **Search [query]**: Execute a Bing search query
* **Click [link]**: Navigate to a specific link on the current page
* **Scroll [direction]**: Move up or down the current page
* **Quote [text]**: Extract and save a passage as a reference
* **End**: Terminate browsing and compose the final answer
At each step, the model receives a summary of the current browser state (page content, URL, scroll position) and selects the next action. The browsing session continues until the model issues an End command, reaches the maximum number of actions, or the collected quotes reach their maximum total length.
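The command set above can be sketched as a small parser. The exact text format the model emits is not specified here, so the bracketed-argument syntax and the `parse_command` helper below are assumptions for illustration:

<code python>
import re

# Hypothetical grammar: an action name optionally followed by a
# bracketed argument, e.g. "Search [who wrote Hamlet]" or "End".
COMMAND_RE = re.compile(r"^(Search|Click|Scroll|Quote|End)(?:\s*\[(.*)\])?$", re.DOTALL)

def parse_command(text):
    """Split a raw model-emitted command into (action, argument)."""
    match = COMMAND_RE.match(text.strip())
    if match is None:
        raise ValueError(f"Unrecognized command: {text!r}")
    action, arg = match.groups()
    return action.lower(), arg
</code>

For example, `parse_command("Scroll [down]")` yields `("scroll", "down")`, and argument-free commands such as `End` return `None` for the argument.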
===== RLHF Training Pipeline =====
The training process combines two complementary approaches:
==== Imitation Learning (Behavior Cloning) ====
Human demonstrators use the same text-based browser to answer questions, generating supervised training data of (question, browsing trajectory, answer) triples.(([[https://arxiv.org/abs/2203.02155|Ouyang et al. "Training language models to follow instructions with human feedback" (2022)]]))
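Behavior cloning on these triples is ordinary supervised next-token training on the serialized (question, trajectory, answer) sequences. A minimal PyTorch-style sketch, where `behavior_cloning_step` and the batch layout are illustrative assumptions rather than the paper's code:

<code python>
import torch
import torch.nn.functional as F

def behavior_cloning_step(policy, optimizer, batch):
    """One supervised step on tokenized human demonstrations.

    `batch` holds token ids of shape (batch, seq_len); the policy is
    trained to predict each next token, as in standard LM fine-tuning.
    """
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits = policy(inputs)  # (batch, seq_len - 1, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
</code>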
==== Reward Modeling from Human Preferences ====
A reward model $R_\phi$ is trained on pairwise human comparisons:
$$\mathcal{L}(\phi) = -\mathbb{E}_{(y_w, y_l) \sim \mathcal{D}} \left[\log \sigma\left(R_\phi(y_w) - R_\phi(y_l)\right)\right]$$
where $y_w$ is the human-preferred answer and $y_l$ is the rejected one. The policy is then optimized via PPO to maximize this learned reward while maintaining proximity to the behavior-cloned policy through a KL penalty:
$$\max_\theta \mathbb{E}_{y \sim \pi_\theta} \left[R_\phi(y)\right] - \beta \, \text{KL}(\pi_\theta \| \pi_{\text{BC}})$$
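Both objectives can be written out directly. In the sketch below, the helper names and the `beta` value are hypothetical, and the KL term is replaced by its per-sample Monte Carlo estimate $\log \pi_\theta(y) - \log \pi_{\text{BC}}(y)$:

<code python>
import math

def preference_loss(reward_w, reward_l):
    """Pairwise reward-model loss -log sigma(R(y_w) - R(y_l))."""
    return -math.log(1.0 / (1.0 + math.exp(-(reward_w - reward_l))))

def kl_shaped_reward(reward, logp_policy, logp_bc, beta=0.01):
    """Per-sample PPO objective R(y) - beta * (log pi_theta(y) - log pi_BC(y))."""
    return reward - beta * (logp_policy - logp_bc)
</code>

A tied comparison costs $\log 2$, and the shaped reward penalizes answers the current policy finds far more likely than the behavior-cloned policy does.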
===== System Architecture =====
<code>
graph TD
    A[User Question] --> B[GPT-3 Agent]
    B --> C{Select Action}
    C --> D[Search via Bing API]
    C --> E[Click Link]
    C --> F[Scroll Page]
    C --> G[Quote Extract]
    C --> H[End Browsing]
    D --> I[Browser State Update]
    E --> I
    F --> I
    G --> J[Reference Store]
    I --> B
    H --> K[Compose Answer with References]
    J --> K
    K --> L[Final Answer]
    M[Human Demonstrations] --> N[Behavior Cloning]
    O[Human Preference Pairs] --> P[Reward Model]
    N --> Q[Initial Policy]
    P --> R[PPO Optimization]
    Q --> R
    R --> B
</code>
===== Code Example =====
<code python>
import torch

# Simplified WebGPT browsing loop (illustrative, not the original code)
ACTIONS = ["search", "click", "scroll_up", "scroll_down", "quote", "end"]

def webgpt_answer(question, policy_model, browser, max_steps=20):
    browser.reset()
    references = []
    context = f"Question: {question}\n"
    for step in range(max_steps):
        browser_state = browser.get_state_summary()
        action, param = policy_model.predict(context + browser_state)
        if action == "search":
            browser.search(param)
        elif action == "click":
            browser.click(param)
        elif action in ("scroll_up", "scroll_down"):
            browser.scroll(action.removeprefix("scroll_"))
        elif action == "quote":
            ref = browser.extract_quote(param)
            references.append(ref)
        elif action == "end":
            break
        context += f"Action: {action} {param}\n"
    return policy_model.compose_answer(question, references)

def train_reward_model(reward_model, optimizer, preference_pairs):
    for answer_w, answer_l in preference_pairs:
        reward_w = reward_model(answer_w)
        reward_l = reward_model(answer_l)
        # Pairwise loss: -log sigma(R(y_w) - R(y_l))
        loss = -torch.nn.functional.logsigmoid(reward_w - reward_l)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
</code>
===== Key Results =====
* Best model preferred over human demonstrator answers **56% of the time** on ELI5(([[https://arxiv.org/abs/1907.09190|Fan et al. "ELI5: Long Form Question Answering" (2019)]]))
* Preferred over highest-voted Reddit answers **69% of the time**
* On TruthfulQA, answers are truthful **75% of the time** and both truthful and informative **54%** of the time
* Significantly reduces hallucinations compared to unaided GPT-3(([[https://arxiv.org/abs/2005.14165|Brown et al. "Language Models are Few-Shot Learners" (GPT-3, 2020)]]))
* Demonstrates that web browsing + RLHF is a viable path to factual QA
===== See Also =====
* [[palm_e|PaLM-E: Embodied Multimodal Language Model]]
* [[reasoning_via_planning|RAP: Reasoning via Planning]]
* [[expel_experiential_learning|ExpeL: Experiential Learning Agents]]
===== References =====