WebGPT is a pioneering web-browsing QA system developed by Nakano et al. (2021) at OpenAI that fine-tunes GPT-3 to answer long-form questions by interacting with a text-based web browser and optimizing responses through reinforcement learning from human feedback (RLHF). With 1,720 citations, it established foundational techniques for retrieval-augmented generation and web-browsing agents that influenced subsequent systems including ChatGPT's browsing capabilities.
WebGPT operates through a custom text-based browser environment that presents web pages as simplified text. The model issues structured commands, including search <query>, click <link>, scroll up/down, quote (to save a passage as a supporting reference), and end (to stop browsing and compose the final answer).
At each step, the model receives a summary of the current browser state (page content, URL, scroll position) and selects the next action. The browsing session continues until the model issues an End command, reaches the maximum number of actions, or exceeds the maximum total length of quoted references.
The training process combines two complementary approaches:
Behavior cloning: human demonstrators answer questions using the same text-based browser, producing supervised training data of (question, browsing trajectory, answer) triples.
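A demonstration trajectory can be flattened into (context, target) pairs for supervised fine-tuning. The sketch below shows one way to do this; `trajectory_to_examples` and the exact prompt layout are illustrative assumptions, not the paper's actual data format.

```python
def trajectory_to_examples(question, trajectory, answer):
    """Flatten one human demonstration into (context, target) pairs
    for behavior cloning. Hypothetical helper; the real WebGPT prompt
    format differs in detail.

    trajectory: list of (browser_state_summary, action_string) pairs.
    """
    context = f"Question: {question}\n"
    examples = []
    for state, action in trajectory:
        # Predict the demonstrator's action given the context so far
        examples.append((context + state, action))
        context += f"Action: {action}\n"
    # Final target: compose the answer from the full browsing context
    examples.append((context, f"Answer: {answer}"))
    return examples
```

Each pair is then a standard next-token prediction example for fine-tuning the language model.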
Reward modeling: a reward model $R_\phi$ is trained on pairwise human comparisons:
$$\mathcal{L}(\phi) = -\mathbb{E}_{(y_w, y_l) \sim \mathcal{D}} \left[\log \sigma\left(R_\phi(y_w) - R_\phi(y_l)\right)\right]$$
where $y_w$ is the human-preferred answer and $y_l$ is the rejected one. The policy is then optimized via PPO to maximize this learned reward while maintaining proximity to the behavior-cloned policy through a KL penalty:
$$\max_\theta \mathbb{E}_{y \sim \pi_\theta} \left[R_\phi(y)\right] - \beta \, \text{KL}(\pi_\theta \| \pi_{\text{BC}})$$
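The KL term in this objective is typically folded into the reward during PPO rollouts. The sketch below shows that shaping under the common approximation of the KL as a sum of per-token log-probability differences; `kl_shaped_reward` is a hypothetical helper, not WebGPT's actual training code.

```python
def kl_shaped_reward(reward, logp_policy, logp_bc, beta=0.1):
    """KL-penalized reward for one sampled answer.

    reward: scalar from the learned reward model R_phi.
    logp_policy: per-token log-probs of the sample under pi_theta.
    logp_bc: per-token log-probs of the same tokens under pi_BC.
    beta: KL penalty coefficient.
    """
    # Monte Carlo estimate of KL(pi_theta || pi_BC) on this sample
    kl = sum(lp - lr for lp, lr in zip(logp_policy, logp_bc))
    return reward - beta * kl
```

The policy drifting far from the behavior-cloned reference inflates the KL term, which lowers the shaped reward and discourages reward-model over-optimization.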
```python
import torch

# Simplified WebGPT browsing loop
ACTIONS = ["search", "click", "scroll_up", "scroll_down", "quote", "end"]

def webgpt_answer(question, policy_model, browser, max_steps=20):
    browser.reset()
    references = []
    context = f"Question: {question}\n"
    for step in range(max_steps):
        browser_state = browser.get_state_summary()
        action, param = policy_model.predict(context + browser_state)
        if action == "search":
            browser.search(param)
        elif action == "click":
            browser.click(param)
        elif action in ("scroll_up", "scroll_down"):
            browser.scroll(action)
        elif action == "quote":
            # Save an extracted passage as a supporting reference
            ref = browser.extract_quote(param)
            references.append(ref)
        elif action == "end":
            break
        context += f"Action: {action} {param}\n"
    # Compose the final answer from the collected references
    return policy_model.compose_answer(question, references)

def train_reward_model(reward_model, optimizer, preference_pairs):
    # Pairwise Bradley-Terry loss on (preferred, rejected) answers
    for answer_w, answer_l in preference_pairs:
        reward_w = reward_model(answer_w)
        reward_l = reward_model(answer_l)
        loss = -torch.nn.functional.logsigmoid(reward_w - reward_l)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```