====== WebGPT: Browser-Assisted Question Answering with Human Feedback ======

**WebGPT** is a pioneering web-browsing QA system developed by Nakano et al. (2021) at **OpenAI** that fine-tunes GPT-3 to answer long-form questions by interacting with a text-based web browser and optimizing responses through reinforcement learning from human feedback (RLHF).(([[https://arxiv.org/abs/2112.09332|Nakano et al. "WebGPT: Browser-assisted question-answering with human feedback" (2021)]])) With **1,720 citations**, it established foundational techniques for retrieval-augmented generation and web-browsing agents that influenced subsequent systems, including ChatGPT's browsing capabilities. [[https://arxiv.org/abs/2112.09332|arXiv:2112.09332]]

===== Text-Based Web Browser Interface =====

WebGPT operates in a custom text-based browser environment that presents web pages as simplified text. The model issues structured commands:

  * **Search [query]**: Execute a Bing search query
  * **Click [link]**: Navigate to a specific link on the current page
  * **Scroll [direction]**: Move up or down the current page
  * **Quote [text]**: Extract and save a passage as a reference
  * **End**: Terminate browsing and compose the final answer

At each step, the model receives a summary of the current browser state (page content, URL, scroll position) and selects the next action. The browsing session continues until the model issues an End command, reaches the maximum number of actions, or exceeds the maximum total reference length.

===== RLHF Training Pipeline =====

The training process combines two complementary approaches:

==== Imitation Learning (Behavior Cloning) ====

Human demonstrators use the same text-based browser to answer questions, generating supervised training data of (question, browsing trajectory, answer) triples.(([[https://arxiv.org/abs/2203.02155|Ouyang et al. "Training language models to follow instructions with human feedback" (2022)]]))

==== Reward Modeling from Human Preferences ====

A reward model $R_\phi$ is trained on pairwise human comparisons:

$$\mathcal{L}(\phi) = -\mathbb{E}_{(y_w, y_l) \sim \mathcal{D}} \left[\log \sigma\left(R_\phi(y_w) - R_\phi(y_l)\right)\right]$$

where $y_w$ is the human-preferred answer and $y_l$ is the rejected one. The policy is then optimized via PPO to maximize this learned reward while staying close to the behavior-cloned policy through a KL penalty:

$$\max_\theta \mathbb{E}_{y \sim \pi_\theta} \left[R_\phi(y)\right] - \beta \, \text{KL}(\pi_\theta \| \pi_{\text{BC}})$$

===== System Architecture =====

<code>
graph TD
    A[User Question] --> B[GPT-3 Agent]
    B --> C{Select Action}
    C --> D[Search via Bing API]
    C --> E[Click Link]
    C --> F[Scroll Page]
    C --> G[Quote Extract]
    C --> H[End Browsing]
    D --> I[Browser State Update]
    E --> I
    F --> I
    G --> J[Reference Store]
    I --> B
    H --> K[Compose Answer with References]
    J --> K
    K --> L[Final Answer]
    M[Human Demonstrations] --> N[Behavior Cloning]
    O[Human Preference Pairs] --> P[Reward Model]
    N --> Q[Initial Policy]
    P --> R[PPO Optimization]
    Q --> R
    R --> B
</code>

===== Code Example =====

<code python>
# Simplified WebGPT browsing loop
ACTIONS = ["search", "click", "scroll_up", "scroll_down", "quote", "end"]

def webgpt_answer(question, policy_model, browser, max_steps=20):
    browser.reset()
    references = []
    context = f"Question: {question}\n"
    for step in range(max_steps):
        browser_state = browser.get_state_summary()
        action, param = policy_model.predict(context + browser_state)
        if action == "search":
            browser.search(param)
        elif action == "click":
            browser.click(param)
        elif action in ("scroll_up", "scroll_down"):
            browser.scroll(action)
        elif action == "quote":
            ref = browser.extract_quote(param)
            references.append(ref)
        elif action == "end":
            break
        context += f"Action: {action} {param}\n"
    return policy_model.compose_answer(question, references)

def train_reward_model(reward_model, preference_pairs, optimizer):
    # Pairwise preference loss: -log sigma(R(y_w) - R(y_l))
    for answer_w, answer_l in preference_pairs:
        reward_w = reward_model(answer_w)
        reward_l = reward_model(answer_l)
        loss = -log_sigmoid(reward_w - reward_l)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
</code>

===== Key Results =====

  * Best model's answers preferred over human demonstrators' answers **56% of the time** on ELI5(([[https://arxiv.org/abs/1907.09190|Fan et al. "ELI5: Long Form Question Answering" (2019)]]))
  * Preferred over the highest-voted Reddit answers **69% of the time**
  * On TruthfulQA, answers are factually accurate **75% of the time** and both true and informative **54%** of the time
  * Significantly reduces hallucinations compared to unaided GPT-3(([[https://arxiv.org/abs/2005.14165|Brown et al. "Language Models are Few-Shot Learners" (GPT-3, 2020)]]))
  * Demonstrates that web browsing combined with RLHF is a viable path to factual long-form QA

===== See Also =====

  * [[palm_e|PaLM-E: Embodied Multimodal Language Model]]
  * [[reasoning_via_planning|RAP: Reasoning via Planning]]
  * [[expel_experiential_learning|ExpeL: Experiential Learning Agents]]

===== References =====
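The two training objectives above (the pairwise reward-model loss and the KL-penalized PPO reward) can be checked numerically with a minimal plain-Python sketch. The function names and values here are illustrative, not from the paper; the per-sample KL term is written as the standard log-probability-difference estimator:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(reward_preferred, reward_rejected):
    """Pairwise reward-model loss: -log sigma(R(y_w) - R(y_l))."""
    return -math.log(sigmoid(reward_preferred - reward_rejected))

def kl_penalized_reward(reward, logprob_policy, logprob_bc, beta=0.1):
    """Per-sample PPO reward with a KL penalty toward the BC policy:
    R(y) - beta * (log pi_theta(y) - log pi_BC(y))."""
    return reward - beta * (logprob_policy - logprob_bc)

# When the reward model cannot distinguish the pair, the loss is log 2.
print(round(preference_loss(1.0, 1.0), 4))  # 0.6931
# A correctly ranked pair (higher reward for the preferred answer) gives lower loss.
print(preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0))  # True
# Drifting above the BC policy's log-probability is penalized.
print(kl_penalized_reward(1.0, -2.0, -3.0, beta=0.1))  # 0.9
```

The KL penalty keeps the PPO policy from exploiting flaws in the learned reward model: the further a sample's log-probability under the optimized policy rises above its log-probability under the behavior-cloned policy, the more the reward is discounted.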