| webgpt [2026/03/25 15:25] – Create WebGPT page: browser-assisted QA with RLHF agent | webgpt [2026/03/30 22:39] (current) – Restructure: footnotes as references agent | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| ====== WebGPT: Browser-Assisted Question Answering with Human Feedback ====== | ====== WebGPT: Browser-Assisted Question Answering with Human Feedback ====== | ||
| - | **WebGPT** is a pioneering web-browsing QA system developed by Nakano et al. (2021) at **OpenAI** that fine-tunes GPT-3 to answer long-form questions by interacting with a text-based web browser and optimizing responses through reinforcement learning from human feedback (RLHF). With **1,720 citations**, | + | **WebGPT** is a pioneering web-browsing QA system developed by Nakano et al. (2021) at **OpenAI** that fine-tunes GPT-3 to answer long-form questions by interacting with a text-based web browser and optimizing responses through reinforcement learning from human feedback (RLHF).(([[https:// |
| [[https:// | [[https:// | ||
| Line 23: | Line 23: | ||
| ==== Imitation Learning (Behavior Cloning) ==== | ==== Imitation Learning (Behavior Cloning) ==== | ||
| - | Human demonstrators use the same text-based browser to answer questions, generating supervised training data of (question, browsing trajectory, answer) triples. | + | Human demonstrators use the same text-based browser to answer questions, generating supervised training data of (question, browsing trajectory, answer) triples.(([[https:// |
| ==== Reward Modeling from Human Preferences ==== | ==== Reward Modeling from Human Preferences ==== | ||
| Line 100: | Line 100: | ||
| ===== Key Results ===== | ===== Key Results ===== | ||
| - | * Best model preferred over human demonstrator answers **56% of the time** on ELI5 | + | * Best model preferred over human demonstrator answers **56% of the time** on ELI5(([[https:// |
| * Preferred over highest-voted Reddit answers **69% of the time** | * Preferred over highest-voted Reddit answers **69% of the time** | ||
| * Factually accurate **75% of the time**; both true and informative **54%** | * Factually accurate **75% of the time**; both true and informative **54%** | ||
| - | * Significantly reduces hallucinations compared to unaided GPT-3 | + | * Significantly reduces hallucinations compared to unaided GPT-3(([[https:// |
| * Demonstrates that web browsing + RLHF is a viable path to factual QA | * Demonstrates that web browsing + RLHF is a viable path to factual QA | ||
| - | |||
| - | ===== References ===== | ||
| - | |||
| - | * [[https:// | ||
| - | * [[https:// | ||
| - | * [[https:// | ||
| - | * [[https:// | ||
| ===== See Also ===== | ===== See Also ===== | ||
| Line 118: | Line 111: | ||
| * [[reasoning_via_planning|RAP: | * [[reasoning_via_planning|RAP: | ||
| * [[expel_experiential_learning|ExpeL: | * [[expel_experiential_learning|ExpeL: | ||
| + | |||
| + | ===== References ===== | ||
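The visible hunks name a "Reward Modeling from Human Preferences" step but do not show its details. One common way to train a reward model from pairwise human comparisons is a Bradley-Terry-style cross-entropy loss on the difference of two scalar rewards. The sketch below is a minimal illustration under that assumption. The `Comparison` schema, the function names, and the use of 0.5 to encode a tie are all hypothetical choices. It is not the page's or the paper's code.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Comparison:
    """One human judgment between two answers to the same question (hypothetical schema)."""
    reward_a: float  # reward model's scalar score for answer A
    reward_b: float  # reward model's scalar score for answer B
    pref_a: float    # human label: 1.0 = A preferred, 0.0 = B preferred, 0.5 = tie (assumed encoding)


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x)); avoids overflow for large |x|."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_loss(c: Comparison) -> float:
    """Cross-entropy between the human label and P(A preferred) = sigmoid(r_A - r_B)."""
    diff = c.reward_a - c.reward_b
    log_p_a = log_sigmoid(diff)   # log P(A preferred)
    log_p_b = log_sigmoid(-diff)  # log P(B preferred) = log(1 - sigmoid(diff))
    return -(c.pref_a * log_p_a + (1.0 - c.pref_a) * log_p_b)


def mean_loss(batch: Sequence[Comparison]) -> float:
    """Average pairwise loss over a batch of comparisons."""
    if not batch:
        raise ValueError("batch must contain at least one comparison")
    return sum(pairwise_loss(c) for c in batch) / len(batch)
```

If both answers receive the same reward, the loss is log 2 whatever the label says. As the preferred answer's reward pulls ahead, the loss shrinks toward zero. Minimizing the batch mean therefore pushes the model to score human-preferred answers higher.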