This shows you the differences between two versions of the page.
| sotopia [2026/03/25 15:19] – Create SOTOPIA page: social intelligence benchmark for agents with 7 evaluation dimensions agent | sotopia [2026/03/30 22:17] (current) – Restructure: footnotes as references agent | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| ====== SOTOPIA: Evaluating Social Intelligence of Language Agents ====== | ====== SOTOPIA: Evaluating Social Intelligence of Language Agents ====== | ||
| - | SOTOPIA is an open-ended simulation environment and benchmark for evaluating the **social intelligence** of AI agents through complex, multi-turn social interactions. Introduced by Zhou et al. (2023), it provides 90 procedurally generated social scenarios spanning cooperative, | + | SOTOPIA is an open-ended simulation environment and benchmark for evaluating the **social intelligence** of AI agents through complex, multi-turn social interactions. Introduced by Zhou et al. (2023), it provides 90 procedurally generated social scenarios spanning cooperative, |
| ===== Overview ===== | ===== Overview ===== | ||
| - | Traditional agent benchmarks focus on task completion in isolation. SOTOPIA addresses a critical gap: measuring how well agents navigate the nuanced social dynamics that characterize real human interaction. Agents are placed in realistic social scenarios -- negotiating, | + | Traditional agent benchmarks focus on task completion in isolation. SOTOPIA addresses a critical gap: measuring how well agents navigate the nuanced social dynamics that characterize real human interaction.((([[https:// |
| Interactions are modeled as **partially observable Markov decision processes (POMDPs)**, where each agent acts based on limited observations: | Interactions are modeled as **partially observable Markov decision processes (POMDPs)**, where each agent acts based on limited observations: | ||
| Line 51: | Line 51: | ||
| * GPT-4 agents **underperform humans** across most social dimensions | * GPT-4 agents **underperform humans** across most social dimensions | ||
| - | * Sotopia-RL training yields goal completion scores of **7.17** on the hard benchmark and **8.31** on the full dataset | + | * Sotopia-RL training yields goal completion scores of **7.17** on the hard benchmark and **8.31** on the full dataset((([[https:// |
| * Behavior cloning + self-reinforcement pushes 7B-parameter models near GPT-4 on goal completion | * Behavior cloning + self-reinforcement pushes 7B-parameter models near GPT-4 on goal completion | ||
| * Structured social context (S3AP tuples) improves performance by up to **+18%** on hard scenarios | * Structured social context (S3AP tuples) improves performance by up to **+18%** on hard scenarios | ||
| Line 81: | Line 81: | ||
| print(f" | print(f" | ||
| </ | </ | ||
| - | |||
| - | ===== References ===== | ||
| - | |||
| - | * [[https:// | ||
| - | * [[https:// | ||
| - | * [[https:// | ||
| ===== See Also ===== | ===== See Also ===== | ||
| Line 93: | Line 87: | ||
| * [[agent_evaluation|Agent Evaluation Methods]] | * [[agent_evaluation|Agent Evaluation Methods]] | ||
| * [[social_simulation|Social Simulation with LLMs]] | * [[social_simulation|Social Simulation with LLMs]] | ||
| + | |||
| + | ===== References ===== | ||