SOTOPIA: Evaluating Social Intelligence of Language Agents

SOTOPIA is an open-ended simulation environment and benchmark for evaluating the social intelligence of AI agents through complex, multi-turn social interactions. Introduced by Zhou et al. (2023), it provides 90 procedurally generated social scenarios spanning cooperative, competitive, and mixed-motive settings, scored across seven sociological dimensions.

Overview

Traditional agent benchmarks focus on task completion in isolation. SOTOPIA addresses a critical gap: measuring how well agents navigate the nuanced social dynamics that characterize real human interaction. Agents are placed in realistic social scenarios – negotiating, persuading, maintaining relationships – and evaluated holistically using SOTOPIA-EVAL.

Interactions are modeled as partially observable Markov decision processes (POMDPs), where each agent acts based on limited observations:

<latex>\pi(a_t | o_{1:t}, s_t, g)</latex>

where <latex>o_{1:t}</latex> are past observations, <latex>s_t</latex> is the agent's internal state, and <latex>g</latex> is the social goal.
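
As a toy illustration of this policy signature (all class and method names here are hypothetical, not part of the SOTOPIA API), an agent's next action can be written as a function of its observation history, internal state, and social goal:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SocialAgent:
    """Toy agent whose action depends on observation history o_{1:t},
    internal state s_t, and a fixed social goal g."""
    goal: str
    state: str = "neutral"
    history: List[str] = field(default_factory=list)

    def act(self, observation: str) -> str:
        self.history.append(observation)      # extend o_{1:t}
        if "refuse" in observation:
            self.state = "conciliatory"       # update internal state s_t
        # The action is conditioned on history, state, and goal g
        return f"[{self.state}] pursuing '{self.goal}' after {len(self.history)} turn(s)"

agent = SocialAgent(goal="agree on a fair price")
print(agent.act("Seller: I refuse your first offer."))
```

A real SOTOPIA agent replaces the hand-written conditionals with an LLM call, but the interface is the same: partial observations in, one utterance or action out.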

The Seven Social Dimensions

SOTOPIA-EVAL scores agents across seven dimensions inspired by sociology, psychology, and economics:

Dimension | Range | Description
Goal Completion (GOAL) | [0, 10] | Extent of achieving the primary social goal
Believability (BEL) | [0, 10] | Fidelity to assigned persona and character consistency
Knowledge (KNO) | [0, 10] | Effectiveness in acquiring relevant information
Secret (SEC) | [-10, 0] | Success in concealing private information
Relationship (REL) | [-5, 5] | Net social value created; relationship maintenance
Social Rules (SOC) | [-10, 0] | Adherence to social, legal, and ethical norms
Financial/Material (FIN) | [-5, 5] | Impact on tangible financial or material outcomes

The overall score is computed as:

<latex>S_{\text{overall}} = \frac{1}{7} \sum_{d \in D} S_d</latex>
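
The aggregation is an unweighted mean over the seven dimensions; note that the negatively ranged dimensions (SEC, SOC) pull the overall score down. A minimal sketch, with illustrative (not evaluator-produced) scores:

```python
# Overall SOTOPIA-EVAL score: unweighted mean of the seven dimensions.
# The scores below are made-up examples within each dimension's range.
scores = {
    "GOAL": 7.0,   # [0, 10]
    "BEL":  8.5,   # [0, 10]
    "KNO":  6.0,   # [0, 10]
    "SEC":  0.0,   # [-10, 0]
    "REL":  2.5,   # [-5, 5]
    "SOC": -1.0,   # [-10, 0]
    "FIN":  1.0,   # [-5, 5]
}

overall = sum(scores.values()) / len(scores)
print(f"Overall: {overall:.2f}")  # -> Overall: 3.43
```

Because the dimensions have different ranges, the mean is not normalized; per-dimension scores are usually reported alongside the aggregate.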

Architecture and Methodology

graph TD
    A[Scenario Generator] --> B[Character Assignment]
    B --> C[Goal Specification]
    C --> D[Multi-turn Interaction]
    D --> E[Agent 1 Response]
    D --> F[Agent 2 Response]
    E --> G[SOTOPIA-EVAL]
    F --> G
    G --> H[7-Dimension Scoring]
    H --> I[Aggregated Social Score]
    D -->|Turn Loop| D

Scenarios are procedurally generated with automated character creation, relationship assignment, and goal specification. This enables scalable simulation across diverse social contexts.
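
The generation pipeline could be sketched as below; the dataclass, the pools of settings, and the sampling logic are illustrative assumptions, not the actual SOTOPIA generator:

```python
import random
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Scenario:
    """One sampled social scenario: shared context, relationship
    between the two agents, and one private goal per agent."""
    context: str
    relationship: str
    goals: Tuple[str, str]

# Hypothetical content pools for illustration
SETTINGS = ["negotiating a used-car price", "splitting a shared inheritance"]
RELATIONSHIPS = ["strangers", "friends", "coworkers"]
GOAL_PAIRS = [
    ("maximize the sale price", "minimize the purchase price"),
    ("keep the family heirloom", "get an equal monetary split"),
]

def generate_scenario(rng: random.Random) -> Scenario:
    # Independently sample context, relationship, and per-agent goals
    return Scenario(
        context=rng.choice(SETTINGS),
        relationship=rng.choice(RELATIONSHIPS),
        goals=rng.choice(GOAL_PAIRS),
    )

scn = generate_scenario(random.Random(0))
print(scn.context, "|", scn.relationship)
```

In the real benchmark the pools are LLM-generated rather than hand-written lists, which is what makes the scenario space scalable and diverse.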

Key Results

  • GPT-4 agents underperform humans across most social dimensions
  • Sotopia-RL training yields goal completion scores of 7.17 on the hard benchmark and 8.31 on the full dataset
  • Behavior cloning + self-reinforcement pushes 7B-parameter models near GPT-4 on goal completion
  • Structured social context (S3AP tuples) improves performance by up to +18% on hard scenarios
  • LLM-based evaluators approximate human judgment on goal completion but risk overestimation

Code Example

# Evaluating an agent pair in SOTOPIA using the framework API.
# Note: the class names match the sotopia package, but the exact
# constructor arguments (scenario_id, persona) and return types shown
# here are illustrative and may vary between versions.
from sotopia.envs import ParallelSotopiaEnv
from sotopia.agents import LLMAgent
 
env = ParallelSotopiaEnv(scenario_id="negotiation_01")
agent1 = LLMAgent(model="gpt-4", persona="assertive_negotiator")
agent2 = LLMAgent(model="gpt-4", persona="cooperative_partner")
 
obs = env.reset()
done = False
while not done:
    # Both agents act simultaneously on their own partial observations
    actions = {
        "agent1": agent1.act(obs["agent1"]),
        "agent2": agent2.act(obs["agent2"]),
    }
    obs, rewards, done, info = env.step(actions)
 
# Retrieve the 7-dimension SOTOPIA-EVAL scores
scores = env.evaluate()
for dim, score in scores.items():
    print(f"{dim}: {score:.2f}")
