LLM-powered agents for game playing leverage language-guided policy generation and cross-game evaluation benchmarks to achieve generalization across thousands of diverse 3D video games without traditional reinforcement learning training.
Game environments have served as foundational testing grounds for AI since the field's inception. The explosion of user-generated content (UGC) platforms has created thousands of diverse games that challenge traditional per-game AI approaches. PORTAL introduces language-guided behavior tree generation for playing thousands of 3D games, while the Orak benchmark evaluates LLM agents across 12 commercial titles spanning multiple genres.
PORTAL transforms decision-making into language modeling by having LLMs generate behavior trees expressed in a domain-specific language (DSL). Because policies are written rather than learned, the generation process eliminates the computational burden of traditional RL:
<latex>\pi_{game} = \text{LLM}(\text{DSL}_{spec}, \text{API}_{game}, F_{quant}, F_{VLM})</latex>
where $\pi_{game}$ is the generated behavior tree policy, $\text{DSL}_{spec}$ defines the action space grammar, $\text{API}_{game}$ provides game-specific bindings, and $F_{quant}$, $F_{VLM}$ are feedback signals.
Cross-Game Generalization: Because policies are generated as interpretable DSL code rather than trained weights, PORTAL can instantly deploy to new games by updating only the API bindings. The LLM's prior knowledge about game mechanics enables zero-shot transfer.
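The binding swap can be sketched as follows; the DSL string and binding names are illustrative assumptions, not PORTAL's actual interface:

```python
# A generated DSL policy references abstract action names; per-game API
# bindings map those names to concrete engine calls (names hypothetical).
POLICY_DSL = "sequence(condition(enemy_visible), action(attack))"

bindings_shooter = {"enemy_visible": "raycast_target()", "attack": "fire_weapon()"}
bindings_rpg = {"enemy_visible": "scan_hostiles()", "attack": "cast_spell()"}

def bind(policy: str, bindings: dict) -> str:
    """Rewrite abstract DSL symbols into game-specific API calls."""
    for symbol, call in bindings.items():
        policy = policy.replace(symbol, call)
    return policy

# Same policy deploys to a new game: only the bindings change.
print(bind(POLICY_DSL, bindings_shooter))
print(bind(POLICY_DSL, bindings_rpg))
```

Since the tree itself never changes, no retraining step sits between games; the per-game cost is limited to writing the binding table.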
Orak evaluates LLM agents across 12 popular commercial video games spanning action, puzzle, and RPG genres using a standardized interface built on the Model Context Protocol (MCP).
Evaluation Framework:
<latex>\text{Score}_{avg} = \frac{1}{|G|} \sum_{g \in G} S_g</latex>
where $G$ is the set of games and $S_g$ is the normalized performance score in game $g$.
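The aggregate is a plain mean over normalized per-game scores; a minimal sketch (game names and values illustrative, not Orak results):

```python
def orak_average(scores: dict) -> float:
    """Mean of normalized per-game scores S_g over the game set G."""
    return sum(scores.values()) / len(scores)

# Illustrative normalized scores, one entry per game g in G.
scores = {"puzzle_a": 0.92, "action_b": 0.33, "rpg_c": 0.61}
print(round(orak_average(scores), 2))  # 0.62
```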
| Model | Score range across games | Strengths |
|---|---|---|
| o3-mini | 3.3% - 91.7% | Puzzles, action |
| Gemini-2.5-Pro | 2.8% - 100% | Strategic games |
| DeepSeek-R1 | 4.0% - 92% | Reasoning-heavy |
| Qwen-2.5-7B | 8.1% - 88.8% | Specific genres |
All models show high variance across games, highlighting generalization gaps.
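The spread behind that claim can be read directly off the table; the per-model range (in percentage points) quantifies it:

```python
# (min, max) select scores per model, in percent, from the table above.
score_ranges = {
    "o3-mini": (3.3, 91.7),
    "Gemini-2.5-Pro": (2.8, 100.0),
    "DeepSeek-R1": (4.0, 92.0),
    "Qwen-2.5-7B": (8.1, 88.8),
}

for model, (low, high) in score_ranges.items():
    print(f"{model}: spread {high - low:.1f} points")
```

Every model spans more than 80 points between its worst and best game, which is the generalization gap the benchmark surfaces.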
```python
from dataclasses import dataclass


@dataclass
class BehaviorNode:
    node_type: str  # "selector", "sequence", "action", "condition"
    name: str
    children: list = None
    params: dict = None


class PortalAgent:
    def __init__(self, llm, game_api):
        self.llm = llm
        self.game_api = game_api

    def generate_behavior_tree(self, game_spec: dict) -> BehaviorNode:
        # Prompt the LLM with the DSL grammar, game API, and objectives.
        dsl_code = self.llm.generate(
            f"Generate a behavior tree in DSL for:\n"
            f"Game: {game_spec['name']}\n"
            f"API: {game_spec['actions']}\n"
            f"Objectives: {game_spec['objectives']}\n"
            f"Use selector/sequence/action/condition nodes."
        )
        return self.parse_dsl(dsl_code)

    def play_episode(self, behavior_tree: BehaviorNode) -> dict:
        # Execute the policy until the episode ends, collecting metrics.
        metrics = {"kills": 0, "deaths": 0, "objectives": 0}
        while not self.game_api.is_done():
            state = self.game_api.get_state()
            action = self.evaluate_tree(behavior_tree, state)
            reward = self.game_api.step(action)
            self.update_metrics(metrics, reward)
        return metrics

    def iterative_improve(self, game_spec: dict, n_iterations: int = 5) -> BehaviorNode:
        # Alternate play and refinement using quantitative and VLM feedback.
        tree = self.generate_behavior_tree(game_spec)
        for _ in range(n_iterations):
            metrics = self.play_episode(tree)
            vlm_feedback = self.analyze_gameplay(tree)
            tree = self.refine_tree(tree, metrics, vlm_feedback)
        return tree
```
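The episode loop above calls `evaluate_tree`, which is left abstract. A minimal tick interpreter for the four node types might look like this; the success/failure semantics are a standard behavior-tree convention, not PORTAL's documented runtime:

```python
from dataclasses import dataclass, field

@dataclass
class BehaviorNode:
    node_type: str  # "selector", "sequence", "action", "condition"
    name: str
    children: list = field(default_factory=list)

def evaluate_tree(node: BehaviorNode, state: dict):
    """Tick the tree: return the chosen action name, or None on failure."""
    if node.node_type == "condition":
        # Succeeds (truthy) only if the named predicate holds in the state.
        return True if state.get(node.name, False) else None
    if node.node_type == "action":
        return node.name
    if node.node_type == "sequence":
        result = None
        for child in node.children:
            result = evaluate_tree(child, state)
            if result is None:
                return None  # any failing child aborts the sequence
        return result
    if node.node_type == "selector":
        for child in node.children:
            result = evaluate_tree(child, state)
            if result is not None:
                return result  # first succeeding child wins
        return None
    raise ValueError(f"unknown node type: {node.node_type}")

# Selector policy: attack if an enemy is visible, otherwise explore.
tree = BehaviorNode("selector", "root", [
    BehaviorNode("sequence", "engage", [
        BehaviorNode("condition", "enemy_visible"),
        BehaviorNode("action", "attack"),
    ]),
    BehaviorNode("action", "explore"),
])

print(evaluate_tree(tree, {"enemy_visible": True}))   # attack
print(evaluate_tree(tree, {"enemy_visible": False}))  # explore
```

Because the selector falls through to `explore` only when the engage sequence fails, the tree encodes priorities declaratively, which is what makes the generated DSL policies inspectable.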