Game Agents: AI for Game Playing and Strategy

AI game agents powered by LLMs leverage language-guided policy generation and cross-game evaluation benchmarks to achieve generalization across thousands of diverse 3D video games without traditional reinforcement learning training. These game agents represent a new frontier in AI for game playing and strategy, moving beyond single-game specialists toward truly general-purpose players.1)

Overview

Game environments have served as foundational testing grounds for AI since the field's inception. The explosion of user-generated content (UGC) platforms has created thousands of diverse games that challenge traditional per-game AI approaches. PORTAL introduces language-guided behavior tree generation for playing thousands of 3D games, while the Orak benchmark evaluates LLM agents across 12 commercial titles spanning multiple genres.2)

PORTAL: Language-Guided Policy Generation

PORTAL transforms decision-making into language modeling: instead of training per-game weights, an LLM generates behavior tree policies expressed in a domain-specific language (DSL).

The policy generation process eliminates the computational burden of traditional RL:

<latex>\pi_{game} = \text{LLM}(\text{DSL}_{spec}, \text{API}_{game}, F_{quant}, F_{VLM})</latex>

where $\pi_{game}$ is the generated behavior tree policy, $\text{DSL}_{spec}$ defines the action space grammar, $\text{API}_{game}$ provides game-specific bindings, and $F_{quant}$, $F_{VLM}$ are feedback signals.

Cross-Game Generalization: Because policies are generated as interpretable DSL code rather than trained weights, PORTAL can instantly deploy to new games by updating only the API bindings. The LLM's prior knowledge about game mechanics enables zero-shot transfer.
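The binding-swap idea above can be sketched in a few lines. This is an illustrative toy, not PORTAL's actual API: the DSL string, the game names, and the binding tables are all hypothetical; the point is that the tree logic is reused verbatim and only the symbol-to-engine mapping changes per game.

```python
# One DSL policy, reused across games (hypothetical DSL syntax).
DSL_POLICY = "sequence(condition(enemy_visible), action(attack))"

# Hypothetical per-game API bindings: DSL symbol -> engine-specific call.
BINDINGS = {
    "shooter_a": {"enemy_visible": "radar.has_target", "attack": "weapon.fire"},
    "shooter_b": {"enemy_visible": "vision.enemy_in_fov", "attack": "input.press_fire"},
}

def bind_policy(dsl: str, game: str) -> str:
    """Rewrite DSL symbols to one game's engine calls; tree structure is untouched."""
    bound = dsl
    for symbol, call in BINDINGS[game].items():
        bound = bound.replace(symbol, call)
    return bound
```

Deploying to a new game then means adding one entry to the binding table, with no retraining step.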

Orak: Cross-Game Benchmark

Orak evaluates LLM agents across 12 popular commercial video games spanning action, puzzle, and RPG genres using a standardized interface built on the Model Context Protocol (MCP).

Evaluation Framework:

<latex>\text{Score}_{avg} = \frac{1}{|G|} \sum_{g \in G} S_g</latex>

where $G$ is the set of games and $S_g$ is the normalized performance score in game $g$.
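The averaging is straightforward; a minimal sketch with three hypothetical games (names and scores are made up, normalized to [0, 1]):

```python
def average_score(scores: dict[str, float]) -> float:
    """Mean of per-game normalized scores S_g over the game set G."""
    return sum(scores.values()) / len(scores)

# Toy example: three hypothetical games.
scores = {"puzzle_x": 0.9, "action_y": 0.3, "rpg_z": 0.6}
# average_score(scores) -> 0.6
```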

Model            Score range      Strengths
o3-mini          3.3% - 91.7%     Puzzles, action
Gemini-2.5-Pro   2.8% - 100%      Strategic games
DeepSeek-R1      4.0% - 92%       Reasoning-heavy
Qwen-2.5-7B      8.1% - 88.8%     Specific genres

All models show high variance across games, highlighting generalization gaps.
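The variance claim can be made concrete from the table's score ranges. The min/max pairs below are taken from the table; interpreting them as each model's worst and best game score, every model spans more than 80 percentage points:

```python
# (worst game, best game) score per model, from the table above.
ranges = {
    "o3-mini": (3.3, 91.7),
    "Gemini-2.5-Pro": (2.8, 100.0),
    "DeepSeek-R1": (4.0, 92.0),
    "Qwen-2.5-7B": (8.1, 88.8),
}

def score_spread(lo: float, hi: float) -> float:
    """Gap between a model's worst and best game, in percentage points."""
    return hi - lo

# Every model's spread exceeds 80 points, e.g. o3-mini: 91.7 - 3.3 = 88.4.
```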

Code Example

from dataclasses import dataclass, field

@dataclass
class BehaviorNode:
    node_type: str  # "selector", "sequence", "action", "condition"
    name: str
    children: list = field(default_factory=list)
    params: dict = field(default_factory=dict)

class PortalAgent:
    """Sketch of the PORTAL loop: generate a DSL behavior tree, play an
    episode, then refine from dual feedback (quantitative metrics + VLM
    analysis). The hooks parse_dsl, evaluate_tree, update_metrics,
    analyze_gameplay, and refine_tree are game- and model-specific and
    are elided here."""

    def __init__(self, llm, game_api):
        self.llm = llm            # text-generation backend
        self.game_api = game_api  # game-specific state/action bindings

    def generate_behavior_tree(self, game_spec: dict) -> BehaviorNode:
        # Prompt the LLM with the game's name, action API, and objectives;
        # the reply is DSL source for a behavior tree.
        dsl_code = self.llm.generate(
            f"Generate a behavior tree in DSL for:\n"
            f"Game: {game_spec['name']}\n"
            f"API: {game_spec['actions']}\n"
            f"Objectives: {game_spec['objectives']}\n"
            f"Use selector/sequence/action/condition nodes."
        )
        return self.parse_dsl(dsl_code)

    def play_episode(self, behavior_tree: BehaviorNode) -> dict:
        # Run one episode, ticking the tree against each observed state.
        metrics = {"kills": 0, "deaths": 0, "objectives": 0}
        while not self.game_api.is_done():
            state = self.game_api.get_state()
            action = self.evaluate_tree(behavior_tree, state)
            reward = self.game_api.step(action)
            self.update_metrics(metrics, reward)
        return metrics

    def iterative_improve(self, game_spec: dict,
                          n_iterations: int = 5) -> BehaviorNode:
        # Dual-feedback refinement loop from the architecture diagram.
        tree = self.generate_behavior_tree(game_spec)
        for _ in range(n_iterations):
            metrics = self.play_episode(tree)
            vlm_feedback = self.analyze_gameplay(tree)
            tree = self.refine_tree(tree, metrics, vlm_feedback)
        return tree

Architecture

graph TD
    A[Game Specification + API] --> B[LLM Policy Generator]
    B --> C[Behavior Tree - DSL]
    C --> D[Rule-Based Nodes]
    C --> E[Neural Network Nodes]
    D --> F[Game Execution Engine]
    E --> F
    F --> G[Quantitative Metrics]
    F --> H[Gameplay Recording]
    H --> I[VLM Analysis]
    G --> J[Dual Feedback Aggregator]
    I --> J
    J --> K{Performance Threshold?}
    K -->|No| B
    K -->|Yes| L[Deploy Policy]
    subgraph Orak Benchmark
        M[12 Commercial Games] --> N[MCP Interface]
        N --> O[LLM Agent Under Test]
        O --> P[Genre-Specific Scoring]
        P --> Q[Leaderboard + Arena]
    end

See Also

References