====== Competitive Programming Agents ======

Multi-agent LLM systems for competitive programming go beyond mere code correctness to address algorithmic efficiency, time/memory constraints, and complexity-aware optimization at IOI and ICPC competition levels.

===== Overview =====

Large language models can generate syntactically correct code, but competitive programming demands more: solutions must satisfy strict time and memory budgets while solving algorithmically complex problems. Two key research directions have emerged. SwiftSolve introduces a multi-agent framework with complexity-aware profiling and repair(([[https://arxiv.org/abs/2510.22626|Singh et al. "SwiftSolve: A Self-Iterative, Complexity-Aware Multi-Agent Framework for Competitive Programming" (2025)]])). AetherCode provides a rigorous benchmark revealing that even frontier LLMs struggle with elite-level competition problems(([[https://arxiv.org/abs/2508.16402|"AetherCode: Evaluating LLMs Ability to Win in Premier Programming Competitions" (2025)]])).
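The gap between mere correctness and competition-grade efficiency can be illustrated with two solutions to the classic maximum-subarray task. This is a generic illustration, not taken from either paper: both functions return identical answers, but only the linear-time one would survive a typical contest time limit at large input sizes.

```python
def max_subarray_brute(a: list[int]) -> int:
    # O(n^2): sums every subarray starting at each index.
    # Correct, but would time out for n around 10^5 under a 2 s budget.
    best = a[0]
    for i in range(len(a)):
        total = 0
        for j in range(i, len(a)):
            total += a[j]
            best = max(best, total)
    return best

def max_subarray_kadane(a: list[int]) -> int:
    # O(n): Kadane's algorithm -- same answers, fits the time budget.
    best = cur = a[0]
    for x in a[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best
```

A judge scoring only correctness treats these as equivalent; a complexity-aware agent must prefer the second.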
===== SwiftSolve: Complexity-Aware Multi-Agent Framework =====

SwiftSolve frames competitive programming as a software engineering environment where specialized agents collaborate through typed, versioned JSON messages:

  * **Planner Agent**: Proposes an algorithmic sketch based on problem analysis
  * **Static Pruner**: A deterministic filter that eliminates high-risk algorithmic plans before execution
  * **Coder Agent**: Generates ISO C++17 implementations from approved plans
  * **Profiler Agent**: Compiles and executes candidates on a fixed input-size schedule, recording wall time and peak memory
  * **Complexity Analyst**: Fits log-log growth curves to determine the empirical complexity class

The complexity analysis fits a log-log regression model:

$$\log T(n) = s \cdot \log n + b$$

where $T(n)$ is the measured runtime, $n$ is the input size, and the slope $s$ determines the empirical complexity class (e.g., $s \approx 1$ implies $O(n)$, $s \approx 2$ implies $O(n^2)$). The fit quality is measured by $R^2$, with an LLM fallback when $R^2$ is low.

The efficiency metric extends the standard pass@k:

$$\text{eff@k} = \frac{|\{i : \text{correct}_i \wedge T_i \leq T_{\text{limit}} \wedge M_i \leq M_{\text{limit}}\}|}{k}$$

A controller enforces iteration caps and diminishing-returns stopping criteria across the agent loop.

===== AetherCode Benchmark =====

AetherCode addresses limitations in prior benchmarks by using harder problems from IOI and ICPC competitions with expert-validated test suites. Problems are categorized by difficulty (Easy, Medium, Hard, Extreme) and recency (2024, 2025).

^ Model ^ Easy ^ Medium ^ Hard ^ Extreme ^ Pass@1 ^
| o4-mini-high | 65.3% | 32.1% | 8.0% | 3.8% | 35.5% |
| Gemini-2.5-Pro | 60.1% | 28.6% | 8.5% | 2.5% | 32.7% |
| GPT-4.1 | 23.9% | 5.7% | 1.1% | 0% | 10.5% |

Even frontier reasoning models achieve only ~35% Pass@1, far below human expert performance on elite problems.
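Returning to SwiftSolve's inter-agent protocol: the typed, versioned JSON messages can be sketched as below. The field names (''sender'', ''version'', ''payload'') are illustrative assumptions for exposition, not the paper's actual schema.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class AgentMessage:
    # Hypothetical envelope for a typed, versioned inter-agent message.
    sender: str      # e.g. "planner", "profiler"
    recipient: str   # e.g. "coder", "controller"
    version: int     # incremented on each revision of a plan or patch
    payload: dict    # plan sketch, code diff, or profiling report

def encode(msg: AgentMessage) -> str:
    # Serialize to JSON for transport between agents.
    return json.dumps(asdict(msg))

def decode(raw: str) -> AgentMessage:
    # Reconstruct the typed message; unknown fields would raise TypeError,
    # which is what makes the channel "typed" rather than free-form text.
    return AgentMessage(**json.loads(raw))
```

Typing and versioning the channel lets the controller audit exactly which plan revision produced which candidate, rather than parsing free-form agent chatter.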
===== Code Example =====

<code python>
import subprocess
from dataclasses import dataclass

import numpy as np


@dataclass
class ProfileResult:
    runtime_ms: float
    memory_kb: float
    correct: bool


def fit_complexity(input_sizes: list[int],
                   runtimes: list[float]) -> tuple[float, float]:
    """Fit log T(n) = s * log n + b and return (slope, R^2)."""
    log_n = np.log(input_sizes)
    log_t = np.log(runtimes)
    slope, intercept = np.polyfit(log_n, log_t, 1)
    ss_res = np.sum((log_t - (slope * log_n + intercept)) ** 2)
    ss_tot = np.sum((log_t - np.mean(log_t)) ** 2)
    r_squared = 1 - ss_res / ss_tot
    return slope, r_squared


def evaluate_solution(code: str, test_cases: list[dict],
                      time_limit_ms: float = 2000,
                      memory_limit_kb: float = 262144) -> dict:
    # Compile the candidate C++17 source, then run it under ./sandbox,
    # a harness assumed to print "runtime_ms,memory_kb" on stderr.
    with open("solution.cpp", "w") as f:
        f.write(code)
    subprocess.run(["g++", "-std=c++17", "-O2", "solution.cpp", "-o", "solution"],
                   check=True)
    results = []
    for tc in test_cases:
        proc = subprocess.run(
            ["./sandbox", "--time", str(time_limit_ms),
             "--mem", str(memory_limit_kb), "./solution"],
            input=tc["input"], capture_output=True, text=True,
        )
        runtime_ms, memory_kb = (float(x) for x in proc.stderr.split(","))
        results.append(ProfileResult(
            runtime_ms=runtime_ms,
            memory_kb=memory_kb,
            correct=(proc.stdout.strip() == tc["expected"]),
        ))
    pass_rate = sum(r.correct for r in results) / len(results)
    eff_rate = sum(
        r.correct
        and r.runtime_ms <= time_limit_ms
        and r.memory_kb <= memory_limit_kb
        for r in results
    ) / len(results)
    return {"pass@1": pass_rate, "eff@1": eff_rate}
</code>

===== Architecture =====

<code>
graph TD
    A[Problem Statement] --> B[Planner Agent]
    B --> C{Static Pruner}
    C -->|Approved| D[Coder Agent - C++17]
    C -->|Rejected| B
    D --> E[Profiler Agent]
    E --> F[Complexity Analyst]
    F --> G{Complexity OK?}
    G -->|Yes| H{Correctness Check}
    G -->|No - Algorithmic Issue| B
    G -->|No - Implementation Issue| D
    H -->|Pass| I[Submit Solution]
    H -->|Fail| D
    E --> J[Log-Log Regression]
    J --> F
    subgraph Feedback Loop
        K[Controller] --> L{Diminishing Returns?}
        L -->|No| B
        L -->|Yes| M[Best Solution]
    end
</code>

===== Key Results =====

^ System ^ Metric ^ Value ^
| SwiftSolve | Pass@1 | 61.54% (16/26 problems) |
| SwiftSolve | Solved@<=3 | 80.77% |
| SwiftSolve | Run-level success | 73.08% vs 52.6% (Claude Opus baseline) |
| SwiftSolve | Mean runtime overhead | 12.4s vs 6.8s (single agent) |
| AetherCode | Best Pass@1 (Extreme) | 3.8% (o4-mini-high) |

===== See Also =====

  * [[software_testing_agents|Software Testing Agents]]
  * [[budget_aware_reasoning|Budget-Aware Reasoning]]
  * [[game_playing_agents|Game Playing Agents]]

===== References =====