====== Competitive Programming Agents ======
Multi-agent LLM systems for competitive programming go beyond mere code correctness to address algorithmic efficiency, time/memory constraints, and complexity-aware optimization at IOI and ICPC competition levels.
===== Overview =====
Large language models can generate syntactically correct code, but competitive programming demands more: solutions must satisfy strict time and memory budgets while solving algorithmically complex problems. Two key research directions have emerged. SwiftSolve introduces a multi-agent framework with complexity-aware profiling and repair(([[https://arxiv.org/abs/2510.22626|Singh et al. "SwiftSolve: A Self-Iterative, Complexity-Aware Multi-Agent Framework for Competitive Programming" (2025)]])). AetherCode provides a rigorous benchmark revealing that even frontier LLMs struggle with elite-level competition problems(([[https://arxiv.org/abs/2508.16402|"AetherCode: Evaluating LLMs Ability to Win in Premier Programming Competitions" (2025)]])).
===== SwiftSolve: Complexity-Aware Multi-Agent Framework =====
SwiftSolve frames competitive programming as a software engineering environment where specialized agents collaborate through typed, versioned JSON messages:
* **Planner Agent**: Proposes an algorithmic sketch based on problem analysis
* **Static Pruner**: A deterministic filter that eliminates high-risk algorithmic plans before execution
* **Coder Agent**: Generates ISO C++17 implementations from approved plans
* **Profiler Agent**: Compiles and executes candidates on a fixed input-size schedule, recording wall time and peak memory
* **Complexity Analyst**: Fits log-log growth curves to determine empirical complexity class
The complexity analysis fits a log-log regression model:
$$\log T(n) = s \cdot \log n + b$$
where $T(n)$ is the measured runtime, $n$ is the input size, and the slope $s$ determines the empirical complexity class (e.g., $s \approx 1$ implies $O(n)$, $s \approx 2$ implies $O(n^2)$). The fit quality is measured by $R^2$, with an LLM fallback when $R^2$ is low.
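As a sanity check, fitting synthetic $O(n^2)$ timings in log-log space recovers a slope near 2; a minimal sketch using NumPy's `polyfit` (the constant factor is arbitrary):

```python
import numpy as np

# Synthetic runtimes for a quadratic algorithm: T(n) = c * n^2
sizes = np.array([1_000, 2_000, 4_000, 8_000, 16_000])
times = 1e-7 * sizes.astype(float) ** 2  # milliseconds, c = 1e-7

# Linear fit in log-log space: log T = s * log n + b
slope, intercept = np.polyfit(np.log(sizes), np.log(times), 1)
print(round(slope, 3))  # → 2.0, i.e. an empirical O(n^2) class
```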
The efficiency metric extends the standard pass@k:
$$\text{eff@k} = \frac{|\{i : \text{correct}_i \wedge T_i \leq T_{limit} \wedge M_i \leq M_{limit}\}|}{k}$$
A controller enforces iteration caps and diminishing-returns stopping criteria across the agent loop.
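A controller with an iteration cap and a diminishing-returns rule might look like the following sketch (the cap and improvement threshold are illustrative values, not SwiftSolve's published settings):

```python
def should_continue(scores: list[float], max_iters: int = 6,
                    min_gain: float = 0.02) -> bool:
    """Stop when the iteration cap is hit or the latest score's
    improvement over the previous best falls below min_gain."""
    if len(scores) >= max_iters:
        return False  # iteration cap reached
    if len(scores) >= 2 and scores[-1] - max(scores[:-1]) < min_gain:
        return False  # diminishing returns: negligible improvement
    return True

print(should_continue([0.4, 0.6]))        # large gain, keep iterating → True
print(should_continue([0.4, 0.6, 0.61]))  # gain below threshold → False
```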
===== AetherCode Benchmark =====
AetherCode addresses limitations in prior benchmarks by using harder problems from IOI and ICPC competitions with expert-validated test suites. Problems are categorized by difficulty (Easy, Medium, Hard, Extreme) and recency (2024, 2025).
^ Model ^ Easy ^ Medium ^ Hard ^ Extreme ^ Pass@1 ^
| o4-mini-high | 65.3% | 32.1% | 8.0% | 3.8% | 35.5% |
| Gemini-2.5-Pro | 60.1% | 28.6% | 8.5% | 2.5% | 32.7% |
| GPT-4.1 | 23.9% | 5.7% | 1.1% | 0% | 10.5% |
Even frontier reasoning models achieve only ~35% Pass@1, far below human expert performance on elite problems.
===== Code Example =====
A minimal sketch of the profiling and evaluation loop. The `./sandbox` binary and its `runtime_ms,memory_kb` stderr report format are assumptions about the execution harness, not a real tool:

<code python>
import subprocess
import numpy as np
from dataclasses import dataclass


@dataclass
class ProfileResult:
    runtime_ms: float
    memory_kb: float
    correct: bool


def fit_complexity(input_sizes: list[int],
                   runtimes: list[float]) -> tuple[float, float]:
    """Fit log T(n) = s * log n + b; return the slope and R^2."""
    log_n = np.log(input_sizes)
    log_t = np.log(runtimes)
    slope, intercept = np.polyfit(log_n, log_t, 1)
    ss_res = np.sum((log_t - (slope * log_n + intercept)) ** 2)
    ss_tot = np.sum((log_t - np.mean(log_t)) ** 2)
    r_squared = 1 - ss_res / ss_tot
    return slope, r_squared


def evaluate_solution(binary_path: str, test_cases: list[dict],
                      time_limit_ms: float = 2000,
                      memory_limit_kb: float = 262144) -> dict:
    """Run a compiled solution under a sandbox that writes
    "runtime_ms,memory_kb" to stderr (interface assumed)."""
    results = []
    for tc in test_cases:
        proc = subprocess.run(
            ["./sandbox", "--time", str(time_limit_ms),
             "--mem", str(memory_limit_kb), binary_path],
            input=tc["input"], capture_output=True, text=True
        )
        runtime_ms, memory_kb = (float(x) for x in proc.stderr.split(","))
        results.append(ProfileResult(
            runtime_ms=runtime_ms,
            memory_kb=memory_kb,
            correct=(proc.stdout.strip() == tc["expected"])
        ))
    pass_rate = sum(r.correct for r in results) / len(results)
    eff_rate = sum(
        r.correct and r.runtime_ms <= time_limit_ms
        and r.memory_kb <= memory_limit_kb for r in results
    ) / len(results)
    return {"pass@1": pass_rate, "eff@1": eff_rate}
</code>
===== Architecture =====
<code>
graph TD
    A[Problem Statement] --> B[Planner Agent]
    B --> C{Static Pruner}
    C -->|Approved| D[Coder Agent - C++17]
    C -->|Rejected| B
    D --> E[Profiler Agent]
    E --> J[Log-Log Regression]
    J --> F[Complexity Analyst]
    F --> G{Complexity OK?}
    G -->|Yes| H{Correctness Check}
    G -->|No - Algorithmic Issue| B
    G -->|No - Implementation Issue| D
    H -->|Pass| I[Submit Solution]
    H -->|Fail| D
    subgraph Feedback Loop
        K[Controller] --> L{Diminishing Returns?}
        L -->|No| B
        L -->|Yes| M[Best Solution]
    end
</code>
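The diagram's routing, sending algorithmic failures back to the Planner and implementation failures back to the Coder, can be sketched as follows (the slope tolerance and return labels are illustrative assumptions):

```python
def route_feedback(empirical_slope: float, expected_slope: float,
                   correct: bool, tolerance: float = 0.3) -> str:
    """Decide which agent receives the repair request."""
    if empirical_slope > expected_slope + tolerance:
        return "planner"  # wrong complexity class: replan the algorithm
    if not correct:
        return "coder"    # right complexity, wrong answers: fix the code
    return "submit"

print(route_feedback(2.1, 1.0, True))    # quadratic vs expected linear
print(route_feedback(1.05, 1.0, False))  # within tolerance but incorrect
print(route_feedback(1.05, 1.0, True))   # correct and efficient
```

Routing by empirical slope first matters: a correct but quadratic solution still fails competition time limits, so it must go back to the Planner rather than the Coder.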
===== Key Results =====
^ System ^ Metric ^ Value ^
| SwiftSolve | Pass@1 | 61.54% (16/26 problems) |
| SwiftSolve | Solved@<=3 | 80.77% |
| SwiftSolve | Run-level success | 73.08% vs 52.6% (Claude Opus baseline) |
| SwiftSolve | Mean runtime overhead | 12.4s vs 6.8s (single agent) |
| AetherCode | Best Pass@1 (Extreme) | 3.8% (o4-mini-high) |
===== See Also =====
* [[software_testing_agents|Software Testing Agents]]
* [[budget_aware_reasoning|Budget-Aware Reasoning]]
* [[game_playing_agents|Game Playing Agents]]
===== References =====