Multi-agent LLM systems for competitive programming go beyond mere code correctness to address algorithmic efficiency, time/memory constraints, and complexity-aware optimization at IOI and ICPC competition levels.
Large language models can generate syntactically correct code, but competitive programming demands more: solutions must satisfy strict time and memory budgets while solving algorithmically complex problems. Two key research directions have emerged. SwiftSolve introduces a multi-agent framework with complexity-aware profiling and repair. AetherCode provides a rigorous benchmark revealing that even frontier LLMs struggle with elite-level competition problems.
SwiftSolve frames competitive programming as a software engineering environment where specialized agents collaborate through typed, versioned JSON messages:
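To make the message-passing concrete, here is a minimal sketch of a typed, versioned JSON envelope such agents might exchange. The field names, agent roles, and payload contents are illustrative assumptions, not SwiftSolve's published schema:

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical message envelope; field names and roles are assumptions.
@dataclass
class AgentMessage:
    sender: str          # e.g. "planner", "coder", "profiler", "repairer"
    recipient: str
    msg_type: str        # e.g. "plan", "code", "profile_report", "patch"
    version: int         # incremented on each revision of the payload
    payload: dict = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(raw: str) -> "AgentMessage":
        return AgentMessage(**json.loads(raw))

# A profiler reporting an empirical complexity estimate to a repair agent.
msg = AgentMessage("profiler", "repairer", "profile_report", 2,
                   {"slope": 1.98, "r_squared": 0.99, "verdict": "O(n^2)"})
round_trip = AgentMessage.from_json(msg.to_json())
```

Typed envelopes like this let each agent validate and version what it receives instead of parsing free-form text from its peers.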
The complexity analysis fits a log-log regression model:
<latex>\log T(n) = s \cdot \log n + b</latex>
where $T(n)$ is the measured runtime, $n$ is the input size, and the slope $s$ determines the empirical complexity class (e.g., $s \approx 1$ implies $O(n)$, $s \approx 2$ implies $O(n^2)$). The fit quality is measured by $R^2$, with an LLM fallback when $R^2$ is low.
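As a sanity check, fitting this regression on synthetic quadratic runtimes recovers a slope near 2; the input sizes and cost constant below are illustrative:

```python
import numpy as np

# Synthetic runtimes following T(n) = c * n^2 exactly, so the fitted
# slope should come out at ~2 (empirical O(n^2)).
sizes = np.array([1_000, 2_000, 4_000, 8_000, 16_000])
runtimes_ms = 1e-6 * sizes.astype(float) ** 2

# Fit log T(n) = s * log n + b.
slope, intercept = np.polyfit(np.log(sizes), np.log(runtimes_ms), 1)

# R^2 of the log-log fit.
pred = slope * np.log(sizes) + intercept
ss_res = np.sum((np.log(runtimes_ms) - pred) ** 2)
ss_tot = np.sum((np.log(runtimes_ms) - np.log(runtimes_ms).mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
```

On real measurements the slope is noisy, which is why a low $R^2$ triggers the LLM fallback rather than trusting the fit.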
The efficiency metric extends the standard pass@k:
<latex>\text{eff@k} = \frac{|\{i : \text{correct}_i \wedge T_i \leq T_{limit} \wedge M_i \leq M_{limit}\}|}{k}</latex>
A controller enforces iteration caps and diminishing-returns stopping criteria across the agent loop.
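A minimal sketch of such a controller, assuming the loop tracks a per-iteration quality score; the cap, threshold, and function names are assumptions for illustration:

```python
# Illustrative controller: stop when an iteration cap is hit or when the
# score improvement between successive iterations falls below `min_gain`
# (a diminishing-returns criterion).
def run_loop(step, max_iters: int = 8, min_gain: float = 0.05) -> list[float]:
    scores: list[float] = []
    for i in range(max_iters):
        scores.append(step(i))
        if len(scores) >= 2 and scores[-1] - scores[-2] < min_gain:
            break  # diminishing returns: improvement below threshold
    return scores

# Example: each round halves the remaining gap, so gains shrink and the
# loop halts before exhausting the iteration cap.
history = run_loop(lambda i: 1.0 - 0.5 ** (i + 1))
```

Capping iterations bounds token cost, while the improvement threshold avoids spending further rounds on marginal repairs.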
AetherCode addresses limitations in prior benchmarks by using harder problems from IOI and ICPC competitions with expert-validated test suites. Problems are categorized by difficulty (Easy, Medium, Hard, Extreme) and recency (2024, 2025).
| Model | Easy | Medium | Hard | Extreme | Pass@1 |
|---|---|---|---|---|---|
| o4-mini-high | 65.3% | 32.1% | 8.0% | 3.8% | 35.5% |
| Gemini-2.5-Pro | 60.1% | 28.6% | 8.5% | 2.5% | 32.7% |
| GPT-4.1 | 23.9% | 5.7% | 1.1% | 0% | 10.5% |
Even frontier reasoning models achieve only ~35% Pass@1, far below human expert performance on elite problems.
The profiling and evaluation pipeline can be sketched as follows:

```python
import subprocess
from dataclasses import dataclass

import numpy as np

@dataclass
class ProfileResult:
    runtime_ms: float
    memory_kb: float
    correct: bool

def fit_complexity(input_sizes: list[int], runtimes: list[float]) -> tuple[float, float]:
    """Fit log T(n) = s * log n + b; return (slope, R^2)."""
    log_n = np.log(input_sizes)
    log_t = np.log(runtimes)
    slope, intercept = np.polyfit(log_n, log_t, 1)
    ss_res = np.sum((log_t - (slope * log_n + intercept)) ** 2)
    ss_tot = np.sum((log_t - np.mean(log_t)) ** 2)
    r_squared = 1 - ss_res / ss_tot
    return slope, r_squared

def evaluate_solution(code: str, test_cases: list[dict],
                      time_limit_ms: float = 2000,
                      memory_limit_kb: float = 262144) -> dict:
    """Run a sandboxed solution and report pass@1 / eff@1.

    Assumes `code` has already been compiled into ./sandbox, which
    writes "<runtime_ms>,<memory_kb>" to stderr.
    """
    results = []
    for tc in test_cases:
        proc = subprocess.run(
            ["./sandbox", "--time", str(time_limit_ms), "--mem", str(memory_limit_kb)],
            input=tc["input"], capture_output=True, text=True,
        )
        runtime_ms, memory_kb = (float(x) for x in proc.stderr.split(",")[:2])
        results.append(ProfileResult(
            runtime_ms=runtime_ms,
            memory_kb=memory_kb,
            correct=(proc.stdout.strip() == tc["expected"]),
        ))
    pass_rate = sum(r.correct for r in results) / len(results)
    eff_rate = sum(
        r.correct and r.runtime_ms <= time_limit_ms and r.memory_kb <= memory_limit_kb
        for r in results
    ) / len(results)
    return {"pass@1": pass_rate, "eff@1": eff_rate}
```
| System | Metric | Value |
|---|---|---|
| SwiftSolve | Pass@1 | 61.54% (16/26 problems) |
| SwiftSolve | Solved@≤3 | 80.77% |
| SwiftSolve | Run-level success | 73.08% vs 52.6% (Claude Opus baseline) |
| SwiftSolve | Mean runtime overhead | 12.4s vs 6.8s (single agent) |
| AetherCode | Best Pass@1 (Extreme) | 3.8% (o4-mini-high) |