Multi-agent LLM systems for competitive programming go beyond mere code correctness to address algorithmic efficiency, time/memory constraints, and complexity-aware optimization at IOI and ICPC competition levels.
Large language models can generate syntactically correct code, but competitive programming demands more: solutions must satisfy strict time and memory budgets while solving algorithmically complex problems. Two key research directions have emerged. SwiftSolve introduces a multi-agent framework with complexity-aware profiling and repair. AetherCode provides a rigorous benchmark revealing that even frontier LLMs struggle with elite-level competition problems.
SwiftSolve frames competitive programming as a software engineering environment where specialized agents collaborate through typed, versioned JSON messages:
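To make the message-passing concrete, here is a minimal sketch of a typed, versioned JSON envelope such agents might exchange. The field names, agent roles, and payload contents are illustrative assumptions, not SwiftSolve's published schema:

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical message envelope; field names and roles are assumptions.
@dataclass
class AgentMessage:
    sender: str          # e.g. "planner", "coder", "profiler", "repairer"
    recipient: str
    msg_type: str        # e.g. "plan", "code", "profile_report", "patch"
    version: int         # incremented on each revision of the payload
    payload: dict = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(raw: str) -> "AgentMessage":
        return AgentMessage(**json.loads(raw))

# A profiler reporting an empirical complexity estimate to a repair agent.
msg = AgentMessage("profiler", "repairer", "profile_report", 2,
                   {"slope": 1.98, "r_squared": 0.99, "verdict": "O(n^2)"})
round_trip = AgentMessage.from_json(msg.to_json())
```

Typed envelopes like this let each agent validate and version what it receives instead of parsing free-form text from its peers.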
The complexity analysis fits a log-log regression model:
<latex>\log T(n) = s \cdot \log n + b</latex>
where $T(n)$ is the measured runtime, $n$ is the input size, and the slope $s$ determines the empirical complexity class (e.g., $s \approx 1$ implies $O(n)$, $s \approx 2$ implies $O(n^2)$). The fit quality is measured by $R^2$, with an LLM fallback when $R^2$ is low.
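As a sanity check, fitting this regression on synthetic quadratic runtimes recovers a slope near 2; the input sizes and cost constant below are illustrative:

```python
import numpy as np

# Synthetic runtimes following T(n) = c * n^2 exactly, so the fitted
# slope should come out at ~2 (empirical O(n^2)).
sizes = np.array([1_000, 2_000, 4_000, 8_000, 16_000])
runtimes_ms = 1e-6 * sizes.astype(float) ** 2

# Fit log T(n) = s * log n + b.
slope, intercept = np.polyfit(np.log(sizes), np.log(runtimes_ms), 1)

# R^2 of the log-log fit.
pred = slope * np.log(sizes) + intercept
ss_res = np.sum((np.log(runtimes_ms) - pred) ** 2)
ss_tot = np.sum((np.log(runtimes_ms) - np.log(runtimes_ms).mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
```

On real measurements the slope is noisy, which is why a low $R^2$ triggers the LLM fallback rather than trusting the fit.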
The efficiency metric extends the standard pass@k:
<latex>\text{eff@k} = \frac{|\{i : \text{correct}_i \wedge T_i \leq T_{limit} \wedge M_i \leq M_{limit}\}|}{k}</latex>
A controller enforces iteration caps and diminishing-returns stopping criteria across the agent loop.
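A minimal sketch of such a controller, assuming the loop tracks a per-iteration quality score; the cap, threshold, and function names are assumptions for illustration:

```python
# Illustrative controller: stop when an iteration cap is hit or when the
# score improvement between successive iterations falls below `min_gain`
# (a diminishing-returns criterion).
def run_loop(step, max_iters: int = 8, min_gain: float = 0.05) -> list[float]:
    scores: list[float] = []
    for i in range(max_iters):
        scores.append(step(i))
        if len(scores) >= 2 and scores[-1] - scores[-2] < min_gain:
            break  # diminishing returns: improvement below threshold
    return scores

# Example: each round halves the remaining gap, so gains shrink and the
# loop halts before exhausting the iteration cap.
history = run_loop(lambda i: 1.0 - 0.5 ** (i + 1))
```

Capping iterations bounds token cost, while the improvement threshold avoids spending further rounds on marginal repairs.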
AetherCode addresses limitations in prior benchmarks by using harder problems from IOI and ICPC competitions with expert-validated test suites. Problems are categorized by difficulty (Easy, Medium, Hard, Extreme) and recency (2024, 2025).
| Model | Easy | Medium | Hard | Extreme | Pass@1 |
|---|---|---|---|---|---|
| o4-mini-high | 65.3% | 32.1% | 8.0% | 3.8% | 35.5% |
| Gemini-2.5-Pro | 60.1% | 28.6% | 8.5% | 2.5% | 32.7% |
| GPT-4.1 | 23.9% | 5.7% | 1.1% | 0% | 10.5% |
Even frontier reasoning models achieve only ~35% Pass@1, far below human expert performance on elite problems.
The profiling and evaluation pipeline can be sketched as follows:

```python
import subprocess
from dataclasses import dataclass

import numpy as np

@dataclass
class ProfileResult:
    runtime_ms: float
    memory_kb: float
    correct: bool

def fit_complexity(input_sizes: list[int], runtimes: list[float]) -> tuple[float, float]:
    """Fit log T(n) = s * log n + b; return (slope, R^2)."""
    log_n = np.log(input_sizes)
    log_t = np.log(runtimes)
    slope, intercept = np.polyfit(log_n, log_t, 1)
    ss_res = np.sum((log_t - (slope * log_n + intercept)) ** 2)
    ss_tot = np.sum((log_t - np.mean(log_t)) ** 2)
    r_squared = 1 - ss_res / ss_tot
    return slope, r_squared

def evaluate_solution(code: str, test_cases: list[dict],
                      time_limit_ms: float = 2000,
                      memory_limit_kb: float = 262144) -> dict:
    """Run a sandboxed solution and report pass@1 / eff@1.

    Assumes `code` has already been compiled into ./sandbox, which
    writes "<runtime_ms>,<memory_kb>" to stderr.
    """
    results = []
    for tc in test_cases:
        proc = subprocess.run(
            ["./sandbox", "--time", str(time_limit_ms), "--mem", str(memory_limit_kb)],
            input=tc["input"], capture_output=True, text=True,
        )
        runtime_ms, memory_kb = (float(x) for x in proc.stderr.split(",")[:2])
        results.append(ProfileResult(
            runtime_ms=runtime_ms,
            memory_kb=memory_kb,
            correct=(proc.stdout.strip() == tc["expected"]),
        ))
    pass_rate = sum(r.correct for r in results) / len(results)
    eff_rate = sum(
        r.correct and r.runtime_ms <= time_limit_ms and r.memory_kb <= memory_limit_kb
        for r in results
    ) / len(results)
    return {"pass@1": pass_rate, "eff@1": eff_rate}
```
| System | Metric | Value |
|---|---|---|
| SwiftSolve | Pass@1 | 61.54% (16/26 problems) |
| SwiftSolve | Solved@≤3 | 80.77% |
| SwiftSolve | Run-level success | 73.08% vs 52.6% (Claude Opus baseline) |
| SwiftSolve | Mean runtime overhead | 12.4s vs 6.8s (single agent) |
| AetherCode | Best Pass@1 (Extreme) | 3.8% (o4-mini-high) |