====== Competitive Programming Agents ======
Multi-agent LLM systems for competitive programming go beyond mere code correctness to address algorithmic efficiency, time/memory constraints, and complexity-aware optimization at IOI and ICPC competition levels.
===== Overview =====
Large language models can generate syntactically correct code, but competitive programming demands more: solutions must satisfy strict time and memory budgets while solving algorithmically complex problems. Two key research directions have emerged. SwiftSolve introduces a multi-agent framework with complexity-aware profiling and repair(([[https://arxiv.org/abs/2510.22626|Singh et al. "SwiftSolve: A Self-Iterative, Complexity-Aware Multi-Agent Framework for Competitive Programming" (2025)]])). AetherCode provides a rigorous benchmark revealing that even frontier LLMs struggle with elite-level competition problems(([[https://arxiv.org/abs/2508.16402|"AetherCode: Evaluating LLMs Ability to Win in Premier Programming Competitions" (2025)]])).
===== SwiftSolve: Complexity-Aware Multi-Agent Framework =====
SwiftSolve frames competitive programming as a software engineering environment where specialized agents collaborate through typed, versioned JSON messages:
* **Planner Agent**: Proposes an algorithmic sketch based on problem analysis
* **Static Pruner**: A deterministic filter that eliminates high-risk algorithmic plans before execution
* **Coder Agent**: Generates ISO C++17 implementations from approved plans
* **Profiler Agent**: Compiles and executes candidates on a fixed input-size schedule, recording wall time and peak memory
* **Complexity Analyst**: Fits log-log growth curves to determine empirical complexity class
The complexity analysis fits a log-log regression model:
$$\log T(n) = s \cdot \log n + b$$
where $T(n)$ is the measured runtime, $n$ is the input size, and the slope $s$ determines the empirical complexity class (e.g., $s \approx 1$ implies $O(n)$, $s \approx 2$ implies $O(n^2)$). The fit quality is measured by $R^2$, with an LLM fallback when $R^2$ is low.
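As a sanity check, fitting synthetic $O(n^2)$ timings in log-log space recovers a slope near 2; a minimal sketch using NumPy's `polyfit` (the constant factor is arbitrary):

```python
import numpy as np

# Synthetic runtimes for a quadratic algorithm: T(n) = c * n^2
sizes = np.array([1_000, 2_000, 4_000, 8_000, 16_000])
times = 1e-7 * sizes.astype(float) ** 2  # milliseconds, c = 1e-7

# Linear fit in log-log space: log T = s * log n + b
slope, intercept = np.polyfit(np.log(sizes), np.log(times), 1)
print(round(slope, 3))  # → 2.0, i.e. an empirical O(n^2) class
```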
The efficiency metric extends the standard pass@k:
$$\text{eff@k} = \frac{|\{i : \text{correct}_i \wedge T_i \leq T_{limit} \wedge M_i \leq M_{limit}\}|}{k}$$
A controller enforces iteration caps and diminishing-returns stopping criteria across the agent loop.
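A controller with an iteration cap and a diminishing-returns rule might look like the following sketch (the cap and improvement threshold are illustrative values, not SwiftSolve's published settings):

```python
def should_continue(scores: list[float], max_iters: int = 6,
                    min_gain: float = 0.02) -> bool:
    """Stop when the iteration cap is hit or the latest score's
    improvement over the previous best falls below min_gain."""
    if len(scores) >= max_iters:
        return False  # iteration cap reached
    if len(scores) >= 2 and scores[-1] - max(scores[:-1]) < min_gain:
        return False  # diminishing returns: negligible improvement
    return True

print(should_continue([0.4, 0.6]))        # large gain, keep iterating → True
print(should_continue([0.4, 0.6, 0.61]))  # gain below threshold → False
```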
===== AetherCode Benchmark =====
AetherCode addresses limitations in prior benchmarks by using harder problems from IOI and ICPC competitions with expert-validated test suites. Problems are categorized by difficulty (Easy, Medium, Hard, Extreme) and recency (2024, 2025).
^ Model ^ Easy ^ Medium ^ Hard ^ Extreme ^ Pass@1 ^
| o4-mini-high | 65.3% | 32.1% | 8.0% | 3.8% | 35.5% |
| Gemini-2.5-Pro | 60.1% | 28.6% | 8.5% | 2.5% | 32.7% |
| GPT-4.1 | 23.9% | 5.7% | 1.1% | 0% | 10.5% |
Even frontier reasoning models achieve only ~35% Pass@1, far below human expert performance on elite problems.
===== Code Example =====
A minimal sketch of the profiling and evaluation loop. The `./sandbox` binary and its `runtime_ms,memory_kb` stderr report format are assumptions about the execution harness, not a real tool:

<code python>
import subprocess
import numpy as np
from dataclasses import dataclass


@dataclass
class ProfileResult:
    runtime_ms: float
    memory_kb: float
    correct: bool


def fit_complexity(input_sizes: list[int],
                   runtimes: list[float]) -> tuple[float, float]:
    """Fit log T(n) = s * log n + b; return the slope and R^2."""
    log_n = np.log(input_sizes)
    log_t = np.log(runtimes)
    slope, intercept = np.polyfit(log_n, log_t, 1)
    ss_res = np.sum((log_t - (slope * log_n + intercept)) ** 2)
    ss_tot = np.sum((log_t - np.mean(log_t)) ** 2)
    r_squared = 1 - ss_res / ss_tot
    return slope, r_squared


def evaluate_solution(binary_path: str, test_cases: list[dict],
                      time_limit_ms: float = 2000,
                      memory_limit_kb: float = 262144) -> dict:
    """Run a compiled solution under a sandbox that writes
    "runtime_ms,memory_kb" to stderr (interface assumed)."""
    results = []
    for tc in test_cases:
        proc = subprocess.run(
            ["./sandbox", "--time", str(time_limit_ms),
             "--mem", str(memory_limit_kb), binary_path],
            input=tc["input"], capture_output=True, text=True
        )
        runtime_ms, memory_kb = (float(x) for x in proc.stderr.split(","))
        results.append(ProfileResult(
            runtime_ms=runtime_ms,
            memory_kb=memory_kb,
            correct=(proc.stdout.strip() == tc["expected"])
        ))
    pass_rate = sum(r.correct for r in results) / len(results)
    eff_rate = sum(
        r.correct and r.runtime_ms <= time_limit_ms
        and r.memory_kb <= memory_limit_kb for r in results
    ) / len(results)
    return {"pass@1": pass_rate, "eff@1": eff_rate}
</code>
===== Architecture =====
<code>
graph TD
    A[Problem Statement] --> B[Planner Agent]
    B --> C{Static Pruner}
    C -->|Approved| D[Coder Agent - C++17]
    C -->|Rejected| B
    D --> E[Profiler Agent]
    E --> J[Log-Log Regression]
    J --> F[Complexity Analyst]
    F --> G{Complexity OK?}
    G -->|Yes| H{Correctness Check}
    G -->|No - Algorithmic Issue| B
    G -->|No - Implementation Issue| D
    H -->|Pass| I[Submit Solution]
    H -->|Fail| D
    subgraph Feedback Loop
        K[Controller] --> L{Diminishing Returns?}
        L -->|No| B
        L -->|Yes| M[Best Solution]
    end
</code>
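The diagram's routing, sending algorithmic failures back to the Planner and implementation failures back to the Coder, can be sketched as follows (the slope tolerance and return labels are illustrative assumptions):

```python
def route_feedback(empirical_slope: float, expected_slope: float,
                   correct: bool, tolerance: float = 0.3) -> str:
    """Decide which agent receives the repair request."""
    if empirical_slope > expected_slope + tolerance:
        return "planner"  # wrong complexity class: replan the algorithm
    if not correct:
        return "coder"    # right complexity, wrong answers: fix the code
    return "submit"

print(route_feedback(2.1, 1.0, True))    # quadratic vs expected linear
print(route_feedback(1.05, 1.0, False))  # within tolerance but incorrect
print(route_feedback(1.05, 1.0, True))   # correct and efficient
```

Routing by empirical slope first matters: a correct but quadratic solution still fails competition time limits, so it must go back to the Planner rather than the Coder.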
===== Key Results =====
^ System ^ Metric ^ Value ^
| SwiftSolve | Pass@1 | 61.54% (16/26 problems) |
| SwiftSolve | Solved@<=3 | 80.77% |
| SwiftSolve | Run-level success | 73.08% vs 52.6% (Claude Opus baseline) |
| SwiftSolve | Mean runtime overhead | 12.4s vs 6.8s (single agent) |
| AetherCode | Best Pass@1 (Extreme) | 3.8% (o4-mini-high) |
===== See Also =====
* [[software_testing_agents|Software Testing Agents]]
* [[budget_aware_reasoning|Budget-Aware Reasoning]]
* [[game_playing_agents|Game Playing Agents]]
===== References =====