====== DevOps Incident Agents ======

Multi-agent LLM systems for DevOps incident response automate the full incident lifecycle, from failure detection through diagnosis and mitigation, with formal safety specifications enabling autonomous reliability engineering at cloud scale.

===== Overview =====

In cloud-scale systems, failures are the norm: distributed computing clusters exhibit hundreds of machine failures and thousands of disk failures, and software bugs and misconfigurations are even more frequent. Human-in-the-loop Site Reliability Engineering (SRE) practices cannot keep pace with modern cloud scale. Multi-agent incident response systems, STRATUS among them, address this by deploying specialized LLM agents organized in state machines for autonomous, safety-aware failure mitigation.(([[https://arxiv.org/abs/2511.15755|"Multi-Agent Incident Response with LLM Agents" (2025)]]))

===== STRATUS: Autonomous Site Reliability Engineering =====

STRATUS is an LLM-based multi-agent system consisting of specialized agents organized in a state machine:(([[https://arxiv.org/abs/2506.02009|Chen et al. "STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds" (2025)]]))

  * **Detection Agent**: Monitors observability telemetry (logs, traces, metrics, system states) and identifies anomalies
  * **Diagnosis Agent**: Performs failure localization and root cause analysis (RCA) using system context
  * **Mitigation Agent**: Executes remediation actions through the cloud control plane
  * **Undo Agent**: Provides rollback capabilities for safe exploration

**Toolset**: STRATUS integrates NL2Kubectl (natural language to Kubernetes commands), linting and oracles for validation, and telemetry collection across services.
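The linting step in the toolset can be pictured as a gate between command generation and execution. The following is a minimal sketch of such a gate, not the STRATUS linter: the function name ''lint_kubectl'', the verb allowlist, and the blocked-flag list are all assumptions chosen for illustration, and the sketch assumes the verb directly follows ''kubectl''.

```python
import shlex

# Hypothetical policy for illustration only (not taken from STRATUS):
# verbs the agent may run unattended, and flags considered too broad.
ALLOWED_VERBS = {"get", "describe", "logs", "scale", "rollout", "patch", "apply"}
BLOCKED_FLAGS = {"--all", "--all-namespaces", "-A", "--force"}
SHELL_META = set(";|&`")


def lint_kubectl(command: str) -> list[str]:
    """Return the problems found in a generated command (empty list = pass)."""
    try:
        tokens = shlex.split(command)
    except ValueError as exc:  # e.g. unbalanced quotes in the LLM output
        return [f"unparseable command: {exc}"]

    if not tokens or tokens[0] != "kubectl":
        return ["command must start with 'kubectl'"]

    problems = []
    # Simplification: the verb is expected right after 'kubectl'.
    verb = tokens[1] if len(tokens) > 1 else ""
    if verb not in ALLOWED_VERBS:
        problems.append(f"verb {verb!r} is not on the allowlist")

    for token in tokens[2:]:
        flag = token.split("=", 1)[0]  # treat '--force=true' like '--force'
        if flag in BLOCKED_FLAGS:
            problems.append(f"flag {flag!r} is too broad for autonomous use")
        if SHELL_META & set(token):
            problems.append(f"token {token!r} contains shell metacharacters")
    return problems


if __name__ == "__main__":
    for cmd in ("kubectl scale deployment/web --replicas=3 -n shop",
                "kubectl delete pods --all"):
        print(cmd, "->", lint_kubectl(cmd) or "ok")
```

In a pipeline of this shape, a non-empty result would typically be fed back to the model as a correction prompt instead of being executed. A static check like this complements, but does not replace, runtime validation of the action's effect.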
===== Transactional No-Regression (TNR) =====

STRATUS formalizes a key safety specification called Transactional No-Regression (TNR), which enables safe exploration and iteration:

$$\text{TNR}: \forall a \in \mathcal{A}_{mitigation}, \quad H(s_{after}(a)) \geq H(s_{before}(a))$$

where $H(s)$ is a system health function, $s_{before}$ and $s_{after}$ are system states before and after action $a$, and $\mathcal{A}_{mitigation}$ is the set of mitigation actions. TNR guarantees that no mitigation action makes the system worse than its pre-action state. This is enforced through:

  * **Pre-action health checks**: Alert clearance verification, system health baseline, workload assessment
  * **Post-action validation**: Comparing health metrics after mitigation
  * **Automatic rollback**: Undo agent reverses actions that violate TNR

===== Incident Lifecycle =====

LLM agents automate a four-stage incident lifecycle:

  - **Alert Verification and Log Retrieval**: System automatically consumes alerts and gathers context via APIs
  - **Knowledge Retrieval**: AI accesses historical incident tickets, system logs, and runbooks via RAG
  - **Reasoning and Root Cause Analysis**: LLM correlates error messages with real-time logs
  - **Action and Communication**: System executes remediation and drafts notifications

The agent's mean time to recovery (MTTR) is the sum of its detection, diagnosis, and mitigation times, which is intended to be far below the human-driven equivalent:

$$\text{MTTR}_{agent} = t_{detect} + t_{diagnose} + t_{mitigate} \ll \text{MTTR}_{human}$$

===== Code Example =====

The following sketch illustrates a TNR-guarded mitigation loop. The ''llm'', ''k8s_client'', and ''telemetry'' objects are illustrative interfaces, not a real library API.

<code python>
from dataclasses import dataclass
from enum import Enum


class IncidentState(Enum):
    DETECTING = "detecting"
    DIAGNOSING = "diagnosing"
    MITIGATING = "mitigating"
    VALIDATING = "validating"
    RESOLVED = "resolved"
    ROLLED_BACK = "rolled_back"


@dataclass
class SystemHealth:
    error_rate: float
    latency_p99: float
    availability: float

    def score(self) -> float:
        # Higher is better: availability raises the score, errors and tail
        # latency lower it. Weights are illustrative and assume the inputs
        # are normalized to comparable ranges.
        return (self.availability * 0.5
                - self.error_rate * 0.3
                - self.latency_p99 * 0.2)


class StratusAgent:
    def __init__(self, llm, k8s_client, telemetry):
        self.llm = llm
        self.k8s = k8s_client
        self.telemetry = telemetry
        self.state = IncidentState.DETECTING

    def measure_health(self) -> SystemHealth:
        # Assumption: telemetry.current_metrics() (hypothetical) returns a
        # dict with error_rate, latency_p99, and availability keys.
        return SystemHealth(**self.telemetry.current_metrics())

    def detect_failure(self, alerts: list[dict]) -> dict:
        self.state = IncidentState.DETECTING
        context = self.telemetry.gather_context(alerts)
        diagnosis = self.llm.generate(
            f"Analyze these alerts and telemetry:\n{context}\n"
            f"Identify the likely failure and affected services."
        )
        return {"alerts": alerts, "diagnosis": diagnosis}

    def mitigate_with_tnr(self, diagnosis: dict) -> bool:
        self.state = IncidentState.MITIGATING
        health_before = self.measure_health()  # pre-action baseline
        action = self.llm.generate(
            f"Generate kubectl mitigation for:\n{diagnosis}"
        )
        self.k8s.execute(action)

        self.state = IncidentState.VALIDATING
        health_after = self.measure_health()   # post-action validation
        if health_after.score() < health_before.score():
            # TNR violated: undo the action and report failure
            self.k8s.rollback(action)
            self.state = IncidentState.ROLLED_BACK
            return False
        self.state = IncidentState.RESOLVED
        return True
</code>

===== Architecture =====

<code mermaid>
stateDiagram-v2
    [*] --> Detection
    Detection --> Diagnosis: Failure Identified
    Diagnosis --> Mitigation: RCA Complete
    Mitigation --> Validation: Action Executed
    Validation --> Resolved: TNR Satisfied
    Validation --> Rollback: TNR Violated
    Rollback --> Diagnosis: Re-diagnose
    Resolved --> [*]

    state Detection {
        [*] --> AlertMonitor
        AlertMonitor --> TelemetryCollection
        TelemetryCollection --> AnomalyDetection
    }

    state Diagnosis {
        [*] --> LogAnalysis
        LogAnalysis --> RootCauseAnalysis
        RootCauseAnalysis --> ActionPlanning
    }

    state Mitigation {
        [*] --> NL2Kubectl
        NL2Kubectl --> Linting
        Linting --> Execution
    }
</code>

===== Key Results =====

^ Metric ^ STRATUS ^ State-of-the-Art Baselines ^
| Mitigation success rate | At least 1.5x higher | Baseline |
| Benchmark coverage | AIOpsLab + ITBench | Varies |
| Safety guarantee | TNR-enforced | None |
| Model flexibility | Multiple LLM backends | Often single model |
| Rollback capability | Automatic via Undo Agent | Manual |

===== See Also =====

  * [[database_tuning_agents|Database Tuning Agents]]
  * [[software_testing_agents|Software Testing Agents]]
  * [[financial_trading_agents|Financial Trading Agents]]

===== References =====