
DevOps Incident Agents

Multi-agent LLM systems for DevOps incident response automate the full incident lifecycle from failure detection through diagnosis and mitigation, with formal safety specifications enabling autonomous reliability engineering at cloud scale.

Overview

In cloud-scale systems, failures are the norm: distributed computing clusters exhibit hundreds of machine failures and thousands of disk failures, and software bugs and misconfigurations are even more frequent. Human-in-the-loop Site Reliability Engineering (SRE) practices cannot keep pace with modern cloud scale. Multi-agent incident response systems such as STRATUS address this by deploying specialized LLM agents, organized in a state machine, for autonomous, safety-aware failure mitigation.

STRATUS: Autonomous Site Reliability Engineering

STRATUS is an LLM-based multi-agent system consisting of specialized agents organized in a state machine:

Detection Agent: Monitors observability telemetry (logs, traces, metrics, system states) and identifies anomalies
Diagnosis Agent: Performs failure localization and root cause analysis (RCA) using system context
Mitigation Agent: Executes remediation actions through the cloud control plane
Undo Agent: Provides rollback capabilities for safe exploration

Toolset: STRATUS integrates NL2Kubectl (natural language to Kubernetes commands), linting/oracles for validation, and telemetry collection across services.
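
The linting step can be pictured as a guard that inspects an LLM-generated command before it reaches the cluster. The following is a minimal sketch under stated assumptions, not the linter STRATUS ships: the allowlist ALLOWED_VERBS, the blocklist BLOCKED_FLAGS, and the function lint_kubectl are hypothetical names introduced here for illustration.

```python
import shlex

# Hypothetical policy: verbs a mitigation agent may use, and flags it may not.
ALLOWED_VERBS = {"get", "describe", "logs", "scale", "rollout", "patch", "set"}
BLOCKED_FLAGS = {"--force", "--all-namespaces", "-A"}


def lint_kubectl(command: str) -> list[str]:
    """Return the problems found in a generated command (empty list if clean)."""
    try:
        tokens = shlex.split(command)
    except ValueError as exc:  # e.g. unbalanced quotes
        return [f"unparseable command: {exc}"]
    if len(tokens) < 2 or tokens[0] != "kubectl":
        return ["command must start with 'kubectl <verb>'"]

    problems = []
    verb = tokens[1]
    if verb not in ALLOWED_VERBS:
        problems.append(f"verb '{verb}' is not allowlisted")
    for tok in tokens[2:]:
        # Compare only the flag name so that '--force=true' is also caught.
        if tok.split("=", 1)[0] in BLOCKED_FLAGS:
            problems.append(f"flag '{tok}' is blocked")
    # Reject shell operators that appear as standalone tokens.
    if any(tok in {";", "&&", "||", "|"} for tok in tokens):
        problems.append("shell chaining is not allowed")
    return problems
```

A command would be passed to the control plane only when lint_kubectl returns an empty list; any reported problem can be fed back to the LLM as a correction prompt. A real linter would likely also validate resource names and namespaces against the live cluster, which this sketch omits.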

Transactional No-Regression (TNR)

STRATUS formalizes a key safety specification called Transactional No-Regression (TNR), which enables safe exploration and iteration:

<latex>\text{TNR}: \forall a \in \mathcal{A}_{mitigation}, \quad H(s_{after}(a)) \geq H(s_{before}(a))</latex>

where $H(s)$ is a system health function, $s_{before}$ and $s_{after}$ are system states before and after action $a$, and $\mathcal{A}_{mitigation}$ is the set of mitigation actions. TNR guarantees that no mitigation action makes the system worse than its pre-action state.

This is enforced through:

Pre-action health checks: Alert clearance verification, system health baseline, workload assessment
Post-action validation: Comparing health metrics after mitigation
Automatic rollback: Undo agent reverses actions that violate TNR
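
The three enforcement steps above can be sketched as a transaction wrapped around a single action. This is an illustration under assumptions, not the STRATUS implementation: the Action dataclass, its apply/undo callables, and run_with_tnr are hypothetical, and measure stands in for the health function H(s), reduced here to a single float.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    """Hypothetical mitigation action paired with its inverse."""
    name: str
    apply: Callable[[], None]
    undo: Callable[[], None]


def run_with_tnr(action: Action, measure: Callable[[], float]) -> bool:
    """Apply `action`; keep it only if H(s_after) >= H(s_before)."""
    health_before = measure()          # pre-action baseline
    try:
        action.apply()
        health_after = measure()       # post-action validation
    except Exception:
        action.undo()                  # a failed action is rolled back too
        return False
    if health_after < health_before:   # TNR violated
        action.undo()                  # automatic rollback
        return False
    return True


# Toy system: health is simply the number of ready replicas.
state = {"ready": 2}
bad = Action("scale-down",
             apply=lambda: state.update(ready=state["ready"] - 1),
             undo=lambda: state.update(ready=state["ready"] + 1))
good = Action("scale-up",
              apply=lambda: state.update(ready=state["ready"] + 1),
              undo=lambda: state.update(ready=state["ready"] - 1))
```

Running run_with_tnr(bad, lambda: state["ready"]) returns False and leaves the replica count unchanged, while the same call with good returns True and keeps the change. The sketch assumes every action has an exact inverse; in a real cluster some actions are not cleanly reversible, which is why the undo path needs its own design.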

Incident Lifecycle

LLM agents automate a four-stage incident lifecycle:

Alert Verification and Log Retrieval: System automatically consumes alerts and gathers context via APIs
Knowledge Retrieval: AI accesses historical incident tickets, system logs, and runbooks via RAG
Reasoning and Root Cause Analysis: LLM correlates error messages with real-time logs
Action and Communication: System executes remediation and drafts notifications

<latex>\text{MTTR}_{agent} = t_{detect} + t_{diagnose} + t_{mitigate} \ll \text{MTTR}_{human}</latex>
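
As a worked illustration of the sum above, the per-incident stage durations can be added and then averaged over incidents. The numbers below are invented for the example and are not measurements from STRATUS; the mttr helper and the dictionary keys are likewise hypothetical.

```python
from datetime import timedelta


def mttr(incidents: list[dict[str, timedelta]]) -> timedelta:
    """Mean of (detect + diagnose + mitigate) over a list of incidents."""
    totals = [i["detect"] + i["diagnose"] + i["mitigate"] for i in incidents]
    return sum(totals, timedelta()) / len(totals)


# Illustrative durations only.
incidents = [
    {"detect": timedelta(minutes=1),
     "diagnose": timedelta(minutes=4),
     "mitigate": timedelta(minutes=5)},
    {"detect": timedelta(minutes=2),
     "diagnose": timedelta(minutes=6),
     "mitigate": timedelta(minutes=12)},
]
```

Here the two incidents take 10 and 20 minutes end to end, so mttr(incidents) is 15 minutes. Tracking the three terms separately shows which stage dominates recovery time.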

Code Example

from dataclasses import dataclass
from enum import Enum
 
class IncidentState(Enum):
    DETECTING = "detecting"
    DIAGNOSING = "diagnosing"
    MITIGATING = "mitigating"
    VALIDATING = "validating"
    RESOLVED = "resolved"
    ROLLED_BACK = "rolled_back"
 
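@dataclass
class SystemHealth:
    # All three fields are assumed to be normalized to [0, 1] so the weights
    # are comparable (e.g. latency_p99 as a fraction of its SLO budget).
    error_rate: float
    latency_p99: float
    availability: float
 
    def score(self) -> float:
        # Higher is healthier: reward availability, penalize errors and latency.
        return (self.availability * 0.5
                - self.error_rate * 0.3
                - self.latency_p99 * 0.2)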
 
class StratusAgent:
    def __init__(self, llm, k8s_client, telemetry):
        self.llm = llm
        self.k8s = k8s_client
        self.telemetry = telemetry
        self.state = IncidentState.DETECTING
 
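    def detect_failure(self, alerts: list[dict]) -> dict:
        """Gather telemetry for the alerts and ask the LLM for a likely failure."""
        self.state = IncidentState.DETECTING
        context = self.telemetry.gather_context(alerts)
        self.state = IncidentState.DIAGNOSING
        diagnosis = self.llm.generate(
            f"Analyze these alerts and telemetry:\n{context}\n"
            f"Identify the likely failure and affected services."
        )
        return {"alerts": alerts, "diagnosis": diagnosis}
 
    def measure_health(self) -> SystemHealth:
        # Assumption: the telemetry client exposes current_metrics(), returning
        # a dict with error_rate, latency_p99, and availability keys.
        return SystemHealth(**self.telemetry.current_metrics())
 
    def mitigate_with_tnr(self, diagnosis: dict) -> bool:
        """Apply one mitigation; roll it back if it violates TNR."""
        self.state = IncidentState.MITIGATING
        health_before = self.measure_health()
        action = self.llm.generate(
            f"Generate kubectl mitigation for:\n{diagnosis}"
        )
        self.k8s.execute(action)
        self.state = IncidentState.VALIDATING
        health_after = self.measure_health()
        # TNR: post-action health must not fall below the pre-action baseline.
        if health_after.score() < health_before.score():
            self.k8s.rollback(action)
            self.state = IncidentState.ROLLED_BACK
            return False
        self.state = IncidentState.RESOLVED
        return True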

Architecture

stateDiagram-v2
    [*] --> Detection
    Detection --> Diagnosis: Failure Identified
    Diagnosis --> Mitigation: RCA Complete
    Mitigation --> Validation: Action Executed
    Validation --> Resolved: TNR Satisfied
    Validation --> Rollback: TNR Violated
    Rollback --> Diagnosis: Re-diagnose
    Resolved --> [*]
    state Detection {
        [*] --> AlertMonitor
        AlertMonitor --> TelemetryCollection
        TelemetryCollection --> AnomalyDetection
    }
    state Diagnosis {
        [*] --> LogAnalysis
        LogAnalysis --> RootCauseAnalysis
        RootCauseAnalysis --> ActionPlanning
    }
    state Mitigation {
        [*] --> NL2Kubectl
        NL2Kubectl --> Linting
        Linting --> Execution
    }
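
The top-level transitions in the diagram can be encoded as a lookup table, which makes illegal jumps (for example, from Detection straight to Mitigation) fail loudly. This is a sketch only: the event strings are derived from the diagram's edge labels, while the Stage enum, the TRANSITIONS table, and next_stage are hypothetical names, and the nested sub-states are left out.

```python
from enum import Enum, auto


class Stage(Enum):
    DETECTION = auto()
    DIAGNOSIS = auto()
    MITIGATION = auto()
    VALIDATION = auto()
    ROLLBACK = auto()
    RESOLVED = auto()


# (current stage, event) -> next stage, mirroring the diagram's labeled edges.
TRANSITIONS = {
    (Stage.DETECTION, "failure_identified"): Stage.DIAGNOSIS,
    (Stage.DIAGNOSIS, "rca_complete"): Stage.MITIGATION,
    (Stage.MITIGATION, "action_executed"): Stage.VALIDATION,
    (Stage.VALIDATION, "tnr_satisfied"): Stage.RESOLVED,
    (Stage.VALIDATION, "tnr_violated"): Stage.ROLLBACK,
    (Stage.ROLLBACK, "re_diagnose"): Stage.DIAGNOSIS,
}


def next_stage(current: Stage, event: str) -> Stage:
    """Return the stage reached from `current` on `event`, or raise ValueError."""
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise ValueError(
            f"no transition from {current.name} on '{event}'"
        ) from None
```

Because Rollback leads back to Diagnosis, an incident can loop through mitigation attempts until one satisfies TNR. A production system would probably also cap the number of retries, which this table does not model.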

Key Results

Metric	STRATUS	State-of-the-Art Baselines
Mitigation success rate	At least 1.5x higher	Baseline
Benchmark coverage	AIOpsLab + ITBench	Varies
Safety guarantee	TNR-enforced	None
Model flexibility	Multiple LLM backends	Often single model
Rollback capability	Automatic via Undo Agent	Manual
