====== DevOps Incident Agents ======

Multi-agent LLM systems for DevOps incident response automate the full incident lifecycle from failure detection through diagnosis and mitigation, with formal safety specifications enabling autonomous reliability engineering at cloud scale.

===== Overview =====

In cloud-scale systems, failures are the norm: a distributed computing cluster exhibits hundreds of machine failures and thousands of disk failures, and software bugs and misconfigurations are more frequent still. Human-in-the-loop Site Reliability Engineering (SRE) practices cannot keep pace with failures at this scale. Multi-agent incident response systems such as STRATUS address this by deploying specialized LLM agents, organized in a state machine, for autonomous, safety-aware failure mitigation.(([[https://arxiv.org/abs/2511.15755|"Multi-Agent Incident Response with LLM Agents" (2025)]]))

===== STRATUS: Autonomous Site Reliability Engineering =====

STRATUS is an LLM-based multi-agent system consisting of specialized agents organized in a state machine:(([[https://arxiv.org/abs/2506.02009|Chen et al. "STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds" (2025)]]))

  * **Detection Agent**: Monitors observability telemetry (logs, traces, metrics, system states) and identifies anomalies
  * **Diagnosis Agent**: Performs failure localization and root cause analysis (RCA) using system context
  * **Mitigation Agent**: Executes remediation actions through the cloud control plane
  * **Undo Agent**: Provides rollback capabilities for safe exploration

**Toolset**: STRATUS integrates NL2Kubectl (natural language to Kubernetes commands), linting/oracles for validation, and telemetry collection across services.
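
The linting step in this toolset can be pictured as a guard that inspects a generated command before it reaches the cluster. The sketch below is a minimal illustration under stated assumptions, not the STRATUS linter: the function name ''lint_kubectl'', the verb allowlist, and the rule that mutating commands must name a namespace are all hypothetical choices made for the example.

<code python>
import shlex

# Hypothetical policy: verbs the agent may run, split by whether they change state.
READ_ONLY_VERBS = {"get", "describe", "logs", "top"}
MUTATING_VERBS = {"scale", "rollout", "patch", "apply", "delete"}


def lint_kubectl(command: str) -> list[str]:
    """Return the problems found in a generated kubectl command.

    An empty list means the command passed this illustrative lint.
    """
    try:
        tokens = shlex.split(command)
    except ValueError as exc:  # for example, unbalanced quotes
        return [f"unparseable command: {exc}"]

    if len(tokens) < 2 or tokens[0] != "kubectl":
        return ["command must start with 'kubectl <verb>'"]

    problems = []
    verb = tokens[1]
    if verb not in READ_ONLY_VERBS | MUTATING_VERBS:
        problems.append(f"verb '{verb}' is not on the allowlist")

    # Assumed rule: state-changing commands must target an explicit namespace.
    has_namespace = any(
        t in ("-n", "--namespace") or t.startswith("--namespace=")
        for t in tokens
    )
    if verb in MUTATING_VERBS and not has_namespace:
        problems.append("mutating command must specify a namespace")

    # Reject characters that a shell would treat as command separators.
    if any(ch in t for t in tokens for ch in ";|&"):
        problems.append("shell operators are not allowed")

    return problems
</code>

A caller would execute the command only when the returned list is empty, and could feed any reported problems back to the LLM as context for a corrected command.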

===== Transactional No-Regression (TNR) =====

STRATUS formalizes a key safety specification called Transactional No-Regression (TNR), which enables safe exploration and iteration:

<latex>\text{TNR}: \forall a \in \mathcal{A}_{mitigation}, \quad H(s_{after}(a)) \geq H(s_{before}(a))</latex>

where $H(s)$ is a system health function, $s_{before}$ and $s_{after}$ are system states before and after action $a$, and $\mathcal{A}_{mitigation}$ is the set of mitigation actions. TNR guarantees that no mitigation action makes the system worse than its pre-action state.

This is enforced through:
  * **Pre-action health checks**: Alert clearance verification, system health baseline, workload assessment
  * **Post-action validation**: Comparing health metrics after mitigation
  * **Automatic rollback**: Undo agent reverses actions that violate TNR
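
The three enforcement steps above can be sketched as a small transaction: record a health baseline, apply each action while remembering how to undo it, then keep the result only if health has not dropped. This is a minimal sketch, not the STRATUS implementation. The name ''run_with_tnr'', the pairing of each action with an explicit undo callable, and the toy replica-count health function are assumptions made for illustration.

<code python>
from typing import Callable

# Each step pairs a forward action with the undo that reverses it.
Step = tuple[Callable[[], None], Callable[[], None]]


def run_with_tnr(steps: list[Step], health: Callable[[], float]) -> bool:
    """Apply steps as one transaction and roll back if health regresses.

    Returns True if the changes were kept, False if they were undone.
    """
    baseline = health()  # pre-action health check
    undo_stack: list[Callable[[], None]] = []

    for apply, undo in steps:
        apply()
        undo_stack.append(undo)

    if health() >= baseline:  # post-action validation: H(after) >= H(before)
        return True

    # TNR violated: reverse the applied steps, most recent first.
    while undo_stack:
        undo_stack.pop()()
    return False


cluster = {"replicas": 3}  # toy system state


def health() -> float:
    # Toy health function: fraction of the 3 desired replicas that exist.
    return min(cluster["replicas"], 3) / 3


def set_replicas(n: int) -> Callable[[], None]:
    def _set() -> None:
        cluster["replicas"] = n
    return _set


# Scaling down to 1 replica lowers health, so the transaction is undone.
kept = run_with_tnr([(set_replicas(1), set_replicas(3))], health)
print(kept, cluster["replicas"])  # False 3
</code>

The sketch does not handle an action that raises partway through. A fuller version would also need to decide how to undo a step that was only partially applied.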

===== Incident Lifecycle =====

LLM agents automate a four-stage incident lifecycle:

  - **Alert Verification and Log Retrieval**: System automatically consumes alerts and gathers context via APIs
  - **Knowledge Retrieval**: AI accesses historical incident tickets, system logs, and runbooks via RAG
  - **Reasoning and Root Cause Analysis**: LLM correlates error messages with real-time logs
  - **Action and Communication**: System executes remediation and drafts notifications
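
The knowledge retrieval stage can be illustrated with a toy retriever that ranks runbooks against an incoming alert. A production RAG pipeline would more likely use embedding similarity over a vector index. The sketch below swaps in plain token overlap (Jaccard similarity) so that it stays self-contained. The function names, runbook ids, and runbook texts are all invented for the example.

<code python>
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(alert: str, documents: dict[str, str], k: int = 2) -> list[str]:
    """Return the ids of up to k documents that best match the alert."""
    query = tokenize(alert)
    scored = []
    for doc_id, text in documents.items():
        doc = tokenize(text)
        union = query | doc
        # Jaccard similarity: shared tokens divided by all distinct tokens.
        score = len(query & doc) / len(union) if union else 0.0
        scored.append((score, doc_id))
    # Highest score first; ties are broken by id so results are deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:k] if score > 0]


# Invented runbooks, for illustration only.
runbooks = {
    "rb-oom": "Pod OOMKilled: raise memory limit and restart deployment",
    "rb-dns": "DNS resolution failures: check CoreDNS pods",
    "rb-disk": "Disk pressure on node: clean images",
}

print(retrieve("cart pod OOMKilled memory limit exceeded", runbooks))  # ['rb-oom']
</code>

The retrieved runbook text would then be placed in the LLM prompt alongside the alert and live logs for the reasoning stage.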

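<latex>\text{MTTR}_{agent} = t_{detect} + t_{diagnose} + t_{mitigate} \ll \text{MTTR}_{human}</latex>

where $t_{detect}$, $t_{diagnose}$, and $t_{mitigate}$ are the times the agents spend on detection, diagnosis, and mitigation, and MTTR is the mean time to resolve an incident. The intent is that this sum is far smaller than the corresponding time for a human-driven response.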
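
The decomposition above can be computed directly from incident timestamps. The following sketch assumes each incident is recorded as four timestamps (fault start, detection, diagnosis, mitigation). The function names and the timestamps in the usage lines are made up for illustration and are not measurements from any system.

<code python>
from datetime import datetime, timedelta


def resolution_phases(fault_at, detected_at, diagnosed_at, mitigated_at):
    """Split one incident's time to resolve into its three phases."""
    if not fault_at <= detected_at <= diagnosed_at <= mitigated_at:
        raise ValueError("timestamps must be in chronological order")
    phases = {
        "detect": detected_at - fault_at,
        "diagnose": diagnosed_at - detected_at,
        "mitigate": mitigated_at - diagnosed_at,
    }
    phases["total"] = sum(phases.values(), timedelta())
    return phases


def mean_time_to_resolve(incidents):
    """Average the total resolution time over a non-empty list of incidents."""
    totals = [resolution_phases(*stamps)["total"] for stamps in incidents]
    return sum(totals, timedelta()) / len(totals)


start = datetime(2025, 1, 1, 12, 0, 0)  # made-up timestamps
phases = resolution_phases(
    start,
    start + timedelta(seconds=30),
    start + timedelta(minutes=3),
    start + timedelta(minutes=5),
)
print(phases["total"])  # 0:05:00
</code>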

===== Code Example =====

<code python>
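from dataclasses import dataclass
from enum import Enum


class IncidentState(Enum):
    DETECTING = "detecting"
    DIAGNOSING = "diagnosing"
    MITIGATING = "mitigating"
    VALIDATING = "validating"
    RESOLVED = "resolved"
    ROLLED_BACK = "rolled_back"


@dataclass
class SystemHealth:
    error_rate: float    # fraction of failed requests, in [0, 1]
    latency_p99: float   # p99 latency, assumed normalized to [0, 1]
    availability: float  # fraction of successful probes, in [0, 1]

    def score(self) -> float:
        # Illustrative weights only; a higher score means a healthier system.
        return (self.availability * 0.5
                - self.error_rate * 0.3
                - self.latency_p99 * 0.2)


class StratusAgent:
    """Illustrative sketch, not the STRATUS implementation.

    llm, k8s_client, and telemetry are hypothetical interfaces.
    """

    def __init__(self, llm, k8s_client, telemetry):
        self.llm = llm
        self.k8s = k8s_client
        self.telemetry = telemetry
        self.state = IncidentState.DETECTING

    def measure_health(self) -> SystemHealth:
        # Assumes telemetry.current_metrics() returns a dict with these keys.
        metrics = self.telemetry.current_metrics()
        return SystemHealth(
            error_rate=metrics["error_rate"],
            latency_p99=metrics["latency_p99"],
            availability=metrics["availability"],
        )

    def detect_failure(self, alerts: list[dict]) -> dict:
        self.state = IncidentState.DETECTING
        context = self.telemetry.gather_context(alerts)
        self.state = IncidentState.DIAGNOSING
        diagnosis = self.llm.generate(
            f"Analyze these alerts and telemetry:\n{context}\n"
            f"Identify the likely failure and affected services."
        )
        return {"alerts": alerts, "diagnosis": diagnosis}

    def mitigate_with_tnr(self, diagnosis: dict) -> bool:
        self.state = IncidentState.MITIGATING
        health_before = self.measure_health()  # pre-action baseline
        action = self.llm.generate(
            f"Generate kubectl mitigation for:\n{diagnosis}"
        )
        self.k8s.execute(action)

        self.state = IncidentState.VALIDATING
        health_after = self.measure_health()
        if health_after.score() < health_before.score():
            # TNR violated: undo the action and report failure.
            self.k8s.rollback(action)
            self.state = IncidentState.ROLLED_BACK
            return False
        self.state = IncidentState.RESOLVED
        return True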
</code>

===== Architecture =====

<mermaid>
stateDiagram-v2
    [*] --> Detection
    Detection --> Diagnosis: Failure Identified
    Diagnosis --> Mitigation: RCA Complete
    Mitigation --> Validation: Action Executed
    Validation --> Resolved: TNR Satisfied
    Validation --> Rollback: TNR Violated
    Rollback --> Diagnosis: Re-diagnose
    Resolved --> [*]

    state Detection {
        [*] --> AlertMonitor
        AlertMonitor --> TelemetryCollection
        TelemetryCollection --> AnomalyDetection
    }

    state Diagnosis {
        [*] --> LogAnalysis
        LogAnalysis --> RootCauseAnalysis
        RootCauseAnalysis --> ActionPlanning
    }

    state Mitigation {
        [*] --> NL2Kubectl
        NL2Kubectl --> Linting
        Linting --> Execution
    }
</mermaid>
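
The top-level transitions in the diagram can be encoded as a lookup table, which makes illegal transitions easy to reject. This is a minimal sketch: the ''Stage'' enum, the event strings (derived from the edge labels above), and the ''step'' and ''run'' helpers are assumptions for illustration. The nested sub-states inside Detection, Diagnosis, and Mitigation are omitted.

<code python>
from enum import Enum


class Stage(Enum):
    DETECTION = "detection"
    DIAGNOSIS = "diagnosis"
    MITIGATION = "mitigation"
    VALIDATION = "validation"
    ROLLBACK = "rollback"
    RESOLVED = "resolved"


# (current stage, event) -> next stage, one entry per edge in the diagram.
TRANSITIONS = {
    (Stage.DETECTION, "failure_identified"): Stage.DIAGNOSIS,
    (Stage.DIAGNOSIS, "rca_complete"): Stage.MITIGATION,
    (Stage.MITIGATION, "action_executed"): Stage.VALIDATION,
    (Stage.VALIDATION, "tnr_satisfied"): Stage.RESOLVED,
    (Stage.VALIDATION, "tnr_violated"): Stage.ROLLBACK,
    (Stage.ROLLBACK, "re_diagnose"): Stage.DIAGNOSIS,
}


def step(stage: Stage, event: str) -> Stage:
    """Return the next stage, or raise if the diagram has no such edge."""
    try:
        return TRANSITIONS[(stage, event)]
    except KeyError:
        raise ValueError(f"no transition from {stage.name} on '{event}'") from None


def run(events: list[str], start: Stage = Stage.DETECTION) -> list[Stage]:
    """Replay a sequence of events and return every stage visited."""
    path = [start]
    for event in events:
        path.append(step(path[-1], event))
    return path


# One failed mitigation attempt followed by a successful retry.
path = run([
    "failure_identified", "rca_complete", "action_executed", "tnr_violated",
    "re_diagnose", "rca_complete", "action_executed", "tnr_satisfied",
])
print(path[-1].name)  # RESOLVED
</code>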

===== Key Results =====

^ Metric ^ STRATUS ^ State-of-the-Art Baselines ^
| Mitigation success rate | At least 1.5x higher | Baseline |
| Benchmark coverage | AIOpsLab + ITBench | Varies |
| Safety guarantee | TNR-enforced | None |
| Model flexibility | Multiple LLM backends | Often single model |
| Rollback capability | Automatic via Undo Agent | Manual |

===== See Also =====

  * [[database_tuning_agents|Database Tuning Agents]]
  * [[software_testing_agents|Software Testing Agents]]
  * [[financial_trading_agents|Financial Trading Agents]]

===== References =====