Multi-agent LLM systems for DevOps incident response automate the full incident lifecycle from failure detection through diagnosis and mitigation, with formal safety specifications enabling autonomous reliability engineering at cloud scale.
In cloud-scale systems, failures are the norm: distributed computing clusters exhibit hundreds of machine failures and thousands of disk failures, and software bugs and misconfigurations are even more frequent. Human-in-the-loop Site Reliability Engineering (SRE) practices cannot keep pace with this scale. Multi-agent incident response systems such as STRATUS address the gap by deploying specialized LLM agents, organized in a state machine, for autonomous and safety-aware failure mitigation.
STRATUS is an LLM-based multi-agent system consisting of specialized agents organized in a state machine:
Toolset: STRATUS integrates NL2Kubectl (natural language to Kubernetes commands), linting/oracles for validation, and telemetry collection across services.
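As a rough illustration of the linting step, the sketch below checks an LLM-generated kubectl command before it would be executed. It is not STRATUS's actual linter: the function name `lint_kubectl`, the verb allowlist, and the forbidden-flag set are all assumptions chosen for the example.

```python
import shlex

# Hypothetical allowlist: verbs this sketch treats as acceptable for an agent.
ALLOWED_VERBS = {"get", "describe", "logs", "scale", "rollout"}
# Hypothetical set of flags considered too broad for unattended use.
FORBIDDEN_FLAGS = {"--all", "--force"}


def lint_kubectl(command: str) -> list[str]:
    """Return a list of problems; an empty list means the command passed."""
    try:
        tokens = shlex.split(command)
    except ValueError as exc:  # e.g. unbalanced quotes in generated text
        return [f"unparseable command: {exc}"]

    if not tokens or tokens[0] != "kubectl":
        return ["command must start with 'kubectl'"]
    if len(tokens) < 2:
        return ["missing kubectl verb"]

    problems = []
    verb = tokens[1]  # assumes the verb directly follows 'kubectl'
    if verb not in ALLOWED_VERBS:
        problems.append(f"verb '{verb}' is not on the allowlist")
    for tok in tokens[2:]:
        flag = tok.split("=", 1)[0]  # strip '=value' from '--flag=value'
        if flag in FORBIDDEN_FLAGS:
            problems.append(f"flag '{flag}' is not permitted")
    return problems


print(lint_kubectl("kubectl scale deployment web --replicas=3"))
print(lint_kubectl("kubectl delete pods --all"))
```

The check is deliberately conservative: a command that places global flags before the verb is rejected rather than parsed, on the assumption that a false rejection is cheaper than executing an unvetted command.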
STRATUS formalizes a key safety specification called Transactional No-Regression (TNR), which enables safe exploration and iteration:
<latex>\text{TNR}: \forall a \in \mathcal{A}_{mitigation}, \quad H(s_{after}(a)) \geq H(s_{before}(a))</latex>
where $H(s)$ is a system health function, $s_{before}$ and $s_{after}$ are system states before and after action $a$, and $\mathcal{A}_{mitigation}$ is the set of mitigation actions. TNR guarantees that no mitigation action makes the system worse than its pre-action state.
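The TNR condition can be tried out on an in-memory toy state, without a cluster. The following is a minimal sketch, not the STRATUS implementation: the `health` function stands in for $H(s)$, and the state keys, the two example actions, and `apply_with_tnr` are hypothetical names introduced for illustration.

```python
import copy
from typing import Callable

State = dict[str, float]


def health(state: State) -> float:
    """Toy stand-in for H(s): availability minus error rate (an assumption)."""
    return state["availability"] - state["error_rate"]


def apply_with_tnr(state: State, action: Callable[[State], None]) -> tuple[State, bool]:
    """Run `action` on a copy of `state`; commit only if H does not decrease."""
    snapshot = copy.deepcopy(state)    # s_before, kept so the action can be undone
    candidate = copy.deepcopy(state)
    action(candidate)                  # candidate is now s_after
    if health(candidate) >= health(snapshot):
        return candidate, True         # commit: H(s_after) >= H(s_before)
    return snapshot, False             # abort: return the pre-action state


def restart_pod(s: State) -> None:     # hypothetical action that helps
    s["error_rate"] = 0.01


def bad_config_push(s: State) -> None:  # hypothetical action that hurts
    s["availability"] = 0.5


state = {"availability": 0.99, "error_rate": 0.20}
state, ok1 = apply_with_tnr(state, restart_pod)      # improves health, committed
state, ok2 = apply_with_tnr(state, bad_config_push)  # lowers health, discarded
print(ok1, ok2, state)
```

Working on a copy makes the rollback trivial here; on a live cluster the undo step has to be an explicit compensating action, which is why the before and after health measurements matter.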
This is enforced through:
The four-stage incident lifecycle automated by LLM agents:
<latex>\text{MTTR}_{agent} = t_{detect} + t_{diagnose} + t_{mitigate} \ll \text{MTTR}_{human}</latex>
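To make the decomposition concrete, the sketch below derives the three stage durations from incident timestamps and sums them. The class, its field names, and the timestamps are invented for illustration and are not measurements from STRATUS.

```python
from dataclasses import dataclass


@dataclass
class IncidentTimeline:
    """Timestamps in seconds since the failure began (hypothetical fields)."""
    detected_at: float
    diagnosed_at: float
    mitigated_at: float

    def stage_durations(self) -> dict[str, float]:
        # t_detect, t_diagnose, t_mitigate from the MTTR formula
        return {
            "detect": self.detected_at,
            "diagnose": self.diagnosed_at - self.detected_at,
            "mitigate": self.mitigated_at - self.diagnosed_at,
        }

    def mttr(self) -> float:
        # MTTR is the sum of the three stage durations
        return sum(self.stage_durations().values())


# Made-up example values, chosen only to show the arithmetic.
timeline = IncidentTimeline(detected_at=30.0, diagnosed_at=150.0, mitigated_at=240.0)
print(timeline.stage_durations(), timeline.mttr())
```

Tracking the stages separately shows which one dominates recovery time, which a single end-to-end number hides.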
```python
from dataclasses import dataclass
from enum import Enum


class IncidentState(Enum):
    DETECTING = "detecting"
    DIAGNOSING = "diagnosing"
    MITIGATING = "mitigating"
    VALIDATING = "validating"
    RESOLVED = "resolved"
    ROLLED_BACK = "rolled_back"


@dataclass
class SystemHealth:
    error_rate: float
    latency_p99: float
    availability: float

    def score(self) -> float:
        # Availability raises the score; errors and tail latency lower it.
        return (self.availability * 0.5
                - self.error_rate * 0.3
                - self.latency_p99 * 0.2)


class StratusAgent:
    def __init__(self, llm, k8s_client, telemetry):
        self.llm = llm
        self.k8s = k8s_client
        self.telemetry = telemetry
        self.state = IncidentState.DETECTING

    def measure_health(self) -> SystemHealth:
        # Not defined in the original listing. This assumes the telemetry
        # client exposes a hypothetical current_metrics() returning a dict
        # with these three keys.
        metrics = self.telemetry.current_metrics()
        return SystemHealth(
            error_rate=metrics["error_rate"],
            latency_p99=metrics["latency_p99"],
            availability=metrics["availability"],
        )

    def detect_failure(self, alerts: list[dict]) -> dict:
        self.state = IncidentState.DETECTING
        context = self.telemetry.gather_context(alerts)
        diagnosis = self.llm.generate(
            f"Analyze these alerts and telemetry:\n{context}\n"
            f"Identify the likely failure and affected services."
        )
        return {"alerts": alerts, "diagnosis": diagnosis}

    def mitigate_with_tnr(self, diagnosis: dict) -> bool:
        self.state = IncidentState.MITIGATING
        health_before = self.measure_health()  # H(s_before)
        action = self.llm.generate(
            f"Generate kubectl mitigation for:\n{diagnosis}"
        )
        self.k8s.execute(action)
        health_after = self.measure_health()   # H(s_after)
        # TNR check: undo the action if it left the system less healthy.
        if health_after.score() < health_before.score():
            self.k8s.rollback(action)
            self.state = IncidentState.ROLLED_BACK
            return False
        self.state = IncidentState.RESOLVED
        return True
```
| Metric | STRATUS | State-of-the-Art Baselines |
|---|---|---|
| Mitigation success rate | At least 1.5x higher | Baseline |
| Benchmark coverage | AIOpsLab + ITBench | Varies |
| Safety guarantee | TNR-enforced | None |
| Model flexibility | Multiple LLM backends | Often single model |
| Rollback capability | Automatic via Undo Agent | Manual |