Causal Reasoning Agents: Causal-Copilot

Causal analysis – determining what causes what – is one of the most important yet technically demanding tasks in data science. Causal-Copilot (2025) is an LLM-powered autonomous agent that automates the entire causal analysis pipeline: from data ingestion and causal discovery through identification, estimation, and interpretation, all driven by natural language interaction.

End-to-End Causal Analysis Pipeline

Causal-Copilot operates through a modular pipeline where each stage is orchestrated by the LLM agent:

1. User Interaction: The user uploads data and specifies causal questions in natural language. The system parses queries, incorporates domain knowledge, and supports interactive feedback at every stage.

2. Preprocessing: Automatic data cleaning, schema extraction, and diagnostic analysis including tests for linearity, stationarity, and heterogeneity across subpopulations.

3. Algorithm Selection: The LLM evaluates data characteristics and selects from 20+ algorithms, then configures hyperparameters. This replaces the traditional expert-driven process of manually choosing between methods.

4. Core Analysis: Executes the selected algorithms for causal discovery, causal inference, and auxiliary analyses.

5. Postprocessing: Bootstrap evaluation for robustness, LLM-guided graph refinement, and support for user revisions to the causal graph.

6. Report Generation: Produces visualizations, natural language interpretations, and LaTeX reports.
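
To make stage 3 concrete, the following is a minimal rule-based sketch of how data diagnostics could map to a discovery algorithm. It is a stand-in for illustration only: in Causal-Copilot the choice is made by the LLM, and the `Diagnostics` fields, the `choose_discovery_method` function, and the ordering of the rules are all assumptions, not part of the tool.

```python
from dataclasses import dataclass


@dataclass
class Diagnostics:
    """Hypothetical summary of what the preprocessing stage found."""
    linear: bool
    gaussian_noise: bool
    latent_confounders_suspected: bool
    is_time_series: bool


def choose_discovery_method(diag: Diagnostics) -> str:
    """Rule-based stand-in for the LLM's choice of discovery algorithm."""
    if diag.is_time_series:
        return "PCMCI"   # time-series method listed in the benchmarks
    if diag.latent_confounders_suspected:
        return "FCI"     # constraint-based, handles latent confounders
    if diag.linear and not diag.gaussian_noise:
        return "LiNGAM"  # identification relies on non-Gaussian noise
    return "PC"          # general-purpose constraint-based default


diag = Diagnostics(linear=True, gaussian_noise=False,
                   latent_confounders_suspected=False, is_time_series=False)
print(choose_discovery_method(diag))
```

A fixed rule table like this cannot weigh competing diagnostics or user-supplied domain knowledge, which is the gap the LLM-driven selection step is meant to fill.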

Supported Causal Methods

Causal-Copilot integrates methods across the full spectrum of causal analysis:

Causal Discovery (Graph Structure Learning):

Family	Methods
Constraint-based	PC, FCI (handles latent confounders)
Score-based	GES (Greedy Equivalence Search)
Optimization-based	NOTEARS (continuous optimization for DAGs)
Functional	LiNGAM family (non-Gaussian identification)

The NOTEARS optimization formulates DAG learning as a continuous problem:

$$\min_{W} \frac{1}{2n} \|X - XW\|_F^2 + \lambda \|W\|_1 \quad \text{s.t.} \quad h(W) = 0$$

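where $X$ is the $n \times d$ data matrix, $W$ is the $d \times d$ weighted adjacency matrix, $\lambda$ controls sparsity, and $h(W) = \text{tr}(e^{W \circ W}) - d$ is the acyclicity constraint. Here $\circ$ denotes the elementwise (Hadamard) product and $d$ is the number of variables; $h(W) = 0$ holds exactly when $W$ encodes a DAG.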
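
The acyclicity function can be checked numerically on small graphs. The sketch below is a plain-Python illustration, not code from Causal-Copilot or from a NOTEARS library. It evaluates $h(W)$ with a truncated power series for the matrix exponential. The helper names `matmul` and `acyclicity`, and the 30-term truncation, are assumptions made for this example.

```python
import math


def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def acyclicity(w, terms=30):
    """h(W) = tr(exp(W * W)) - d, where W * W is the elementwise square.

    exp is approximated by the first `terms` terms of its power series,
    which is adequate for the small matrices used here.
    """
    d = len(w)
    a = [[x * x for x in row] for row in w]  # elementwise (Hadamard) square
    power = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    trace_exp = 0.0
    for k in range(terms):
        # add tr(A^k) / k!, then advance power from A^k to A^(k+1)
        trace_exp += sum(power[i][i] for i in range(d)) / math.factorial(k)
        power = matmul(power, a)
    return trace_exp - d


# w[i][j] != 0 means an edge i -> j
dag = [[0.0, 1.5, 0.0],
       [0.0, 0.0, -2.0],
       [0.0, 0.0, 0.0]]      # 0 -> 1 -> 2, no cycle
cyclic = [[0.0, 1.0],
          [1.0, 0.0]]        # 0 -> 1 -> 0

print(acyclicity(dag))       # zero for an acyclic graph
print(acyclicity(cyclic))    # strictly positive when a cycle exists
```

Because $h$ is differentiable in $W$, the constraint can be enforced with standard continuous optimizers rather than a combinatorial search over graph structures.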

Causal Inference (Effect Estimation):

Double Machine Learning (DML)
Doubly Robust estimation
Instrumental Variables (IV, DRIV)
Propensity Score Matching (PSM)
Counterfactual estimation
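
As an illustration of doubly robust estimation, the sketch below computes the augmented inverse-propensity-weighted (AIPW) estimate of an average treatment effect on a tiny hand-built dataset with one binary confounder. The data, the function names, and the use of per-stratum means as the propensity and outcome models are assumptions chosen to keep the example self-contained; Causal-Copilot's own estimators are not shown in the source.

```python
from collections import defaultdict
from statistics import mean

# Toy rows of (confounder z, treatment t, outcome y), built so that
# y = 2*t + 3*z and treatment is more likely when z = 1.
rows = [
    (0, 1, 2.0), (0, 0, 0.0), (0, 0, 0.0), (0, 0, 0.0),
    (1, 1, 5.0), (1, 1, 5.0), (1, 1, 5.0), (1, 0, 3.0),
]


def naive_difference(rows):
    """Difference in mean outcome between treated and control, ignoring z."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)


def aipw_ate(rows):
    """AIPW estimate of the average treatment effect.

    Propensity and outcome models are per-stratum means of z, so every
    stratum must contain both treated and control rows (overlap).
    """
    by_z = defaultdict(list)
    for z, t, y in rows:
        by_z[z].append((t, y))

    propensity, mu1, mu0 = {}, {}, {}
    for z, group in by_z.items():
        propensity[z] = mean(t for t, _ in group)         # P(t=1 | z)
        mu1[z] = mean(y for t, y in group if t == 1)      # E[y | t=1, z]
        mu0[z] = mean(y for t, y in group if t == 0)      # E[y | t=0, z]

    scores = []
    for z, t, y in rows:
        e = propensity[z]
        scores.append(
            mu1[z] - mu0[z]                     # outcome-model contrast
            + t * (y - mu1[z]) / e              # correction for treated rows
            - (1 - t) * (y - mu0[z]) / (1 - e)  # correction for control rows
        )
    return mean(scores)


print(naive_difference(rows))  # confounded by z
print(aipw_ate(rows))          # adjusts for z
```

On this data the naive comparison overstates the effect because treated units are concentrated in the high-outcome stratum, while the AIPW estimate recovers the effect of 2 built into the toy data.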

Auxiliary Analysis:

SHAP feature importance
Anomaly attribution

Code Example: Causal Analysis Agent

class CausalCopilot:
    """Illustrative sketch of the agent's control flow.

    Helper methods that are called but not defined here (preprocess,
    estimate, bootstrap_evaluate, generate_report, and others) are
    assumed to exist elsewhere.
    """

    def __init__(self, llm, method_registry):
        self.llm = llm
        self.methods = method_registry

    def analyze(self, data, question):
        # Stages 2-3: diagnose the data, then let the LLM pick methods
        diagnostics = self.preprocess(data)
        selected_methods = self.select_algorithms(diagnostics, question)
        # Stage 4: learn the graph, refine it, then estimate effects on it
        causal_graph = self.discover(data, selected_methods["discovery"])
        causal_graph = self.refine_graph(causal_graph, question)
        effects = self.estimate(
            data, causal_graph,
            selected_methods["inference"],
            treatment=question.treatment,
            outcome=question.outcome
        )
        # Stages 5-6: check robustness and write the report
        robust = self.bootstrap_evaluate(effects, n_iterations=500)
        report = self.generate_report(causal_graph, robust, question)
        return report

    def select_algorithms(self, diagnostics, question):
        # The LLM reasons over the diagnostics and names one method per task
        prompt = self.build_selection_prompt(diagnostics, question)
        selection = self.llm.reason(prompt)
        return {
            "discovery": self.methods.get(selection.discovery_method),
            "inference": self.methods.get(selection.inference_method)
        }

    def discover(self, data, method):
        raw_graph = method.fit(data)
        # Refit on 100 random 80% subsamples (data.sample draws without
        # replacement by default) to see which edges recur
        bootstrap_graphs = [
            method.fit(data.sample(frac=0.8))
            for _ in range(100)
        ]
        edge_confidence = self.compute_edge_stability(bootstrap_graphs)
        return self.prune_unstable_edges(raw_graph, edge_confidence)

    def refine_graph(self, graph, question):
        # Ask the LLM to review the graph against domain knowledge
        refinement = self.llm.evaluate_graph(graph, question.domain)
        return graph.apply_refinements(refinement)
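
The example above calls `compute_edge_stability` and `prune_unstable_edges` without defining them. One plausible reading, sketched here as standalone functions, is to score each edge by the fraction of resampled graphs that contain it and to keep only edges above a cutoff. Representing a graph as a set of `(source, target)` tuples and using a 0.5 threshold are assumptions made for this sketch.

```python
from collections import Counter


def compute_edge_stability(bootstrap_graphs):
    """Return, for each directed edge, the fraction of graphs containing it."""
    counts = Counter(
        edge for graph in bootstrap_graphs for edge in set(graph)
    )
    n_graphs = len(bootstrap_graphs)
    return {edge: count / n_graphs for edge, count in counts.items()}


def prune_unstable_edges(raw_graph, edge_confidence, threshold=0.5):
    """Keep edges of raw_graph whose stability is at least the threshold."""
    return {
        edge for edge in raw_graph
        if edge_confidence.get(edge, 0.0) >= threshold
    }


# Graph fitted on the full data, plus four graphs fitted on resamples
raw_graph = {("A", "B"), ("B", "C"), ("A", "C")}
resampled = [
    {("A", "B"), ("B", "C")},
    {("A", "B"), ("B", "C"), ("A", "C")},
    {("A", "B")},
    {("A", "B"), ("B", "C")},
]

confidence = compute_edge_stability(resampled)
print(sorted(prune_unstable_edges(raw_graph, confidence)))
```

In this toy run the edge from A to C appears in only one of four resampled graphs, so it is dropped while the two recurring edges survive.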

Benchmark Results

In the reported benchmarks, Causal-Copilot scores higher than every individual baseline algorithm in each scenario listed below:

Tabular Data (F1 Score):

Scenario	Causal-Copilot	PC	FCI	GES
Dense Graph (p=0.5)	0.65	0.41	0.44	0.40
Large Scale (p=50)	0.94	0.70	0.79	N/A
Non-Gaussian Noise	0.97	0.84	0.85	0.86
Heterogeneous Domains	0.77	0.51	0.62	0.40

Time Series Data (F1 Score):

Scenario	Causal-Copilot	PCMCI	DYNOTEARS
Small (p=5, lag=3)	0.98	0.92	0.97
Large Lag (lag=20)	0.85	0.84	0.77

The agent excels especially in challenging scenarios (extreme scale, non-Gaussian noise, heterogeneous domains) where algorithm selection is critical.
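
For readers unfamiliar with how graph recovery is scored, the sketch below computes an F1 score between a predicted and a true edge set. The source does not state whether the benchmark counts directed edges or only the undirected skeleton, so treating each directed `(source, target)` pair as one item is an assumption here, as are the function name and the example graphs.

```python
def edge_f1(predicted, truth):
    """F1 score of predicted directed edges against the true edge set."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0  # also covers empty predicted or empty truth sets
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


truth = {("A", "B"), ("B", "C"), ("A", "C")}
predicted = {("A", "B"), ("B", "C"), ("C", "D")}

# Two of three predicted edges are correct and two of three true edges
# are found, so precision and recall are both 2/3.
print(edge_f1(predicted, truth))
```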

Pipeline Diagram

flowchart TD
    A[User: Data + Natural Language Question] --> B[Preprocessing Agent]
    B --> C[Data Cleaning & Diagnostics]
    C --> D[Algorithm Selection Agent]
    D --> E[Method Configuration]
    E --> F{Analysis Type}
    F --> G[Causal Discovery]
    F --> H[Causal Inference]
    F --> I[Auxiliary Analysis]
    G --> J[Graph: PC / FCI / GES / NOTEARS / LiNGAM]
    H --> K[Effects: DML / DR / IV / PSM]
    I --> L[SHAP / Anomaly Attribution]
    J --> M[Postprocessing Agent]
    K --> M
    L --> M
    M --> N[Bootstrap Evaluation]
    N --> O[LLM Graph Refinement]
    O --> P[Report Generation]
    P --> Q[Visualizations + LaTeX Report]

Key Capabilities

Natural language interface: No statistical expertise required – users describe causal questions in plain English
Automatic method selection: The LLM chooses appropriate algorithms based on data characteristics, eliminating the need for manual algorithm comparison
Scalability: Handles datasets with up to 500 variables and complex time-series with long lags
Robustness: Bootstrap evaluation measures how stable the results are under resampling, and LLM-guided graph refinement revises the discovered graph
Interpretability: Generated reports explain findings in accessible language with supporting visualizations
