Explainable AI (XAI) refers to the processes, methods, and techniques that make the outputs and internal decisions of artificial intelligence systems understandable and interpretable by humans. XAI directly addresses the “black box” problem — the opacity of modern machine learning models, particularly deep neural networks, whose internal reasoning is not directly observable.
A model is considered explainable when a human stakeholder can understand why it produced a given output, not just what the output was. This distinction is critical across a growing range of applications.
* **Trust and adoption.** Practitioners and end-users are less likely to deploy or rely on systems they cannot interrogate. Explainability builds warranted confidence and allows operators to identify failure modes before deployment.
* **Bias detection and fairness.** XAI tools can reveal that a model has learned spurious or discriminatory correlations. In 2021 the Dutch financial regulator fined an unnamed bank approximately 47 million euros after an audit, aided by model explanation techniques, revealed that an automated credit-scoring system systematically disadvantaged applicants from certain postal codes.1)
* **Regulatory compliance.** Legislation in multiple jurisdictions now mandates or incentivises explainability for high-stakes automated decisions (see Regulatory Drivers below).
* **Accountability and auditability.** When AI systems make consequential decisions — in medicine, criminal justice, lending, or hiring — stakeholders require a trail of reasoning that can be reviewed, challenged, and corrected.
* **Debugging and model improvement.** Explanations expose which features drive predictions, enabling engineers to detect data leakage, overfitting, or distribution shift that aggregate metrics such as accuracy cannot reveal.
LIME, introduced by Ribeiro et al. in 2016, approximates any black-box model locally around a single prediction using a simpler, interpretable surrogate — typically a sparse linear model.2) The technique perturbs the input, observes the model's response, and fits the surrogate to that local neighbourhood. Because LIME operates on model inputs and outputs only, it is model-agnostic and works with images, text, and tabular data.
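The perturb-and-fit loop can be sketched in a few lines. This is a minimal illustration of LIME's core idea, not the `lime` library API: the black-box function, perturbation scale, and kernel width below are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical black-box model: nonlinear in two tabular features.
def black_box(X):
    return np.sin(X[:, 0]) + X[:, 1] ** 2

x0 = np.array([1.0, 0.5])  # the instance to explain

# 1. Perturb the input in a neighbourhood of x0.
Z = x0 + rng.normal(scale=0.1, size=(500, 2))
y = black_box(Z)

# 2. Weight perturbed samples by proximity to x0 (Gaussian kernel).
w = np.exp(-np.sum((Z - x0) ** 2, axis=1) / 0.1 ** 2)

# 3. Fit a weighted linear surrogate to that local neighbourhood.
A = np.hstack([Z, np.ones((len(Z), 1))])  # features plus intercept
sw = np.sqrt(w)
coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)

# The surrogate's slopes approximate the model's local behaviour at x0:
# d/dx sin(x) = cos(1) ~ 0.54 for feature 0, d/dx x^2 = 2*0.5 = 1.0 for feature 1.
print("local importances:", coef[:2])
```

The surrogate is only trusted near `x0`; a different instance gets a different local explanation, which is exactly the "local fidelity" trade-off LIME makes.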
SHAP, introduced by Lundberg and Lee in 2017, assigns each feature a contribution value derived from Shapley values — a solution concept from cooperative game theory that guarantees a set of desirable mathematical properties: local accuracy, missingness, and consistency.3) SHAP values are the unique additive feature attributions satisfying these axioms. Efficient implementations such as TreeSHAP allow exact computation for tree-based models; KernelSHAP provides a model-agnostic approximation. SHAP has become the de facto standard for feature attribution in industry.
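For small feature counts, exact Shapley values can be computed directly from the game-theoretic definition by enumerating coalitions. The toy model and zero baseline below are illustrative assumptions; real SHAP implementations use optimisations such as TreeSHAP rather than this brute-force enumeration, which is exponential in the number of features.

```python
from itertools import combinations
from math import factorial

# Toy model over three features (illustrative, not the shap library).
def model(x):
    return 2 * x[0] + x[1] * x[2]

baseline = [0.0, 0.0, 0.0]  # "missing" features take baseline values
x = [1.0, 2.0, 3.0]
n = len(x)

def value(S):
    """Model output when only features in S take their actual values."""
    z = [x[i] if i in S else baseline[i] for i in range(n)]
    return model(z)

def shapley(i):
    """Weighted average of feature i's marginal contributions."""
    others = [j for j in range(n) if j != i]
    total = 0.0
    for r in range(n):
        for S in combinations(others, r):
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            total += weight * (value(set(S) | {i}) - value(set(S)))
    return total

phi = [shapley(i) for i in range(n)]
# Local accuracy: the attributions sum to model(x) - model(baseline).
print(phi, sum(phi))
```

Note how the interaction term `x[1] * x[2]` is split evenly between the two interacting features, a direct consequence of the symmetry axiom.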
Transformer-based models expose attention weights, which can be visualised as heatmaps to indicate which tokens a model “attended to” when producing an output. Tools such as BertViz4) render these weights interactively. However, the field has debated whether attention constitutes a genuine explanation: Jain and Wallace (2019) demonstrated that attention distributions can often be permuted without changing predictions, challenging the assumption that high attention implies causal importance.5) Subsequent work by Wiegreffe and Pinter (2019) partially rebutted these claims, leaving the debate open.
Proposed by Kim et al. (2018), TCAV probes the internal representations of a neural network for human-defined concepts — e.g., “stripes” or “medical instrument” — rather than relying on input features.6) A linear classifier is trained to separate activations of a target layer from those associated with the concept, yielding a Concept Activation Vector (CAV). The method produces a score measuring how sensitive a class prediction is to that concept.
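The scoring step can be sketched on synthetic activations, with two simplifications loudly noted: the CAV is taken as a difference of class means rather than the normal vector of a trained linear classifier, and the class head is assumed linear so per-example gradients are trivial to form. All names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # width of the probed layer (illustrative)

# Synthetic layer activations: concept examples are shifted along a
# hidden "concept direction"; counterexamples are plain noise.
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)
acts_concept = rng.normal(size=(100, d)) + 2.0 * concept_dir
acts_random = rng.normal(size=(100, d))

# CAV: difference of class means, a stand-in for the linear
# classifier's normal vector used in the original method.
cav = acts_concept.mean(axis=0) - acts_random.mean(axis=0)
cav /= np.linalg.norm(cav)

# Assume a linear class head, so the gradient of the class logit with
# respect to the layer activations is roughly the head's weight vector.
head = concept_dir + 0.1 * rng.normal(size=d)
grads = head + 0.05 * rng.normal(size=(200, d))  # per-example gradients

# TCAV score: fraction of examples whose class logit increases when
# activations are moved along the concept direction.
tcav_score = float(np.mean(grads @ cav > 0))
print("TCAV score:", tcav_score)
```

A score near 1.0 indicates the class prediction is consistently sensitive to the concept; scores near 0.5 for random concepts serve as the method's statistical control.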
Koh et al. (2020) introduced Concept Bottleneck Models (CBMs), an architecture that forces the network to first predict a set of human-interpretable concepts and then use those concepts — rather than raw features — to produce the final label.7) This design allows interventions at test time: a clinician can override a predicted concept to explore counterfactuals. The trade-off is that concept quality bounds model accuracy.
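The test-time intervention idea can be illustrated with a toy two-stage model. Both stages are fixed linear maps here purely for illustration; in a real CBM they are learned networks, and every weight below is made up.

```python
import numpy as np

# Toy concept bottleneck: inputs x -> concepts c -> label y.
W_concept = np.array([[1.0, 0.0],    # concept 0 reads feature 0
                      [0.0, 1.0],    # concept 1 reads feature 1
                      [0.5, -0.5]])  # concept 2 mixes both
w_label = np.array([2.0, -1.0, 0.5])  # label depends only on concepts

x = np.array([0.8, 0.2])
c_hat = W_concept @ x    # predicted, human-interpretable concepts
y_hat = w_label @ c_hat  # final prediction built from concepts

# Test-time intervention: an expert overrides concept 1 with the true
# value, and the label prediction updates through the bottleneck.
c_fixed = c_hat.copy()
c_fixed[1] = 1.0
y_fixed = w_label @ c_fixed
print(y_hat, "->", y_fixed)
```

Because the label head sees only the concepts, the intervention propagates cleanly; this is what makes counterfactual exploration possible, and also why poor concepts cap end-task accuracy.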
Integrated Gradients (IG), introduced by Sundararajan et al. (2017), attribute a prediction to each input feature by integrating the model's gradients along a straight-line path from a baseline (e.g., a black image or zero embedding) to the actual input.8) IG satisfies two axioms — Sensitivity and Implementation Invariance — that many simpler gradient methods violate. It is widely used for attributions in computer vision and NLP without requiring model modification.
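Because IG is a path integral of gradients, it can be approximated with a simple Riemann sum. The sketch below uses an analytic toy function with a hand-written gradient; a real implementation would obtain gradients via autodiff. The completeness property — attributions summing to f(x) − f(baseline) — falls out directly.

```python
# Integrated Gradients on a toy differentiable model (pure Python).
def f(x):
    return x[0] ** 2 + x[0] * x[1]

def grad_f(x):
    return [2 * x[0] + x[1], x[0]]

def integrated_gradients(x, baseline, steps=1000):
    """Midpoint-rule approximation of the IG path integral."""
    attrs = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # position along the straight-line path
        point = [baseline[i] + alpha * (x[i] - baseline[i]) for i in range(len(x))]
        g = grad_f(point)
        for i in range(len(x)):
            attrs[i] += g[i] * (x[i] - baseline[i]) / steps
    return attrs

x, baseline = [2.0, 3.0], [0.0, 0.0]
attrs = integrated_gradients(x, baseline)
# Completeness check: attributions sum to f(x) - f(baseline) = 10.
print(attrs, sum(attrs))
```

The baseline choice is not innocent: a black image and a blurred image can yield very different attributions for the same prediction, which is a known practical caveat of the method.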
Governments and standards bodies have moved to codify explainability requirements. The table below summarises key instruments as of early 2026.
| Instrument | Jurisdiction | Key Requirement | Penalty |
|---|---|---|---|
| EU AI Act (entered into force Aug 2024; most obligations apply from Aug 2026) | European Union | Article 50 mandates transparency obligations for certain AI systems; high-risk systems must provide logs and explanations sufficient for human oversight | Up to €35 million or 7 % of global turnover |
| GDPR Article 22 | European Union | Individuals subject to solely automated decisions have the right to obtain “meaningful information about the logic involved” | Up to €20 million or 4 % of global turnover |
| US Executive Order on AI (Oct 2023) | United States | Requires federal agencies to assess explainability of AI used in critical decisions; NIST AI RMF guidance referenced | Varies by agency |
| OSFI Guideline B-15 | Canada | Financial institutions must be able to explain AI-driven decisions to affected individuals | Supervisory action |
| MAS FEAT Principles | Singapore | Fairness, Ethics, Accountability, Transparency principles require explainability for financial AI | Supervisory action |
The EU AI Act is the most consequential current instrument: its definition of high-risk AI systems (Annex III) covers biometric identification, critical infrastructure, education, employment, essential services, law enforcement, migration, and administration of justice — each area where XAI is now a compliance requirement rather than a best practice.
Large language models (LLMs) and autonomous agents present qualitatively harder explainability challenges than classical ML models: billions of parameters, open-ended generative outputs, and multi-step tool use resist the per-feature attributions that serve tabular models well. Current approaches include:
* **Probing classifiers.** Linear classifiers trained on internal activations test whether specific concepts (syntax, world facts, sentiment) are linearly decodable from a given layer. Probing reveals where knowledge is encoded but does not fully explain how it is used.
* **Chain-of-Thought (CoT) as explanation.** Prompting LLMs to produce step-by-step reasoning (Wei et al., 2022) yields human-readable rationales. Whether these faithfully reflect the model's underlying computation remains an open research question — models can produce plausible CoT that diverges from their actual inference path.
* **MExGen (IBM, ACL 2025).** A framework for Modular Explanation Generation, presented by IBM Research at ACL 2025, that decomposes LLM explanations into verifiable sub-claims and grounds each claim against retrieved evidence, combining XAI with retrieval-augmented generation to improve faithfulness.
* **Mechanistic interpretability.** Research groups including Anthropic's interpretability team pursue circuit-level analysis — identifying the minimal sub-graph of attention heads and MLP layers responsible for a specific behaviour. Anthropic has published findings on induction heads, indirect object identification circuits, and superposition, treating the model as a system to be reverse-engineered rather than merely probed from outside.
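The probing approach above reduces to fitting a linear classifier on (activation, label) pairs. The sketch below plants a linearly decodable concept in synthetic "hidden states" and recovers it with a least-squares probe; the dimensions, signal strength, and data are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 400, 32  # examples and hidden width (illustrative)

# Synthetic hidden states: a binary property is encoded linearly along
# one direction, plus isotropic noise.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d)) + np.outer(2.0 * labels - 1.0, 2.0 * direction)

# Linear probe: least-squares fit on {-1, +1} targets, then threshold.
A = np.hstack([acts, np.ones((n, 1))])  # bias column
train, test = slice(0, 300), slice(300, None)
w, *_ = np.linalg.lstsq(A[train], 2.0 * labels[train] - 1.0, rcond=None)
preds = (A[test] @ w > 0).astype(int)
accuracy = float(np.mean(preds == labels[test]))
print("probe accuracy:", accuracy)  # well above chance: linearly decodable
```

High probe accuracy shows the concept is *present* at this layer, but — as the text notes — not that the downstream computation actually *uses* it, which is why probing results are usually paired with causal interventions.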
| Tool | Maintainer | Techniques | Model Types | License |
|---|---|---|---|---|
| SHAP | Community (Lundberg) | Shapley values, TreeSHAP, KernelSHAP, DeepSHAP | Tree, deep, linear, model-agnostic | MIT |
| LIME | Community (Ribeiro) | Local surrogate models | Model-agnostic (tabular, text, image) | BSD |
| InterpretML | Microsoft | EBM, SHAP, LIME, LIME-S | Tabular focus | MIT |
| AIX360 | IBM Research | LIME, SHAP, BRCG, RBFN, TED, ProfWeight | Broad | Apache 2.0 |
| Captum | Meta AI | IG, GradCAM, SHAP, TCAV, LayerConductance | PyTorch models | BSD |
| BertViz | Jesse Vig | Attention head visualisation | Transformers (HuggingFace) | Apache 2.0 |
| What-If Tool | Google PAIR | Counterfactuals, partial dependence, fairness slices | TensorFlow, Scikit-learn | Apache 2.0 |