Explainable AI (XAI) refers to the processes, methods, and techniques that make the outputs and internal decisions of artificial intelligence systems understandable and interpretable by humans. XAI directly addresses the “black box” problem — the opacity of modern machine learning models, particularly deep neural networks, whose internal reasoning is not directly observable.
A model is considered explainable when a human stakeholder can understand why it produced a given output, not just what the output was. This distinction is critical across a growing range of applications.
* **Trust and adoption.** Practitioners and end-users are less likely to deploy or rely on systems they cannot interrogate. Explainability builds warranted confidence and allows operators to identify failure modes before deployment.
* **Bias detection and fairness.** XAI tools can reveal that a model has learned spurious or discriminatory correlations. In 2021 the Dutch financial regulator fined an unnamed bank approximately 47 million euros after an audit, aided by model explanation techniques, revealed that an automated credit-scoring system systematically disadvantaged applicants from certain postal codes.1)
* **Regulatory compliance.** Legislation in multiple jurisdictions now mandates or incentivises explainability for high-stakes automated decisions (see Regulatory Drivers below).
* **Accountability and auditability.** When AI systems make consequential decisions — in medicine, criminal justice, lending, or hiring — stakeholders require a trail of reasoning that can be reviewed, challenged, and corrected.
* **Debugging and model improvement.** Explanations expose which features drive predictions, enabling engineers to detect data leakage, overfitting, or distribution shift that aggregate metrics such as accuracy cannot reveal.
LIME, introduced by Ribeiro et al. in 2016, approximates any black-box model locally around a single prediction using a simpler, interpretable surrogate — typically a sparse linear model.2) The technique perturbs the input, observes the model's response, and fits the surrogate to that local neighbourhood. Because LIME operates on model inputs and outputs only, it is model-agnostic and works with images, text, and tabular data.
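The perturb-and-fit loop can be sketched in a few lines. This is a minimal illustration of LIME's core idea, not the `lime` library API: the black-box function, perturbation scale, and kernel width below are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical black-box model: nonlinear in two tabular features.
def black_box(X):
    return np.sin(X[:, 0]) + X[:, 1] ** 2

x0 = np.array([1.0, 0.5])  # the instance to explain

# 1. Perturb the input in a neighbourhood of x0.
Z = x0 + rng.normal(scale=0.1, size=(500, 2))
y = black_box(Z)

# 2. Weight perturbed samples by proximity to x0 (Gaussian kernel).
w = np.exp(-np.sum((Z - x0) ** 2, axis=1) / 0.1 ** 2)

# 3. Fit a weighted linear surrogate to that local neighbourhood.
A = np.hstack([Z, np.ones((len(Z), 1))])  # features plus intercept
sw = np.sqrt(w)
coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)

# The surrogate's slopes approximate the model's local behaviour at x0:
# d/dx sin(x) = cos(1) ~ 0.54 for feature 0, d/dx x^2 = 2*0.5 = 1.0 for feature 1.
print("local importances:", coef[:2])
```

The surrogate is only trusted near `x0`; a different instance gets a different local explanation, which is exactly the "local fidelity" trade-off LIME makes.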
SHAP, introduced by Lundberg and Lee in 2017, assigns each feature a contribution value derived from Shapley values — a solution concept from cooperative game theory that guarantees a set of desirable mathematical properties: local accuracy, missingness, and consistency.3) SHAP values are the unique additive feature attributions satisfying these axioms. Efficient implementations such as TreeSHAP allow exact computation for tree-based models; KernelSHAP provides a model-agnostic approximation. SHAP has become the de facto standard for feature attribution in industry.
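For small feature counts, exact Shapley values can be computed directly from the game-theoretic definition by enumerating coalitions. The toy model and zero baseline below are illustrative assumptions; real SHAP implementations use optimisations such as TreeSHAP rather than this brute-force enumeration, which is exponential in the number of features.

```python
from itertools import combinations
from math import factorial

# Toy model over three features (illustrative, not the shap library).
def model(x):
    return 2 * x[0] + x[1] * x[2]

baseline = [0.0, 0.0, 0.0]  # "missing" features take baseline values
x = [1.0, 2.0, 3.0]
n = len(x)

def value(S):
    """Model output when only features in S take their actual values."""
    z = [x[i] if i in S else baseline[i] for i in range(n)]
    return model(z)

def shapley(i):
    """Weighted average of feature i's marginal contributions."""
    others = [j for j in range(n) if j != i]
    total = 0.0
    for r in range(n):
        for S in combinations(others, r):
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            total += weight * (value(set(S) | {i}) - value(set(S)))
    return total

phi = [shapley(i) for i in range(n)]
# Local accuracy: the attributions sum to model(x) - model(baseline).
print(phi, sum(phi))
```

Note how the interaction term `x[1] * x[2]` is split evenly between the two interacting features, a direct consequence of the symmetry axiom.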
Transformer-based models expose attention weights, which can be visualised as heatmaps to indicate which tokens a model “attended to” when producing an output. Tools such as BertViz4) render these weights interactively. However, the field has debated whether attention constitutes a genuine explanation: Jain and Wallace (2019) demonstrated that attention distributions can often be permuted without changing predictions, challenging the assumption that high attention implies causal importance.5) Subsequent work by Wiegreffe and Pinter (2019) partially rebutted these claims, leaving the debate open.
Proposed by Kim et al. (2018), TCAV probes the internal representations of a neural network for human-defined concepts — e.g., “stripes” or “medical instrument” — rather than relying on input features.6) A linear classifier is trained to separate activations of a target layer from those associated with the concept, yielding a Concept Activation Vector (CAV). The method produces a score measuring how sensitive a class prediction is to that concept.
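The scoring step can be sketched on synthetic activations, with two simplifications loudly noted: the CAV is taken as a difference of class means rather than the normal vector of a trained linear classifier, and the class head is assumed linear so per-example gradients are trivial to form. All names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # width of the probed layer (illustrative)

# Synthetic layer activations: concept examples are shifted along a
# hidden "concept direction"; counterexamples are plain noise.
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)
acts_concept = rng.normal(size=(100, d)) + 2.0 * concept_dir
acts_random = rng.normal(size=(100, d))

# CAV: difference of class means, a stand-in for the linear
# classifier's normal vector used in the original method.
cav = acts_concept.mean(axis=0) - acts_random.mean(axis=0)
cav /= np.linalg.norm(cav)

# Assume a linear class head, so the gradient of the class logit with
# respect to the layer activations is roughly the head's weight vector.
head = concept_dir + 0.1 * rng.normal(size=d)
grads = head + 0.05 * rng.normal(size=(200, d))  # per-example gradients

# TCAV score: fraction of examples whose class logit increases when
# activations are moved along the concept direction.
tcav_score = float(np.mean(grads @ cav > 0))
print("TCAV score:", tcav_score)
```

A score near 1.0 indicates the class prediction is consistently sensitive to the concept; scores near 0.5 for random concepts serve as the method's statistical control.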
Koh et al. (2020) introduced Concept Bottleneck Models (CBMs), an architecture that forces the network to first predict a set of human-interpretable concepts and then use those concepts — rather than raw features — to produce the final label.7) This design allows interventions at test time: a clinician can override a predicted concept to explore counterfactuals. The trade-off is that concept quality bounds model accuracy.
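The test-time intervention idea can be illustrated with a toy two-stage model. Both stages are fixed linear maps here purely for illustration; in a real CBM they are learned networks, and every weight below is made up.

```python
import numpy as np

# Toy concept bottleneck: inputs x -> concepts c -> label y.
W_concept = np.array([[1.0, 0.0],    # concept 0 reads feature 0
                      [0.0, 1.0],    # concept 1 reads feature 1
                      [0.5, -0.5]])  # concept 2 mixes both
w_label = np.array([2.0, -1.0, 0.5])  # label depends only on concepts

x = np.array([0.8, 0.2])
c_hat = W_concept @ x    # predicted, human-interpretable concepts
y_hat = w_label @ c_hat  # final prediction built from concepts

# Test-time intervention: an expert overrides concept 1 with the true
# value, and the label prediction updates through the bottleneck.
c_fixed = c_hat.copy()
c_fixed[1] = 1.0
y_fixed = w_label @ c_fixed
print(y_hat, "->", y_fixed)
```

Because the label head sees only the concepts, the intervention propagates cleanly; this is what makes counterfactual exploration possible, and also why poor concepts cap end-task accuracy.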
Integrated Gradients (IG), introduced by Sundararajan et al. (2017), attribute a prediction to each input feature by integrating the model's gradients along a straight-line path from a baseline (e.g., a black image or zero embedding) to the actual input.8) IG satisfies two axioms — Sensitivity and Implementation Invariance — that many simpler gradient methods violate. It is widely used for attributions in computer vision and NLP without requiring model modification.
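Because IG is a path integral of gradients, it can be approximated with a simple Riemann sum. The sketch below uses an analytic toy function with a hand-written gradient; a real implementation would obtain gradients via autodiff. The completeness property — attributions summing to f(x) − f(baseline) — falls out directly.

```python
# Integrated Gradients on a toy differentiable model (pure Python).
def f(x):
    return x[0] ** 2 + x[0] * x[1]

def grad_f(x):
    return [2 * x[0] + x[1], x[0]]

def integrated_gradients(x, baseline, steps=1000):
    """Midpoint-rule approximation of the IG path integral."""
    attrs = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # position along the straight-line path
        point = [baseline[i] + alpha * (x[i] - baseline[i]) for i in range(len(x))]
        g = grad_f(point)
        for i in range(len(x)):
            attrs[i] += g[i] * (x[i] - baseline[i]) / steps
    return attrs

x, baseline = [2.0, 3.0], [0.0, 0.0]
attrs = integrated_gradients(x, baseline)
# Completeness check: attributions sum to f(x) - f(baseline) = 10.
print(attrs, sum(attrs))
```

The baseline choice is not innocent: a black image and a blurred image can yield very different attributions for the same prediction, which is a known practical caveat of the method.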
Governments and standards bodies have moved to codify explainability requirements. The table below summarises key instruments as of early 2026.
| Instrument | Jurisdiction | Key Requirement | Penalty |
|---|---|---|---|
| EU AI Act (entered into force Aug 2024; most obligations apply from Aug 2026) | European Union | Article 50 mandates transparency obligations for certain AI systems; high-risk systems must provide logs and explanations sufficient for human oversight | Up to €35 million or 7 % of global turnover |
| GDPR Article 22 | European Union | Individuals subject to solely automated decisions have the right to obtain “meaningful information about the logic involved” | Up to €20 million or 4 % of global turnover |
| US Executive Order on AI (Oct 2023) | United States | Requires federal agencies to assess explainability of AI used in critical decisions; NIST AI RMF guidance referenced | Varies by agency |
| OSFI Guideline B-15 | Canada | Financial institutions must be able to explain AI-driven decisions to affected individuals | Supervisory action |
| MAS FEAT Principles | Singapore | Fairness, Ethics, Accountability, Transparency principles require explainability for financial AI | Supervisory action |
The EU AI Act is the most consequential current instrument: its definition of high-risk AI systems (Annex III) covers biometric identification, critical infrastructure, education, employment, essential services, law enforcement, migration, and administration of justice — each area where XAI is now a compliance requirement rather than a best practice.
Large language models (LLMs) and autonomous agents present qualitatively harder explainability challenges than classical ML models: billions of parameters, open-ended generative outputs, and multi-step tool use resist the per-feature attributions that serve tabular models well. Current approaches include:
* **Probing classifiers.** Linear classifiers trained on internal activations test whether specific concepts (syntax, world facts, sentiment) are linearly decodable from a given layer. Probing reveals where knowledge is encoded but does not fully explain how it is used.
* **Chain-of-Thought (CoT) as explanation.** Prompting LLMs to produce step-by-step reasoning (Wei et al., 2022) yields human-readable rationales. Whether these faithfully reflect the model's underlying computation remains an open research question — models can produce plausible CoT that diverges from their actual inference path.
* **MExGen (IBM, ACL 2025).** A framework for Modular Explanation Generation, presented by IBM Research at ACL 2025, that decomposes LLM explanations into verifiable sub-claims and grounds each claim against retrieved evidence, combining XAI with retrieval-augmented generation to improve faithfulness.
* **Mechanistic interpretability.** Research groups including Anthropic's interpretability team pursue circuit-level analysis — identifying the minimal sub-graph of attention heads and MLP layers responsible for a specific behaviour. Anthropic has published findings on induction heads, indirect object identification circuits, and superposition, treating the model as a system to be reverse-engineered rather than merely probed from outside.
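The probing approach above reduces to fitting a linear classifier on (activation, label) pairs. The sketch below plants a linearly decodable concept in synthetic "hidden states" and recovers it with a least-squares probe; the dimensions, signal strength, and data are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 400, 32  # examples and hidden width (illustrative)

# Synthetic hidden states: a binary property is encoded linearly along
# one direction, plus isotropic noise.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d)) + np.outer(2.0 * labels - 1.0, 2.0 * direction)

# Linear probe: least-squares fit on {-1, +1} targets, then threshold.
A = np.hstack([acts, np.ones((n, 1))])  # bias column
train, test = slice(0, 300), slice(300, None)
w, *_ = np.linalg.lstsq(A[train], 2.0 * labels[train] - 1.0, rcond=None)
preds = (A[test] @ w > 0).astype(int)
accuracy = float(np.mean(preds == labels[test]))
print("probe accuracy:", accuracy)  # well above chance: linearly decodable
```

High probe accuracy shows the concept is *present* at this layer, but — as the text notes — not that the downstream computation actually *uses* it, which is why probing results are usually paired with causal interventions.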
| Tool | Maintainer | Techniques | Model Types | License |
|---|---|---|---|---|
| SHAP | Community (Lundberg) | Shapley values, TreeSHAP, KernelSHAP, DeepSHAP | Tree, deep, linear, model-agnostic | MIT |
| LIME | Community (Ribeiro) | Local surrogate models | Model-agnostic (tabular, text, image) | BSD |
| InterpretML | Microsoft | EBM, SHAP, LIME, LIME-S | Tabular focus | MIT |
| AIX360 | IBM Research | LIME, SHAP, BRCG, RBFN, TED, ProfWeight | Broad | Apache 2.0 |
| Captum | Meta AI | IG, GradCAM, SHAP, TCAV, LayerConductance | PyTorch models | BSD |
| BertViz | Jesse Vig | Attention head visualisation | Transformers (HuggingFace) | Apache 2.0 |
| What-If Tool | Google PAIR | Counterfactuals, partial dependence, fairness slices | TensorFlow, Scikit-learn | Apache 2.0 |