Explainable AI (XAI)

Explainable AI (XAI) refers to the processes, methods, and techniques that make the outputs and internal decisions of artificial intelligence systems understandable and interpretable by humans. XAI directly addresses the “black box” problem — the opacity of modern machine learning models, particularly deep neural networks, whose internal reasoning is not directly observable.

Definition and Importance

A model is considered explainable when a human stakeholder can understand why it produced a given output, not just what the output was. This distinction is critical across a growing range of applications.

Trust and adoption. Practitioners and end-users are less likely to deploy or rely on systems they cannot interrogate. Explainability builds warranted confidence and allows operators to identify failure modes before deployment.

Bias detection and fairness. XAI tools can reveal that a model has learned spurious or discriminatory correlations. In 2021, for example, the Dutch Data Protection Authority fined the national Tax and Customs Administration €2.75 million after investigations found that an automated risk-classification system for childcare-benefit applications discriminated on the basis of (dual) nationality.1)

Regulatory compliance. Legislation in multiple jurisdictions now mandates or incentivises explainability for high-stakes automated decisions (see Regulatory Drivers below).

Accountability and auditability. When AI systems make consequential decisions — in medicine, criminal justice, lending, or hiring — stakeholders require a trail of reasoning that can be reviewed, challenged, and corrected.

Debugging and model improvement. Explanations expose which features drive predictions, enabling engineers to detect data leakage, overfitting, or distribution shift that aggregate metrics such as accuracy cannot reveal.

Key Techniques

LIME (Local Interpretable Model-Agnostic Explanations)

LIME, introduced by Ribeiro et al. in 2016, approximates any black-box model locally around a single prediction using a simpler, interpretable surrogate — typically a sparse linear model.2) The technique perturbs the input, observes the model's response, and fits the surrogate to that local neighbourhood. Because LIME operates on model inputs and outputs only, it is model-agnostic and works with images, text, and tabular data.
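The local-surrogate recipe is simple enough to sketch from scratch for tabular inputs. The following is a minimal illustration (the `black_box` function, Gaussian perturbations, and RBF proximity kernel are assumptions of this sketch; in practice one would use the `lime` package):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical black box: only feature 0 matters much.
def black_box(X):
    return 3.0 * X[:, 0] + 0.01 * X[:, 1]

def lime_explain(f, x, n_samples=2000, kernel_width=0.75):
    """Fit a weighted linear surrogate to f in a neighbourhood of x."""
    # 1. Perturb the instance with Gaussian noise.
    Z = x + rng.normal(scale=0.5, size=(n_samples, x.size))
    # 2. Query the black box on the perturbations.
    y = f(Z)
    # 3. Weight samples by proximity to x (RBF kernel).
    d = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(d ** 2) / kernel_width ** 2)
    # 4. Fit an interpretable surrogate; its coefficients are the explanation.
    surrogate = Ridge(alpha=1.0).fit(Z, y, sample_weight=w)
    return surrogate.coef_

x = np.array([1.0, 1.0])
coefs = lime_explain(black_box, x)
print(coefs)  # coefficient for feature 0 dominates
```

Here the surrogate recovers the black box's local behaviour almost exactly because the model is linear; for nonlinear models the coefficients describe only the local neighbourhood.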

SHAP (SHapley Additive exPlanations)

SHAP, introduced by Lundberg and Lee in 2017, assigns each feature a contribution value derived from Shapley values — a solution concept from cooperative game theory that guarantees a set of desirable mathematical properties: local accuracy, missingness, and consistency.3) SHAP values are the unique additive feature attributions satisfying these axioms. Efficient implementations such as TreeSHAP allow exact computation for tree-based models; KernelSHAP provides a model-agnostic approximation. SHAP has become the de facto standard for feature attribution in industry.
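For a handful of features, Shapley values can be computed exactly by enumerating coalitions. The brute-force sketch below assumes that absent features are set to a baseline value (a simplification of SHAP's missingness handling) and verifies the local-accuracy property:

```python
import itertools
import math
import numpy as np

def shapley_values(f, x, baseline):
    """Exact Shapley values by enumerating all feature coalitions."""
    n = len(x)
    phi = np.zeros(n)
    def value(S):
        # Features outside coalition S are held at the baseline.
        z = baseline.copy()
        z[list(S)] = x[list(S)]
        return f(z)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(len(others) + 1):
            for S in itertools.combinations(others, r):
                # Shapley kernel weight |S|! (n-|S|-1)! / n!
                w = math.factorial(len(S)) * math.factorial(n - len(S) - 1) / math.factorial(n)
                phi[i] += w * (value(S + (i,)) - value(S))
    return phi

# Hypothetical model with an interaction term between features 0 and 2.
f = lambda z: z[0] + 2.0 * z[1] + z[0] * z[2]
x = np.array([1.0, 1.0, 1.0])
baseline = np.zeros(3)
phi = shapley_values(f, x, baseline)
# Local accuracy: attributions sum to f(x) - f(baseline).
print(phi, phi.sum(), f(x) - f(baseline))
```

Note how the interaction credit is split evenly between features 0 and 2, which is exactly the behaviour the Shapley axioms guarantee.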

Attention Visualization

Transformer-based models expose attention weights, which can be visualised as heatmaps to indicate which tokens a model “attended to” when producing an output. Tools such as BertViz4) render these weights interactively. However, the field has debated whether attention constitutes a genuine explanation: Jain and Wallace (2019) demonstrated that attention distributions can often be substantially altered, or replaced with adversarially constructed ones, without changing predictions, challenging the assumption that high attention implies causal importance.5) Subsequent work by Wiegreffe and Pinter (2019) partially rebutted these claims, leaving the debate open.
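The quantity such tools visualise is the row-normalised score matrix of scaled dot-product attention. A minimal single-head sketch with random query and key matrices:

```python
import numpy as np

def attention_weights(Q, K):
    """Scaled dot-product attention weights: softmax(Q K^T / sqrt(d))."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
Q = rng.normal(size=(4, 8))  # 4 query tokens, head dimension 8
K = rng.normal(size=(4, 8))
A = attention_weights(Q, K)
# Each row is a probability distribution over attended tokens; these
# rows are what a heatmap (one cell per query/key token pair) renders.
print(A.round(2))
```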

TCAV (Testing with Concept Activation Vectors)

Proposed by Kim et al. (2018), TCAV probes the internal representations of a neural network for human-defined concepts — e.g., “stripes” or “medical instrument” — rather than relying on input features.6) A linear classifier is trained to separate the activations that concept examples produce at a chosen layer from the activations of random counterexamples; the vector normal to its decision boundary is the Concept Activation Vector (CAV). The TCAV score then measures, via directional derivatives along the CAV, how sensitive a class prediction is to that concept.
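A toy sketch of the procedure on synthetic activations (the concept direction, the linear separability of the activations, and the simulated gradients are all assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Hypothetical layer activations: concept examples shifted along one direction.
concept_dir = np.zeros(16)
concept_dir[0] = 1.0
acts_concept = rng.normal(size=(100, 16)) + 3.0 * concept_dir
acts_random = rng.normal(size=(100, 16))

# 1. Train a linear classifier; the unit normal of its boundary is the CAV.
X = np.vstack([acts_concept, acts_random])
y = np.array([1] * 100 + [0] * 100)
clf = LogisticRegression(max_iter=1000).fit(X, y)
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])

# 2. TCAV score: fraction of class examples whose gradient of the class
#    logit (w.r.t. the layer activations) points along the CAV.
#    Per-example gradients are simulated around a concept-sensitive head.
head_w = concept_dir + 0.1 * rng.normal(size=16)
grads = head_w + 0.2 * rng.normal(size=(100, 16))
tcav_score = float(np.mean(grads @ cav > 0))
print(tcav_score)  # near 1: the class is sensitive to the concept
```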

Concept Bottleneck Models

Koh et al. (2020) introduced Concept Bottleneck Models (CBMs), an architecture that forces the network to first predict a set of human-interpretable concepts and then use those concepts — rather than raw features — to produce the final label.7) This design allows interventions at test time: a clinician can override a predicted concept to explore counterfactuals. The trade-off is that concept quality bounds model accuracy.
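A minimal two-stage sketch (the synthetic data, the AND-of-concepts labelling rule, and the use of logistic regression for both stages are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Synthetic task: two binary concepts are derived from the raw features,
# and the label is simply the AND of the concepts.
X = rng.normal(size=(500, 6))
concepts = (X[:, :2] > 0).astype(int)   # ground-truth concept annotations
labels = concepts[:, 0] & concepts[:, 1]

# Stage 1: predict each concept from the raw features.
concept_models = [LogisticRegression().fit(X, concepts[:, k]) for k in range(2)]
# Stage 2: predict the label from the concepts alone (the bottleneck).
label_model = LogisticRegression().fit(concepts, labels)

x_new = rng.normal(size=(1, 6))
c_hat = np.array([[m.predict(x_new)[0] for m in concept_models]])
print("concepts:", c_hat[0], "label:", label_model.predict(c_hat)[0])

# Test-time intervention: flip one predicted concept and observe the
# counterfactual label, without touching the raw features.
c_flip = c_hat.copy()
c_flip[0, 0] = 1 - c_flip[0, 0]
print("after intervention:", label_model.predict(c_flip)[0])
```

The intervention step is the point of the architecture: because the label model only ever sees concepts, overriding a concept yields a well-defined counterfactual prediction.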

Integrated Gradients

Integrated Gradients (IG), introduced by Sundararajan et al. (2017), attributes a prediction to each input feature by integrating the model's gradients along a straight-line path from a baseline (e.g., a black image or zero embedding) to the actual input.8) IG satisfies two axioms — Sensitivity and Implementation Invariance — that many simpler gradient methods violate. It is widely used for attributions in computer vision and NLP without requiring model modification.
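Given a gradient function, IG reduces to an average of gradients along the path, scaled by the input-baseline difference. A minimal sketch on an assumed quadratic model, checking the completeness property that attributions sum to f(x) minus f(baseline):

```python
import numpy as np

# Hypothetical differentiable model f(x) = x0^2 + 3*x1, with analytic gradient.
f = lambda x: x[0] ** 2 + 3.0 * x[1]
grad_f = lambda x: np.array([2.0 * x[0], 3.0])

def integrated_gradients(grad, x, baseline, steps=200):
    """Approximate IG with a midpoint Riemann sum along the straight-line path."""
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)
    avg_grad = np.array([grad(p) for p in path]).mean(axis=0)
    return (x - baseline) * avg_grad

x = np.array([2.0, 1.0])
baseline = np.zeros(2)
ig = integrated_gradients(grad_f, x, baseline)
# Completeness check: attributions sum to f(x) - f(baseline).
print(ig, ig.sum(), f(x) - f(baseline))
```

In frameworks with autodiff (e.g., Captum's `IntegratedGradients` for PyTorch) the gradient function comes from backpropagation rather than a hand-written formula.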

Regulatory Drivers

Governments and standards bodies have moved to codify explainability requirements. The table below summarises key instruments as of early 2026.

| Instrument | Jurisdiction | Key Requirement | Penalty |
|---|---|---|---|
| EU AI Act (entered into force Aug 2024; most obligations apply from Aug 2026) | European Union | Article 50 mandates transparency obligations for certain AI systems; high-risk systems must provide logs and explanations sufficient for human oversight | Up to €35 million or 7 % of global turnover |
| GDPR Article 22 | European Union | Individuals subject to solely automated decisions have the right to obtain “meaningful information about the logic involved” | Up to €20 million or 4 % of global turnover |
| US Executive Order on AI (Oct 2023) | United States | Requires federal agencies to assess explainability of AI used in critical decisions; references NIST AI RMF guidance | Varies by agency |
| OSFI Guideline B-15 | Canada | Financial institutions must be able to explain AI-driven decisions to affected individuals | Supervisory action |
| MAS FEAT Principles | Singapore | Fairness, Ethics, Accountability, Transparency principles require explainability for financial AI | Supervisory action |

The EU AI Act is the most consequential current instrument: its definition of high-risk AI systems (Annex III) covers biometric identification, critical infrastructure, education, employment, essential services, law enforcement, migration, and administration of justice — all areas in which XAI is now a compliance requirement rather than merely a best practice.

XAI for LLMs and Agents

Large language models (LLMs) and autonomous agents present qualitatively harder explainability challenges than classical ML models: outputs are open-ended text rather than labels from a fixed set, behaviour can span multi-step tool use, and the computation is distributed across billions of parameters.

Current approaches include:

Probing classifiers. Linear classifiers trained on internal activations test whether specific concepts (syntax, world facts, sentiment) are linearly decodable from a given layer. Probing reveals where knowledge is encoded but does not fully explain how it is used.
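A sketch of probing on synthetic “hidden states” (the activation matrices and the linearly encoded sentiment bit are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)

# Hypothetical hidden states: a binary "sentiment" attribute is linearly
# encoded (along dimension 5) in one layer and absent from another.
n, d = 400, 32
sentiment = rng.integers(0, 2, size=n)
layer_encodes = rng.normal(size=(n, d)) + 2.0 * sentiment[:, None] * np.eye(d)[5]
layer_random = rng.normal(size=(n, d))

def probe_accuracy(acts, labels):
    """Held-out accuracy of a linear probe trained on the activations."""
    Xtr, Xte, ytr, yte = train_test_split(acts, labels, random_state=0)
    return LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)

acc_enc = probe_accuracy(layer_encodes, sentiment)   # high: concept decodable
acc_rand = probe_accuracy(layer_random, sentiment)   # near chance: not decodable
print(acc_enc, acc_rand)
```

Comparing probe accuracy across layers is how such studies locate where a concept is encoded; the held-out split guards against the probe memorising noise.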

Chain-of-Thought (CoT) as explanation. Prompting LLMs to produce step-by-step reasoning (Wei et al., 2022) yields human-readable rationales. Whether these faithfully reflect the model's underlying computation remains an open research question — models can produce plausible CoT that diverges from their actual inference path.

MExGen (IBM, ACL 2025). IBM Research presented MExGen (Multi-Level Explanations for Generative Language Models) at ACL 2025, a framework that extends perturbation-based attribution in the LIME and SHAP family to generative settings, attributing an LLM's generated text to segments of the input at multiple levels of granularity (e.g., paragraphs, sentences, phrases).

Mechanistic interpretability. Research groups including Anthropic's interpretability team pursue circuit-level analysis — identifying the minimal sub-graph of attention heads and MLP layers responsible for a specific behaviour. Anthropic has published findings on induction heads, indirect object identification circuits, and superposition, treating the model as a system to be reverse-engineered rather than merely probed from outside.

XAI Tools

| Tool | Maintainer | Techniques | Model Types | License |
|---|---|---|---|---|
| SHAP | Community (Lundberg) | Shapley values, TreeSHAP, KernelSHAP, DeepSHAP | Tree, deep, linear, model-agnostic | MIT |
| LIME | Community (Ribeiro) | Local surrogate models | Model-agnostic (tabular, text, image) | BSD |
| InterpretML | Microsoft | EBM, SHAP, LIME, partial dependence | Tabular focus | MIT |
| AIX360 | IBM Research | LIME, SHAP, BRCG, ProtoDash, TED, ProfWeight | Broad | Apache 2.0 |
| Captum | Meta AI | IG, GradCAM, SHAP, TCAV, LayerConductance | PyTorch models | BSD |
| BertViz | Jesse Vig | Attention head visualisation | Transformers (HuggingFace) | Apache 2.0 |
| What-If Tool | Google PAIR | Counterfactuals, partial dependence, fairness slices | TensorFlow, Scikit-learn | Apache 2.0 |

References

1) For background on regulatory enforcement trends in algorithmic decision-making, see the European Parliament's AI Act overview.
2) Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. arXiv:1602.04938
3) Lundberg, S. M., & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions. arXiv:1705.07874
4) Vig, J. (2019). A Multiscale Visualization of Attention in the Transformer Model. arXiv:1906.05714
5) Jain, S., & Wallace, B. C. (2019). Attention is not Explanation. arXiv:1902.10186
6) Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., & Sayres, R. (2018). Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). arXiv:1711.11279
7) Koh, P. W., Nguyen, T., Tang, Y. S., Mussmann, S., Pierson, E., Kim, B., & Liang, P. (2020). Concept Bottleneck Models. arXiv:2007.04612
8) Sundararajan, M., Taly, A., & Yan, Q. (2017). Axiomatic Attribution for Deep Networks. arXiv:1703.01365