Explainable AI (XAI)

Explainable AI (XAI) refers to the processes, methods, and techniques that make the outputs and internal decisions of artificial intelligence systems understandable and interpretable by humans. XAI directly addresses the “black box” problem — the opacity of modern machine learning models, particularly deep neural networks, whose internal reasoning is not directly observable.

Definition and Importance

A model is considered explainable when a human stakeholder can understand why it produced a given output, not just what the output was. This distinction is critical across a growing range of applications.

Trust and adoption. Practitioners and end-users are less likely to deploy or rely on systems they cannot interrogate. Explainability builds warranted confidence and allows operators to identify failure modes before deployment.

Bias detection and fairness. XAI tools can reveal that a model has learned spurious or discriminatory correlations. In 2021, for example, the Dutch Data Protection Authority fined the national Tax and Customs Administration €2.75 million after investigations found that an automated risk-classification system for childcare-benefit applications discriminated on the basis of (dual) nationality.1)

Regulatory compliance. Legislation in multiple jurisdictions now mandates or incentivises explainability for high-stakes automated decisions (see Regulatory Drivers below).

Accountability and auditability. When AI systems make consequential decisions — in medicine, criminal justice, lending, or hiring — stakeholders require a trail of reasoning that can be reviewed, challenged, and corrected.

Debugging and model improvement. Explanations expose which features drive predictions, enabling engineers to detect data leakage, overfitting, or distribution shift that aggregate metrics such as accuracy cannot reveal.

Key Techniques

LIME (Local Interpretable Model-Agnostic Explanations)

LIME, introduced by Ribeiro et al. in 2016, approximates any black-box model locally around a single prediction using a simpler, interpretable surrogate — typically a sparse linear model.2) The technique perturbs the input, observes the model's response, and fits the surrogate to that local neighbourhood. Because LIME operates on model inputs and outputs only, it is model-agnostic and works with images, text, and tabular data.
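The local-surrogate recipe is simple enough to sketch from scratch for tabular inputs. The following is a minimal illustration (the `black_box` function, Gaussian perturbations, and RBF proximity kernel are assumptions of this sketch; in practice one would use the `lime` package):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical black box: only feature 0 matters much.
def black_box(X):
    return 3.0 * X[:, 0] + 0.01 * X[:, 1]

def lime_explain(f, x, n_samples=2000, kernel_width=0.75):
    """Fit a weighted linear surrogate to f in a neighbourhood of x."""
    # 1. Perturb the instance with Gaussian noise.
    Z = x + rng.normal(scale=0.5, size=(n_samples, x.size))
    # 2. Query the black box on the perturbations.
    y = f(Z)
    # 3. Weight samples by proximity to x (RBF kernel).
    d = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(d ** 2) / kernel_width ** 2)
    # 4. Fit an interpretable surrogate; its coefficients are the explanation.
    surrogate = Ridge(alpha=1.0).fit(Z, y, sample_weight=w)
    return surrogate.coef_

x = np.array([1.0, 1.0])
coefs = lime_explain(black_box, x)
print(coefs)  # coefficient for feature 0 dominates
```

Here the surrogate recovers the black box's local behaviour almost exactly because the model is linear; for nonlinear models the coefficients describe only the local neighbourhood.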

SHAP (SHapley Additive exPlanations)

SHAP, introduced by Lundberg and Lee in 2017, assigns each feature a contribution value derived from Shapley values — a solution concept from cooperative game theory that guarantees a set of desirable mathematical properties: local accuracy, missingness, and consistency.3) SHAP values are the unique additive feature attributions satisfying these axioms. Efficient implementations such as TreeSHAP allow exact computation for tree-based models; KernelSHAP provides a model-agnostic approximation. SHAP has become the de facto standard for feature attribution in industry.
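For a handful of features, Shapley values can be computed exactly by enumerating coalitions. The brute-force sketch below assumes that absent features are set to a baseline value (a simplification of SHAP's missingness handling) and verifies the local-accuracy property:

```python
import itertools
import math
import numpy as np

def shapley_values(f, x, baseline):
    """Exact Shapley values by enumerating all feature coalitions."""
    n = len(x)
    phi = np.zeros(n)
    def value(S):
        # Features outside coalition S are held at the baseline.
        z = baseline.copy()
        z[list(S)] = x[list(S)]
        return f(z)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(len(others) + 1):
            for S in itertools.combinations(others, r):
                # Shapley kernel weight |S|! (n-|S|-1)! / n!
                w = math.factorial(len(S)) * math.factorial(n - len(S) - 1) / math.factorial(n)
                phi[i] += w * (value(S + (i,)) - value(S))
    return phi

# Hypothetical model with an interaction term between features 0 and 2.
f = lambda z: z[0] + 2.0 * z[1] + z[0] * z[2]
x = np.array([1.0, 1.0, 1.0])
baseline = np.zeros(3)
phi = shapley_values(f, x, baseline)
# Local accuracy: attributions sum to f(x) - f(baseline).
print(phi, phi.sum(), f(x) - f(baseline))
```

Note how the interaction credit is split evenly between features 0 and 2, which is exactly the behaviour the Shapley axioms guarantee.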

Attention Visualization

Transformer-based models expose attention weights, which can be visualised as heatmaps to indicate which tokens a model “attended to” when producing an output. Tools such as BertViz4) render these weights interactively. However, the field has debated whether attention constitutes a genuine explanation: Jain and Wallace (2019) demonstrated that attention distributions can often be substantially altered, or replaced with adversarially constructed ones, without changing predictions, challenging the assumption that high attention implies causal importance.5) Subsequent work by Wiegreffe and Pinter (2019) partially rebutted these claims, leaving the debate open.
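The quantity such tools visualise is the row-normalised score matrix of scaled dot-product attention. A minimal single-head sketch with random query and key matrices:

```python
import numpy as np

def attention_weights(Q, K):
    """Scaled dot-product attention weights: softmax(Q K^T / sqrt(d))."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
Q = rng.normal(size=(4, 8))  # 4 query tokens, head dimension 8
K = rng.normal(size=(4, 8))
A = attention_weights(Q, K)
# Each row is a probability distribution over attended tokens; these
# rows are what a heatmap (one cell per query/key token pair) renders.
print(A.round(2))
```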

TCAV (Testing with Concept Activation Vectors)

Proposed by Kim et al. (2018), TCAV probes the internal representations of a neural network for human-defined concepts — e.g., “stripes” or “medical instrument” — rather than relying on input features.6) A linear classifier is trained to separate the activations that concept examples produce at a chosen layer from the activations of random counterexamples; the vector normal to its decision boundary is the Concept Activation Vector (CAV). The TCAV score then measures, via directional derivatives along the CAV, how sensitive a class prediction is to that concept.
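A toy sketch of the procedure on synthetic activations (the concept direction, the linear separability of the activations, and the simulated gradients are all assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Hypothetical layer activations: concept examples shifted along one direction.
concept_dir = np.zeros(16)
concept_dir[0] = 1.0
acts_concept = rng.normal(size=(100, 16)) + 3.0 * concept_dir
acts_random = rng.normal(size=(100, 16))

# 1. Train a linear classifier; the unit normal of its boundary is the CAV.
X = np.vstack([acts_concept, acts_random])
y = np.array([1] * 100 + [0] * 100)
clf = LogisticRegression(max_iter=1000).fit(X, y)
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])

# 2. TCAV score: fraction of class examples whose gradient of the class
#    logit (w.r.t. the layer activations) points along the CAV.
#    Per-example gradients are simulated around a concept-sensitive head.
head_w = concept_dir + 0.1 * rng.normal(size=16)
grads = head_w + 0.2 * rng.normal(size=(100, 16))
tcav_score = float(np.mean(grads @ cav > 0))
print(tcav_score)  # near 1: the class is sensitive to the concept
```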

Concept Bottleneck Models

Koh et al. (2020) introduced Concept Bottleneck Models (CBMs), an architecture that forces the network to first predict a set of human-interpretable concepts and then use those concepts — rather than raw features — to produce the final label.7) This design allows interventions at test time: a clinician can override a predicted concept to explore counterfactuals. The trade-off is that concept quality bounds model accuracy.
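A minimal two-stage sketch (the synthetic data, the AND-of-concepts labelling rule, and the use of logistic regression for both stages are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Synthetic task: two binary concepts are derived from the raw features,
# and the label is simply the AND of the concepts.
X = rng.normal(size=(500, 6))
concepts = (X[:, :2] > 0).astype(int)   # ground-truth concept annotations
labels = concepts[:, 0] & concepts[:, 1]

# Stage 1: predict each concept from the raw features.
concept_models = [LogisticRegression().fit(X, concepts[:, k]) for k in range(2)]
# Stage 2: predict the label from the concepts alone (the bottleneck).
label_model = LogisticRegression().fit(concepts, labels)

x_new = rng.normal(size=(1, 6))
c_hat = np.array([[m.predict(x_new)[0] for m in concept_models]])
print("concepts:", c_hat[0], "label:", label_model.predict(c_hat)[0])

# Test-time intervention: flip one predicted concept and observe the
# counterfactual label, without touching the raw features.
c_flip = c_hat.copy()
c_flip[0, 0] = 1 - c_flip[0, 0]
print("after intervention:", label_model.predict(c_flip)[0])
```

The intervention step is the point of the architecture: because the label model only ever sees concepts, overriding a concept yields a well-defined counterfactual prediction.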

Integrated Gradients

Integrated Gradients (IG), introduced by Sundararajan et al. (2017), attributes a prediction to each input feature by integrating the model's gradients along a straight-line path from a baseline (e.g., a black image or zero embedding) to the actual input.8) IG satisfies two axioms — Sensitivity and Implementation Invariance — that many simpler gradient methods violate. It is widely used for attributions in computer vision and NLP without requiring model modification.
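Given a gradient function, IG reduces to an average of gradients along the path, scaled by the input-baseline difference. A minimal sketch on an assumed quadratic model, checking the completeness property that attributions sum to f(x) minus f(baseline):

```python
import numpy as np

# Hypothetical differentiable model f(x) = x0^2 + 3*x1, with analytic gradient.
f = lambda x: x[0] ** 2 + 3.0 * x[1]
grad_f = lambda x: np.array([2.0 * x[0], 3.0])

def integrated_gradients(grad, x, baseline, steps=200):
    """Approximate IG with a midpoint Riemann sum along the straight-line path."""
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)
    avg_grad = np.array([grad(p) for p in path]).mean(axis=0)
    return (x - baseline) * avg_grad

x = np.array([2.0, 1.0])
baseline = np.zeros(2)
ig = integrated_gradients(grad_f, x, baseline)
# Completeness check: attributions sum to f(x) - f(baseline).
print(ig, ig.sum(), f(x) - f(baseline))
```

In frameworks with autodiff (e.g., Captum's `IntegratedGradients` for PyTorch) the gradient function comes from backpropagation rather than a hand-written formula.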

Regulatory Drivers

Governments and standards bodies have moved to codify explainability requirements. The table below summarises key instruments as of early 2026.

| Instrument | Jurisdiction | Key Requirement | Penalty |
|---|---|---|---|
| EU AI Act (entered into force Aug 2024; most obligations apply from Aug 2026) | European Union | Article 50 mandates transparency obligations for certain AI systems; high-risk systems must provide logs and explanations sufficient for human oversight | Up to €35 million or 7 % of global turnover |
| GDPR Article 22 | European Union | Individuals subject to solely automated decisions have the right to obtain “meaningful information about the logic involved” | Up to €20 million or 4 % of global turnover |
| US Executive Order on AI (Oct 2023) | United States | Requires federal agencies to assess explainability of AI used in critical decisions; references NIST AI RMF guidance | Varies by agency |
| OSFI Guideline B-15 | Canada | Financial institutions must be able to explain AI-driven decisions to affected individuals | Supervisory action |
| MAS FEAT Principles | Singapore | Fairness, Ethics, Accountability, Transparency principles require explainability for financial AI | Supervisory action |

The EU AI Act is the most consequential current instrument: its definition of high-risk AI systems (Annex III) covers biometric identification, critical infrastructure, education, employment, essential services, law enforcement, migration, and administration of justice — all areas in which XAI is now a compliance requirement rather than merely a best practice.

XAI for LLMs and Agents

Large language models (LLMs) and autonomous agents present qualitatively harder explainability challenges than classical ML models: outputs are open-ended text rather than labels from a fixed set, behaviour can span multi-step tool use, and the computation is distributed across billions of parameters.

Current approaches include:

Probing classifiers. Linear classifiers trained on internal activations test whether specific concepts (syntax, world facts, sentiment) are linearly decodable from a given layer. Probing reveals where knowledge is encoded but does not fully explain how it is used.
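A sketch of probing on synthetic “hidden states” (the activation matrices and the linearly encoded sentiment bit are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)

# Hypothetical hidden states: a binary "sentiment" attribute is linearly
# encoded (along dimension 5) in one layer and absent from another.
n, d = 400, 32
sentiment = rng.integers(0, 2, size=n)
layer_encodes = rng.normal(size=(n, d)) + 2.0 * sentiment[:, None] * np.eye(d)[5]
layer_random = rng.normal(size=(n, d))

def probe_accuracy(acts, labels):
    """Held-out accuracy of a linear probe trained on the activations."""
    Xtr, Xte, ytr, yte = train_test_split(acts, labels, random_state=0)
    return LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)

acc_enc = probe_accuracy(layer_encodes, sentiment)   # high: concept decodable
acc_rand = probe_accuracy(layer_random, sentiment)   # near chance: not decodable
print(acc_enc, acc_rand)
```

Comparing probe accuracy across layers is how such studies locate where a concept is encoded; the held-out split guards against the probe memorising noise.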

Chain-of-Thought (CoT) as explanation. Prompting LLMs to produce step-by-step reasoning (Wei et al., 2022) yields human-readable rationales. Whether these faithfully reflect the model's underlying computation remains an open research question — models can produce plausible CoT that diverges from their actual inference path.

MExGen (IBM, ACL 2025). IBM Research presented MExGen (Multi-Level Explanations for Generative Language Models) at ACL 2025, a framework that extends perturbation-based attribution in the LIME and SHAP family to generative settings, attributing an LLM's generated text to segments of the input at multiple levels of granularity (e.g., paragraphs, sentences, phrases).

Mechanistic interpretability. Research groups including Anthropic's interpretability team pursue circuit-level analysis — identifying the minimal sub-graph of attention heads and MLP layers responsible for a specific behaviour. Anthropic has published findings on induction heads, indirect object identification circuits, and superposition, treating the model as a system to be reverse-engineered rather than merely probed from outside.

XAI Tools

| Tool | Maintainer | Techniques | Model Types | License |
|---|---|---|---|---|
| SHAP | Community (Lundberg) | Shapley values, TreeSHAP, KernelSHAP, DeepSHAP | Tree, deep, linear, model-agnostic | MIT |
| LIME | Community (Ribeiro) | Local surrogate models | Model-agnostic (tabular, text, image) | BSD |
| InterpretML | Microsoft | EBM, SHAP, LIME, partial dependence | Tabular focus | MIT |
| AIX360 | IBM Research | LIME, SHAP, BRCG, ProtoDash, TED, ProfWeight | Broad | Apache 2.0 |
| Captum | Meta AI | IG, GradCAM, SHAP, TCAV, LayerConductance | PyTorch models | BSD |
| BertViz | Jesse Vig | Attention head visualisation | Transformers (HuggingFace) | Apache 2.0 |
| What-If Tool | Google PAIR | Counterfactuals, partial dependence, fairness slices | TensorFlow, Scikit-learn | Apache 2.0 |

References

1) For background on regulatory enforcement trends in algorithmic decision-making, see the European Parliament's AI Act overview.
2) Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. arXiv:1602.04938
3) Lundberg, S. M., & Lee, S.-I. (2017). A Unified Approach to Interpreting Model Predictions. arXiv:1705.07874
4) Vig, J. (2019). A Multiscale Visualization of Attention in the Transformer Model. arXiv:1906.05714
5) Jain, S., & Wallace, B. C. (2019). Attention is not Explanation. arXiv:1902.10186
6) Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., & Sayres, R. (2018). Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). arXiv:1711.11279
7) Koh, P. W., Nguyen, T., Tang, Y. S., Mussmann, S., Pierson, E., Kim, B., & Liang, P. (2020). Concept Bottleneck Models. arXiv:2007.04612
8) Sundararajan, M., Taly, A., & Yan, Q. (2017). Axiomatic Attribution for Deep Networks. arXiv:1703.01365