Multimodal Chain-of-Thought (Multimodal-CoT) extends traditional chain-of-thought reasoning beyond text to incorporate multiple modalities, primarily text and images. The approach uses a two-stage framework that separates rationale generation from answer inference, allowing the model to leverage visual information during reasoning.1)
Existing chain-of-thought studies have focused primarily on the language modality.2) However, many real-world reasoning tasks involve visual information – diagrams, charts, photographs, and scientific figures. A textbook with no figures or tables severely limits knowledge acquisition. Multimodal-CoT addresses this gap by jointly modeling text and vision modalities during the reasoning process.
Multimodal-CoT operates through two distinct stages:
**Stage 1: Rationale Generation.** The model receives both language inputs (question, context, options) and vision inputs (associated images) and generates intermediate reasoning steps called rationales. These rationales combine information from both modalities, articulating visual insights in language form.
For example, given a physics question about magnets with an accompanying diagram, the rationale might state: “The north pole of one magnet is closest to the south pole of the other magnet. Poles that are different attract.”
**Stage 2: Answer Inference.** The original language input is appended with the rationale generated in Stage 1. This augmented input, together with the original vision input, is used to infer the final answer. The rationale serves as a bridge between visual grounding and explicit textual reasoning.
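The two-stage flow can be sketched as follows. This is a minimal illustration, not the paper's implementation: the two `generate_*` functions stand in for the fine-tuned vision-language models, and their bodies here are hypothetical stubs that only show how the inputs are wired together.

```python
# Sketch of the two-stage Multimodal-CoT pipeline.
# generate_rationale and infer_answer are stand-ins for the two
# fine-tuned models; only the data flow between stages is real.

def generate_rationale(question, context, options, image_features):
    # Stage 1: fuse language and vision inputs, emit reasoning steps.
    # (Stub output; a real model would condition on image_features.)
    return ("The north pole of one magnet is closest to the south pole "
            "of the other magnet. Poles that are different attract.")

def infer_answer(question, context, options, image_features, rationale):
    # Stage 2: append the Stage-1 rationale to the language input,
    # then score the options given the augmented input + vision input.
    augmented = f"{question}\n{context}\nRationale: {rationale}"
    # (Stub choice; a real model would rank options under `augmented`.)
    return options[0]

def multimodal_cot(question, context, options, image_features):
    rationale = generate_rationale(question, context, options, image_features)
    answer = infer_answer(question, context, options, image_features, rationale)
    return rationale, answer
```

The key design point is that the answer model never reasons from scratch: it consumes the rationale as additional textual context alongside the original image.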
Multimodal-CoT achieved state-of-the-art performance on the ScienceQA benchmark with a model under 1 billion parameters:3)
| Category | Accuracy |
| --- | --- |
| Natural Science | 95.91% |
| Social Science | 82.00% |
| Language Science | 90.82% |
| Text Context | 95.26% |
| Image Context | 88.80% |
| No Context | 92.89% |
| Grades 1-6 | 92.44% |
| Grades 7-12 | 90.31% |
| **Average Accuracy** | **91.68%** |
The method was also evaluated on the A-OKVQA benchmark, further demonstrating its effectiveness across multimodal reasoning tasks.
| Aspect | Text-Only CoT | Multimodal-CoT |
| --- | --- | --- |
| Input modalities | Text only | Text + images |
| Reasoning basis | Linguistic information | Cross-modal information |
| Visual understanding | Relies on text descriptions | Directly processes images |
| Model size needed | Very large (100B+) for best results | Under 1B achieves SOTA |
| Hallucination | Common in visual tasks | Mitigated through visual grounding |