Multimodal Chain-of-Thought Prompting

Multimodal Chain-of-Thought (Multimodal-CoT) extends traditional chain-of-thought reasoning beyond text to incorporate multiple modalities, primarily text and images. The approach uses a two-stage framework that separates rationale generation from answer inference, allowing the model to leverage visual information during reasoning.1)

Motivation

Existing chain-of-thought studies have focused primarily on the language modality.2) However, many real-world reasoning tasks involve visual information: diagrams, charts, photographs, and scientific figures. Relying on text alone is like studying a textbook without figures or tables, which severely limits knowledge acquisition. Multimodal-CoT addresses this gap by jointly modeling the text and vision modalities during the reasoning process.

Two-Stage Framework

Multimodal-CoT operates through two distinct stages:

Stage 1: Rationale Generation

The model receives both language inputs (question, context, options) and vision inputs (associated images) to generate intermediate reasoning steps called rationales. These rationales combine information from both modalities, articulating visual insights in language form.
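The language side of the Stage 1 input can be sketched as a simple prompt assembly. This is an illustrative sketch only: the helper name `build_rationale_input` is hypothetical, and the real system additionally fuses extracted vision features inside the model rather than in the text prompt.

```python
def build_rationale_input(question: str, context: str, options: list[str]) -> str:
    """Assemble the language input for Stage 1 (rationale generation).

    In Multimodal-CoT the model also conditions on vision features for the
    associated image; only the text side is shown here (hypothetical sketch).
    """
    # Label options (a), (b), (c), ...
    opts = " ".join(f"({chr(97 + i)}) {o}" for i, o in enumerate(options))
    return (
        f"Question: {question}\n"
        f"Context: {context}\n"
        f"Options: {opts}\n"
        "Rationale:"  # the model continues from here with reasoning steps
    )

prompt = build_rationale_input(
    "Will these magnets attract or repel each other?",
    "Two bar magnets are shown in the figure.",
    ["attract", "repel"],
)
```

The model's continuation of this prompt, conditioned on the image, is the rationale that Stage 2 consumes.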

For example, given a physics question about magnets with an accompanying diagram, the rationale might state: “The north pole of one magnet is closest to the south pole of the other magnet. Poles that are different attract.”

Stage 2: Answer Inference

The rationale generated in Stage 1 is appended to the original language input. This augmented input, together with the original vision input, is used to infer the final answer. The rationale thus serves as a bridge between visual grounding and explicit textual reasoning.
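The two-stage flow can be sketched end to end. The `model` callable below is a hypothetical stand-in for a vision-language model call (in the original work, a fine-tuned encoder-decoder with fused image features); the toy model exists only so the control flow is runnable.

```python
from typing import Callable

def multimodal_cot(
    model: Callable[[str, bytes], str],  # (text input, image bytes) -> generated text
    question_text: str,
    image: bytes,
) -> str:
    """Two-stage Multimodal-CoT: generate a rationale, then infer the answer."""
    # Stage 1: rationale generation from the language and vision inputs.
    rationale = model(question_text, image)
    # Stage 2: append the rationale to the original language input and
    # infer the answer, again conditioning on the same vision input.
    answer_input = f"{question_text}\nRationale: {rationale}\nAnswer:"
    return model(answer_input, image)

# Toy stand-in model, for illustration only.
def toy_model(text: str, image: bytes) -> str:
    if "Rationale:" in text:
        return "(b) attract"          # Stage 2: final answer
    return "Unlike poles attract."    # Stage 1: rationale

print(multimodal_cot(toy_model, "Will these magnets attract or repel?", b""))
# prints "(b) attract"
```

Separating the two calls is what lets the answer stage condition on an explicit, text-form articulation of the visual evidence instead of on raw image features alone.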

ScienceQA Benchmark Results

Multimodal-CoT achieved state-of-the-art performance on the ScienceQA benchmark with a model under 1 billion parameters:3)

Category            Accuracy
------------------  --------
Natural Science      95.91%
Social Science       82.00%
Language Science     90.82%
Text Context         95.26%
Image Context        88.80%
No Context           92.89%
Grades 1-6           92.44%
Grades 7-12          90.31%
Average              91.68%

The method was also evaluated on the A-OKVQA benchmark, further demonstrating its effectiveness across multimodal reasoning tasks.

Comparison to Text-Only CoT

Aspect                 Text-Only CoT                         Multimodal-CoT
---------------------  ------------------------------------  ----------------------------------
Input modalities       Text only                             Text + images
Reasoning basis        Linguistic information                Cross-modal information
Visual understanding   Relies on text descriptions           Directly processes images
Model size needed      Very large (100B+) for best results   Under 1B achieves SOTA
Hallucination          Common in visual tasks                Mitigated through visual grounding

Key Advantages

Limitations

See Also

References

1)
Zhang et al. 2023, Multimodal Chain-of-Thought Reasoning in Language Models
2)
Wei et al. 2022, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
3)
Results from Zhang et al. 2023 and the Papers with Code leaderboard