Multimodal Chain-of-Thought (Multimodal-CoT) extends traditional chain-of-thought reasoning beyond text to incorporate multiple modalities, primarily text and images. The approach uses a two-stage framework that separates rationale generation from answer inference, allowing the model to leverage visual information during reasoning.1)
Existing chain-of-thought studies have focused primarily on the language modality.2) However, many real-world reasoning tasks involve visual information – diagrams, charts, photographs, and scientific figures. A textbook with no figures or tables severely limits knowledge acquisition. Multimodal-CoT addresses this gap by jointly modeling text and vision modalities during the reasoning process.
Multimodal-CoT operates through two distinct stages:
**Stage 1: Rationale Generation.** The model receives both language inputs (question, context, options) and vision inputs (associated images) and generates intermediate reasoning steps called rationales. These rationales combine information from both modalities, articulating visual insights in language form.
For example, given a physics question about magnets with an accompanying diagram, the rationale might state: “The north pole of one magnet is closest to the south pole of the other magnet. Poles that are different attract.”
**Stage 2: Answer Inference.** The original language input is appended with the rationale generated in Stage 1. This augmented input, together with the original vision input, is used to infer the final answer. The rationale serves as a bridge between visual grounding and explicit textual reasoning.
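The two-stage flow can be sketched as follows. This is a minimal illustration, not the paper's implementation: the two `generate_*` functions stand in for the fine-tuned vision-language models, and their bodies here are hypothetical stubs that only show how the inputs are wired together.

```python
# Sketch of the two-stage Multimodal-CoT pipeline.
# generate_rationale and infer_answer are stand-ins for the two
# fine-tuned models; only the data flow between stages is real.

def generate_rationale(question, context, options, image_features):
    # Stage 1: fuse language and vision inputs, emit reasoning steps.
    # (Stub output; a real model would condition on image_features.)
    return ("The north pole of one magnet is closest to the south pole "
            "of the other magnet. Poles that are different attract.")

def infer_answer(question, context, options, image_features, rationale):
    # Stage 2: append the Stage-1 rationale to the language input,
    # then score the options given the augmented input + vision input.
    augmented = f"{question}\n{context}\nRationale: {rationale}"
    # (Stub choice; a real model would rank options under `augmented`.)
    return options[0]

def multimodal_cot(question, context, options, image_features):
    rationale = generate_rationale(question, context, options, image_features)
    answer = infer_answer(question, context, options, image_features, rationale)
    return rationale, answer
```

The key design point is that the answer model never reasons from scratch: it consumes the rationale as additional textual context alongside the original image.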
Multimodal-CoT achieved state-of-the-art performance on the ScienceQA benchmark with a model under 1 billion parameters:3)
| Category | Accuracy |
| --- | --- |
| Natural Science | 95.91% |
| Social Science | 82.00% |
| Language Science | 90.82% |
| Text Context | 95.26% |
| Image Context | 88.80% |
| No Context | 92.89% |
| Grades 1-6 | 92.44% |
| Grades 7-12 | 90.31% |
| **Average Accuracy** | **91.68%** |
The method was also evaluated on the A-OKVQA benchmark, further demonstrating its effectiveness across multimodal reasoning tasks.
| Aspect | Text-Only CoT | Multimodal-CoT |
| --- | --- | --- |
| Input modalities | Text only | Text + images |
| Reasoning basis | Linguistic information | Cross-modal information |
| Visual understanding | Relies on text descriptions | Directly processes images |
| Model size needed | Very large (100B+) for best results | Under 1B achieves SOTA |
| Hallucination | Common in visual tasks | Mitigated through visual grounding |