====== Multimodal Chain-of-Thought Prompting ======

Multimodal Chain-of-Thought (Multimodal-CoT) extends traditional chain-of-thought reasoning beyond text to incorporate multiple modalities, primarily text and images. The approach uses a two-stage framework that separates rationale generation from answer inference, allowing the model to leverage visual information during reasoning.((Zhang et al. 2023, [[https://arxiv.org/abs/2302.00923|Multimodal Chain-of-Thought Reasoning in Language Models]]))

===== Motivation =====

Existing chain-of-thought studies have focused primarily on the language modality.((Wei et al. 2022, Chain-of-Thought Prompting)) However, many real-world reasoning tasks involve visual information -- diagrams, charts, photographs, and scientific figures. Just as a textbook without figures or tables would severely limit knowledge acquisition, text-only reasoning discards much of what such tasks require. Multimodal-CoT addresses this gap by jointly modeling the text and vision modalities during the reasoning process.

===== Two-Stage Framework =====

Multimodal-CoT operates through two distinct stages:

==== Stage 1: Rationale Generation ====

The model receives both **language inputs** (question, context, options) and **vision inputs** (associated images) to generate intermediate reasoning steps called rationales. These rationales combine information from both modalities, articulating visual insights in language form. For example, given a physics question about magnets with an accompanying diagram, the rationale might state: "The north pole of one magnet is closest to the south pole of the other magnet. Poles that are different attract."

==== Stage 2: Answer Inference ====

The rationale generated in Stage 1 is **appended to the original language input**. This augmented input, along with the original vision input, is used to infer the final answer. The rationale serves as a bridge between visual grounding and explicit textual reasoning.
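The two-stage flow above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the ''generate'' function is a hypothetical stand-in for any multimodal seq2seq model call (in the paper, a fine-tuned T5-class model with a vision encoder), and here it simply returns canned strings for the magnet example so the pipeline is runnable.

```python
def generate(text_input: str, image_features, mode: str) -> str:
    """Hypothetical stand-in for a multimodal language-model call.

    A real system would condition a seq2seq model on both the text input
    and the vision-encoder features; this stub returns fixed strings.
    """
    if mode == "rationale":
        return ("The north pole of one magnet is closest to the south pole "
                "of the other magnet. Poles that are different attract.")
    return "(B) attract"


def multimodal_cot(question: str, context: str, options, image_features):
    # Stage 1: generate a rationale from language + vision inputs.
    language_input = (f"Question: {question}\n"
                      f"Context: {context}\n"
                      f"Options: {', '.join(options)}")
    rationale = generate(language_input, image_features, mode="rationale")

    # Stage 2: append the rationale to the original language input and
    # infer the answer, conditioning on the same vision input again.
    augmented_input = f"{language_input}\nRationale: {rationale}"
    answer = generate(augmented_input, image_features, mode="answer")
    return rationale, answer


rationale, answer = multimodal_cot(
    "Will these magnets attract or repel each other?",
    "Two magnets are placed as shown in the diagram.",
    ["(A) repel", "(B) attract"],
    image_features=None,  # placeholder for vision-encoder output
)
```

Note that both stages see the same vision input; only the language side changes between stages, which is what lets the two stages be trained and optimized independently.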
===== ScienceQA Benchmark Results =====

Multimodal-CoT achieved state-of-the-art performance on the ScienceQA benchmark with a model of under 1 billion parameters:((Results from Zhang et al. 2023 and Papers with Code leaderboard))

| **Category** | **Accuracy** |
| Natural Science | 95.91% |
| Social Science | 82.00% |
| Language Science | 90.82% |
| Text Context | 95.26% |
| Image Context | 88.80% |
| No Context | 92.89% |
| Grades 1-6 | 92.44% |
| Grades 7-12 | 90.31% |
| **Average Accuracy** | **91.68%** |

The method was also evaluated on the A-OKVQA benchmark, further demonstrating its effectiveness across multimodal reasoning tasks.

===== Comparison to Text-Only CoT =====

| **Aspect** | **Text-Only CoT** | **Multimodal-CoT** |
| Input modalities | Text only | Text + Images |
| Reasoning basis | Linguistic information | Cross-modal information |
| Visual understanding | Relies on text descriptions | Directly processes images |
| Model size needed | Very large (100B+) for best results | Under 1B achieves SOTA |
| Hallucination | Common in visual tasks | Mitigated through visual grounding |

===== Key Advantages =====

  * **Hallucination mitigation**: Visual grounding helps prevent the model from generating reasoning steps that contradict the image.
  * **Enhanced convergence speed**: The two-stage approach converges faster than single-stage multimodal reasoning.
  * **Parameter efficiency**: Achieves state-of-the-art results with sub-1B parameter models, making it practical for deployment.
  * **Modular design**: The two stages can be independently optimized.

===== Limitations =====

  * **Requires paired data**: Training needs datasets with aligned text and image inputs.
  * **Two-stage complexity**: The sequential pipeline adds inference latency compared to single-pass methods.
  * **Vision encoder dependency**: Performance is bounded by the quality of the vision encoder used.
  * **Limited modality scope**: Currently focused on text and images; does not extend to audio, video, or other modalities.

===== See Also =====

  * [[prompt_engineering]]
  * [[chain_of_thought_prompting]]
  * [[zero_shot_prompting]]
  * [[few_shot_prompting]]
  * [[program_aided_language_models]]

===== References =====