====== Multimodal Chain-of-Thought Prompting ======

Multimodal Chain-of-Thought (Multimodal-CoT) extends traditional chain-of-thought reasoning beyond text to incorporate multiple modalities, primarily text and images. The approach uses a two-stage framework that separates rationale generation from answer inference, allowing the model to leverage visual information during reasoning.((Zhang et al. 2023, [[https://arxiv.org/abs/2302.00923|Multimodal Chain-of-Thought Reasoning in Language Models]]))

===== Motivation =====

Existing chain-of-thought studies have focused primarily on the language modality.((Wei et al. 2022, Chain-of-Thought Prompting)) However, many real-world reasoning tasks involve visual information -- diagrams, charts, photographs, and scientific figures. Just as a textbook without figures or tables would severely limit knowledge acquisition, text-only reasoning discards much of what such tasks require. Multimodal-CoT addresses this gap by jointly modeling the text and vision modalities during the reasoning process.

===== Two-Stage Framework =====

Multimodal-CoT operates through two distinct stages:

==== Stage 1: Rationale Generation ====

The model receives both **language inputs** (question, context, options) and **vision inputs** (associated images) to generate intermediate reasoning steps called rationales. These rationales combine information from both modalities, articulating visual insights in language form. For example, given a physics question about magnets with an accompanying diagram, the rationale might state: "The north pole of one magnet is closest to the south pole of the other magnet. Poles that are different attract."

==== Stage 2: Answer Inference ====

The rationale generated in Stage 1 is **appended to the original language input**. This augmented input, along with the original vision input, is used to infer the final answer. The rationale serves as a bridge between visual grounding and explicit textual reasoning.
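The two-stage flow above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the ''generate'' function is a hypothetical stand-in for any multimodal seq2seq model call (in the paper, a fine-tuned T5-class model with a vision encoder), and here it simply returns canned strings for the magnet example so the pipeline is runnable.

```python
def generate(text_input: str, image_features, mode: str) -> str:
    """Hypothetical stand-in for a multimodal language-model call.

    A real system would condition a seq2seq model on both the text input
    and the vision-encoder features; this stub returns fixed strings.
    """
    if mode == "rationale":
        return ("The north pole of one magnet is closest to the south pole "
                "of the other magnet. Poles that are different attract.")
    return "(B) attract"


def multimodal_cot(question: str, context: str, options, image_features):
    # Stage 1: generate a rationale from language + vision inputs.
    language_input = (f"Question: {question}\n"
                      f"Context: {context}\n"
                      f"Options: {', '.join(options)}")
    rationale = generate(language_input, image_features, mode="rationale")

    # Stage 2: append the rationale to the original language input and
    # infer the answer, conditioning on the same vision input again.
    augmented_input = f"{language_input}\nRationale: {rationale}"
    answer = generate(augmented_input, image_features, mode="answer")
    return rationale, answer


rationale, answer = multimodal_cot(
    "Will these magnets attract or repel each other?",
    "Two magnets are placed as shown in the diagram.",
    ["(A) repel", "(B) attract"],
    image_features=None,  # placeholder for vision-encoder output
)
```

Note that both stages see the same vision input; only the language side changes between stages, which is what lets the two stages be trained and optimized independently.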
===== ScienceQA Benchmark Results =====

Multimodal-CoT achieved state-of-the-art performance on the ScienceQA benchmark with a model of under 1 billion parameters:((Results from Zhang et al. 2023 and Papers with Code leaderboard))

| **Category** | **Accuracy** |
| Natural Science | 95.91% |
| Social Science | 82.00% |
| Language Science | 90.82% |
| Text Context | 95.26% |
| Image Context | 88.80% |
| No Context | 92.89% |
| Grades 1-6 | 92.44% |
| Grades 7-12 | 90.31% |
| **Average Accuracy** | **91.68%** |

The method was also evaluated on the A-OKVQA benchmark, further demonstrating its effectiveness across multimodal reasoning tasks.

===== Comparison to Text-Only CoT =====

| **Aspect** | **Text-Only CoT** | **Multimodal-CoT** |
| Input modalities | Text only | Text + Images |
| Reasoning basis | Linguistic information | Cross-modal information |
| Visual understanding | Relies on text descriptions | Directly processes images |
| Model size needed | Very large (100B+) for best results | Under 1B achieves SOTA |
| Hallucination | Common in visual tasks | Mitigated through visual grounding |

===== Key Advantages =====

  * **Hallucination mitigation**: Visual grounding helps prevent the model from generating reasoning steps that contradict the image.
  * **Enhanced convergence speed**: The two-stage approach converges faster than single-stage multimodal reasoning.
  * **Parameter efficiency**: Achieves state-of-the-art results with sub-1B parameter models, making it practical for deployment.
  * **Modular design**: The two stages can be independently optimized.

===== Limitations =====

  * **Requires paired data**: Training needs datasets with aligned text and image inputs.
  * **Two-stage complexity**: The sequential pipeline adds inference latency compared to single-pass methods.
  * **Vision encoder dependency**: Performance is bounded by the quality of the vision encoder used.
  * **Limited modality scope**: Currently focused on text and images; does not extend to audio, video, or other modalities.

===== See Also =====

  * [[prompt_engineering]]
  * [[chain_of_thought_prompting]]
  * [[zero_shot_prompting]]
  * [[few_shot_prompting]]
  * [[program_aided_language_models]]

===== References =====