====== Reasoning-on-Tap ======

Reasoning-on-Tap is the concept that advanced reasoning capabilities — chain-of-thought processing, multi-step deduction, extended thinking — can be **toggled on or off** depending on the task, rather than being always active. It treats reasoning as a resource drawn on demand: users pay for deep thinking only when the problem requires it. ((Source: [[https://www.ultralytics.com/glossary/reasoning-models|Ultralytics - Reasoning Models]]))

===== The Two-System Analogy =====

The concept mirrors Daniel Kahneman's **System 1 / System 2** framework from cognitive psychology:

  * **System 1** (fast): Pattern matching, quick responses, low compute — suitable for simple questions, summarization, and chat
  * **System 2** (slow): Deliberate step-by-step reasoning, high compute — necessary for math proofs, complex code, scientific analysis

Traditional LLMs operate primarily in System 1 mode. Reasoning-on-Tap adds the ability to engage System 2 when needed, then disengage it to save cost and latency. ((Source: [[https://en.wikipedia.org/wiki/Reasoning_model|Wikipedia - Reasoning Model]]))

===== How Models Implement It =====

Different providers offer reasoning as a toggleable capability:

  * **Anthropic Claude**: Offers an **extended thinking mode** that can be activated per request. When enabled, the model generates internal reasoning traces before producing its final answer. When disabled, it responds with standard speed and cost. ((Source: [[https://hiflylabs.com/blog/2025/4/3/reasoning-models|Hiflylabs - Reasoning Models]]))
  * **OpenAI o1/o3**: Purpose-built reasoning models that inherently allocate extra inference-time compute for chain-of-thought generation, self-verification, and step-by-step refinement via reinforcement learning. These are separate model endpoints from GPT-4o. ((Source: [[https://www.ultralytics.com/glossary/reasoning-models|Ultralytics - Reasoning Models]]))
  * **DeepSeek R1**: An open-weights reasoning model trained with RL, demonstrating that reasoning capabilities can be added to existing base models through post-training.

===== Cost Implications =====

Reasoning-on-Tap fundamentally changes the economics of AI inference:

  * **Standard mode**: Fast and cheap — cents per million tokens for simple tasks
  * **Reasoning mode**: Slow and expensive — the model generates additional "thinking tokens" that consume compute but are not shown to the user

The key insight is that most queries do not need deep reasoning. A customer service chatbot answering FAQs should not pay the cost of a model solving differential equations. Reasoning-on-Tap allows organizations to **match compute cost to task complexity**. ((Source: [[https://www.ultralytics.com/glossary/reasoning-models|Ultralytics - Reasoning Models]]))

===== Impact on Model Selection =====

Reasoning-on-Tap shifts model selection from choosing a single model to **routing between modes**:

^ Task Type ^ Appropriate Mode ^ Cost Profile ^
| Simple Q&A, chat | Standard (System 1) | Low cost, low latency |
| Summarization, translation | Standard (System 1) | Low cost, low latency |
| Math, logic, proofs | Reasoning (System 2) | Higher cost, higher latency |
| Complex code generation | Reasoning (System 2) | Higher cost, higher latency |
| Multi-step analysis | Reasoning (System 2) | Higher cost, higher latency |

Intelligent routing systems can automatically detect when a query warrants reasoning mode, optimizing cost without user intervention.

===== Relationship to Inference-Time Compute =====

Reasoning-on-Tap is closely tied to the broader shift toward **inference-time compute scaling** described in [[post_training_rl_vs_scaling|Post-Training RL vs Model Scaling]].
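One common form of inference-time compute scaling — sample several candidates, verify each, keep the best — can be sketched in a few lines of Python. This is a minimal illustration, not any provider's actual API: ''generate'' and ''verify'' are hypothetical stand-ins for a model's sampler and an answer checker (e.g. running unit tests or checking a proof), demonstrated here on a toy arithmetic task.

```python
import random

def generate(query: str) -> str:
    """Hypothetical stand-in sampler: returns a noisy candidate answer
    for a toy arithmetic task (sometimes off by one)."""
    return str(eval(query) + random.choice([-1, 0, 0, 1]))

def verify(query: str, candidate: str) -> float:
    """Hypothetical stand-in verifier: scores 1.0 iff the candidate
    is exactly correct, 0.0 otherwise."""
    return 1.0 if int(candidate) == eval(query) else 0.0

def best_of_n(query: str, n: int) -> str:
    """Sample n candidates and return the highest-scoring one.
    Larger n means more inference-time compute spent per query, and a
    better chance that at least one candidate passes verification."""
    candidates = [generate(query) for _ in range(n)]
    return max(candidates, key=lambda c: verify(query, c))

answer = best_of_n("17 * 23", n=8)
```

The dial lives in the single parameter ''n'': an easy query can be served with ''n = 1'' (standard mode), while a hard one gets a larger sampling budget, which is one way "harder problems get more thinking time" can be realized in practice.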
Instead of spending more on pre-training, reasoning models spend more at inference — generating multiple candidate solutions, verifying each one, and selecting the best. Harder problems get more thinking time; easy problems get less. ((Source: [[https://en.wikipedia.org/wiki/Reasoning_model|Wikipedia - Reasoning Model]]))

===== Limitations =====

  * **Hidden reasoning**: Some providers (notably OpenAI with o1) do not expose the full reasoning trace to users, limiting transparency and debuggability
  * **Domain constraints**: Reasoning mode excels in domains with verifiable answers (math, code) but provides less clear benefit for subjective or creative tasks
  * **Latency**: Reasoning mode can take seconds to minutes, making it unsuitable for real-time interactive applications
  * **Cost unpredictability**: The number of thinking tokens varies per query, making costs harder to forecast

===== See Also =====

  * [[post_training_rl_vs_scaling|Post-Training RL vs Model Scaling]]
  * [[ai_self_verification|AI Self-Verification]]
  * [[inference_economics|Inference Economics]]

===== References =====