Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
Directional Stimulus Prompting (DSP) is a prompting framework that uses a small, tunable policy model to generate instance-specific hints, keywords, or stimuli that guide a large frozen language model toward desired outputs. Rather than modifying the LLM itself, DSP optimizes a lightweight auxiliary model that produces targeted guidance for each input.1)
DSP introduces a directional stimulus – discrete tokens generated by a small policy model – that is inserted into the prompt fed to a frozen LLM. Given an input, the policy model generates an instance-specific stimulus, the stimulus is combined with the input into a prompt, and the frozen LLM conditions on that prompt to produce its output.
For example, in a summarization task, the policy model might extract keywords like “pandemic, vaccines, global response” from an article. These keywords are inserted into the prompt, guiding the LLM to produce a summary that covers those key topics.
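The prompt-assembly step can be sketched as below. This is a minimal illustration: the template wording and the `build_dsp_prompt` helper are assumptions for this example, since DSP only specifies that policy-generated tokens are added to the frozen LLM's prompt, not an exact format.

```python
def build_dsp_prompt(article: str, stimulus_keywords: list[str]) -> str:
    """Combine an input article with a policy-generated directional
    stimulus (here, keywords) into a prompt for a frozen LLM.

    The template below is illustrative, not the paper's exact format.
    """
    hint = "; ".join(stimulus_keywords)
    return (
        f"Article: {article}\n"
        f"Keywords: {hint}\n"
        "Write a short summary of the article that covers the keywords above."
    )

prompt = build_dsp_prompt(
    "A report on the pandemic response...",
    ["pandemic", "vaccines", "global response"],
)
```

The frozen LLM never sees the policy model directly, only the stimulus tokens embedded in its prompt.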
The policy model is trained in two stages:2)
The policy model is first trained on labeled data where stimuli are derived from reference outputs. For summarization, keywords are extracted from gold-standard summaries. This provides a warm start for the policy.
The policy is further refined using reinforcement learning with rewards based on LLM output quality (e.g., ROUGE scores for summarization or human preference scores). This allows the policy to explore stimulus strategies that produce better LLM outputs than those found through supervised training alone.
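The RL stage can be sketched with a toy reward and a REINFORCE-style objective. The unigram-recall reward below is a simplified stand-in for ROUGE, and both function names are assumptions for this sketch, not DSP's actual implementation.

```python
def overlap_reward(llm_output: str, reference: str) -> float:
    """Toy stand-in for a ROUGE-based reward: the fraction of reference
    words that appear in the frozen LLM's output (unigram recall)."""
    out_words = set(llm_output.lower().split())
    ref_words = reference.lower().split()
    if not ref_words:
        return 0.0
    return sum(w in out_words for w in ref_words) / len(ref_words)

def policy_gradient_loss(logprob_of_stimulus: float, reward: float,
                         baseline: float = 0.0) -> float:
    """REINFORCE-style loss: minimizing -(reward - baseline) * log p
    pushes the policy toward stimuli whose downstream LLM outputs
    score higher, without ever updating the frozen LLM."""
    return -(reward - baseline) * logprob_of_stimulus
```

Only the small policy model receives gradients; the LLM is queried as a black box to score each candidate stimulus.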
This approach converts LLM optimization into a much cheaper policy model optimization problem, enabling fine-grained, instance-specific control over black-box LLMs.
DSP consistently outperforms standard prompting baselines:3)
| Task | Standard Prompting | DSP (SFT) | DSP (RL) |
|------|--------------------|-----------|----------|
| Summarization (ROUGE-L) | ~25-30 | +1-2 pts | +2-3 pts |
| Dialogue (preference) | Baseline | Improved | Best |
Key findings: