Directional Stimulus Prompting

Directional Stimulus Prompting (DSP) is a prompting framework that uses a small, tunable policy model to generate instance-specific hints, keywords, or stimuli that guide a large frozen language model toward desired outputs. Rather than modifying the LLM itself, DSP optimizes a lightweight auxiliary model that produces targeted guidance for each input.1)

How It Works

DSP introduces a directional stimulus – discrete tokens generated by a small policy model – that is inserted into the prompt fed to a frozen LLM. The framework operates as follows:

  1. A lightweight policy model (e.g., T5-small or T5-base) processes the input to produce a stimulus (hints, keywords, or clues).
  2. The stimulus is concatenated into the prompt alongside the original input.
  3. The augmented prompt is fed to the frozen LLM (e.g., GPT-3), which generates the final output guided by the stimulus.

For example, in a summarization task, the policy model might extract keywords like “pandemic, vaccines, global response” from an article. These keywords are inserted into the prompt, guiding the LLM to produce a summary that covers those key topics.
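The three steps above can be sketched as a small pipeline. This is an illustrative sketch, not the paper's implementation: `policy_model` and `llm` are hypothetical callables standing in for a tuned T5-style policy and a frozen LLM, and the prompt template is an assumption.

```python
# Minimal sketch of DSP inference. `policy_model` and `llm` are hypothetical
# stand-ins for the tuned policy model and the frozen LLM.

def generate_stimulus(policy_model, article: str) -> str:
    """Step 1: the small policy model produces a stimulus (e.g., keywords)."""
    return policy_model(article)  # e.g., "pandemic; vaccines; global response"

def build_prompt(article: str, stimulus: str) -> str:
    """Step 2: concatenate the stimulus into the prompt with the input."""
    return (
        f"Article: {article}\n"
        f"Keywords: {stimulus}\n"
        "Write a short summary covering the keywords above:"
    )

def dsp_summarize(policy_model, llm, article: str) -> str:
    """Step 3: the frozen LLM generates output guided by the stimulus."""
    stimulus = generate_stimulus(policy_model, article)
    return llm(build_prompt(article, stimulus))
```

Note that only the policy model's output changes per input; the LLM and the prompt template stay fixed.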

Training the Policy Model

The policy model is trained in two stages:2)

Supervised Fine-Tuning (SFT)

The policy model is first trained on labeled data where stimuli are derived from reference outputs. For summarization, keywords are extracted from gold-standard summaries. This provides a warm start for the policy.
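A sketch of how SFT training pairs might be built: the target stimulus is keywords shared between the article and its gold summary. The overlap heuristic below is an illustrative assumption; the paper's actual extraction procedure may differ.

```python
# Sketch of constructing (input, target-stimulus) pairs for policy SFT.
# Keywords are taken as non-stopword summary tokens that also occur in the
# article -- a simplified, illustrative extraction heuristic.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}

def extract_keywords(article: str, gold_summary: str, k: int = 5) -> list[str]:
    tokenize = lambda s: re.findall(r"[a-z]+", s.lower())
    article_words = set(tokenize(article))
    summary_counts = Counter(w for w in tokenize(gold_summary)
                             if w in article_words and w not in STOPWORDS)
    return [w for w, _ in summary_counts.most_common(k)]

def make_sft_pair(article: str, gold_summary: str) -> tuple[str, str]:
    """Policy model learns: article in, stimulus string out."""
    return article, "; ".join(extract_keywords(article, gold_summary))
```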

Reinforcement Learning (RL)

The policy is further refined using reinforcement learning with rewards based on LLM output quality (e.g., ROUGE scores for summarization or human preference scores). This allows the policy to explore stimulus strategies that produce better LLM outputs than those found through supervised training alone.
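The RL stage can be sketched as a REINFORCE-style loop: sample a stimulus from the policy, score the frozen LLM's output, and weight the policy's log-probability by that reward. Everything here is a simplified placeholder (the reward is a unigram-overlap stand-in for ROUGE, and `policy_sample`/`llm` are hypothetical callables), not the paper's training code.

```python
# Toy REINFORCE-style sketch of the RL refinement stage. The reward is a
# stand-in for ROUGE; a real setup would backpropagate the returned loss
# through the policy model with a framework such as PyTorch.

def reward(llm_output: str, reference: str) -> float:
    """Stand-in for ROUGE: fraction of reference unigrams covered."""
    out = set(llm_output.lower().split())
    ref = set(reference.lower().split())
    return len(out & ref) / max(len(ref), 1)

def rl_step(policy_sample, llm, article: str, reference: str) -> float:
    stimulus, log_prob = policy_sample(article)       # sample from the policy
    output = llm(f"{article}\nKeywords: {stimulus}")  # frozen LLM generation
    r = reward(output, reference)                     # score the LLM output
    # REINFORCE objective: maximize r * log_prob, i.e., minimize its negative.
    return -r * log_prob
```

Because the reward is computed on the LLM's output, the policy can discover stimuli that steer the frozen model better than supervised targets alone.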

This approach converts LLM optimization into a much cheaper policy model optimization problem, enabling fine-grained, instance-specific control over black-box LLMs.

Applications

Benchmark Results

DSP consistently outperforms standard prompting baselines:3)

Task                      | Standard Prompting | DSP (SFT) | DSP (RL)
--------------------------|--------------------|-----------|---------
Summarization (ROUGE-L)   | ~25-30             | +1-2 pts  | +2-3 pts
Dialogue (Preference)     | Baseline           | Improved  | Best

Key findings:

Advantages

Limitations

See Also

References

2)
Li et al. 2023, Section 3
3)
Li et al. 2023, experimental results