Directional Stimulus Prompting

Directional Stimulus Prompting (DSP) is a prompting framework that uses a small, tunable policy model to generate instance-specific hints, keywords, or stimuli that guide a large frozen language model toward desired outputs. Rather than modifying the LLM itself, DSP optimizes a lightweight auxiliary model that produces targeted guidance for each input.1)

How It Works

DSP introduces a directional stimulus – discrete tokens generated by a small policy model – that is inserted into the prompt fed to a frozen LLM. The framework operates as follows:

  1. A lightweight policy model (e.g., T5-small or T5-base) processes the input to produce a stimulus (hints, keywords, or clues).
  2. The stimulus is concatenated into the prompt alongside the original input.
  3. The augmented prompt is fed to the frozen LLM (e.g., GPT-3), which generates the final output guided by the stimulus.

For example, in a summarization task, the policy model might extract keywords like “pandemic, vaccines, global response” from an article. These keywords are inserted into the prompt, guiding the LLM to produce a summary that covers those key topics.
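The three steps above can be sketched as a small pipeline. This is an illustrative sketch, not the paper's implementation: `policy_model` and `llm` are hypothetical callables standing in for a tuned T5-style policy and a frozen LLM, and the prompt template is an assumption.

```python
# Minimal sketch of DSP inference. `policy_model` and `llm` are hypothetical
# stand-ins for the tuned policy model and the frozen LLM.

def generate_stimulus(policy_model, article: str) -> str:
    """Step 1: the small policy model produces a stimulus (e.g., keywords)."""
    return policy_model(article)  # e.g., "pandemic; vaccines; global response"

def build_prompt(article: str, stimulus: str) -> str:
    """Step 2: concatenate the stimulus into the prompt with the input."""
    return (
        f"Article: {article}\n"
        f"Keywords: {stimulus}\n"
        "Write a short summary covering the keywords above:"
    )

def dsp_summarize(policy_model, llm, article: str) -> str:
    """Step 3: the frozen LLM generates output guided by the stimulus."""
    stimulus = generate_stimulus(policy_model, article)
    return llm(build_prompt(article, stimulus))
```

Note that only the policy model's output changes per input; the LLM and the prompt template stay fixed.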

Training the Policy Model

The policy model is trained in two stages:2)

Supervised Fine-Tuning (SFT)

The policy model is first trained on labeled data where stimuli are derived from reference outputs. For summarization, keywords are extracted from gold-standard summaries. This provides a warm start for the policy.
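A sketch of how SFT training pairs might be built: the target stimulus is keywords shared between the article and its gold summary. The overlap heuristic below is an illustrative assumption; the paper's actual extraction procedure may differ.

```python
# Sketch of constructing (input, target-stimulus) pairs for policy SFT.
# Keywords are taken as non-stopword summary tokens that also occur in the
# article -- a simplified, illustrative extraction heuristic.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}

def extract_keywords(article: str, gold_summary: str, k: int = 5) -> list[str]:
    tokenize = lambda s: re.findall(r"[a-z]+", s.lower())
    article_words = set(tokenize(article))
    summary_counts = Counter(w for w in tokenize(gold_summary)
                             if w in article_words and w not in STOPWORDS)
    return [w for w, _ in summary_counts.most_common(k)]

def make_sft_pair(article: str, gold_summary: str) -> tuple[str, str]:
    """Policy model learns: article in, stimulus string out."""
    return article, "; ".join(extract_keywords(article, gold_summary))
```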

Reinforcement Learning (RL)

The policy is further refined using reinforcement learning with rewards based on LLM output quality (e.g., ROUGE scores for summarization or human preference scores). This allows the policy to explore stimulus strategies that produce better LLM outputs than those found through supervised training alone.
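The RL stage can be sketched as a REINFORCE-style loop: sample a stimulus from the policy, score the frozen LLM's output, and weight the policy's log-probability by that reward. Everything here is a simplified placeholder (the reward is a unigram-overlap stand-in for ROUGE, and `policy_sample`/`llm` are hypothetical callables), not the paper's training code.

```python
# Toy REINFORCE-style sketch of the RL refinement stage. The reward is a
# stand-in for ROUGE; a real setup would backpropagate the returned loss
# through the policy model with a framework such as PyTorch.

def reward(llm_output: str, reference: str) -> float:
    """Stand-in for ROUGE: fraction of reference unigrams covered."""
    out = set(llm_output.lower().split())
    ref = set(reference.lower().split())
    return len(out & ref) / max(len(ref), 1)

def rl_step(policy_sample, llm, article: str, reference: str) -> float:
    stimulus, log_prob = policy_sample(article)       # sample from the policy
    output = llm(f"{article}\nKeywords: {stimulus}")  # frozen LLM generation
    r = reward(output, reference)                     # score the LLM output
    # REINFORCE objective: maximize r * log_prob, i.e., minimize its negative.
    return -r * log_prob
```

Because the reward is computed on the LLM's output, the policy can discover stimuli that steer the frozen model better than supervised targets alone.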

This approach converts LLM optimization into a much cheaper policy model optimization problem, enabling fine-grained, instance-specific control over black-box LLMs.

Applications

Benchmark Results

DSP consistently outperforms standard prompting baselines:3)

Task                      | Standard Prompting | DSP (SFT) | DSP (RL)
--------------------------|--------------------|-----------|---------
Summarization (ROUGE-L)   | ~25-30             | +1-2 pts  | +2-3 pts
Dialogue (Preference)     | Baseline           | Improved  | Best

Key findings:

Advantages

Limitations

See Also

References

2)
Li et al. 2023, Section 3
3)
Li et al. 2023, experimental results