Skeleton-of-Thought (SoT) is a decoding acceleration method introduced by Ning et al. in 2023 that speeds up LLM generation by up to 2.39x through a two-stage process: first generating a concise answer skeleton, then expanding each point in parallel. The method requires no model modifications or specialized hardware; it works as a pure prompting and orchestration strategy.
Standard LLM inference is inherently sequential: tokens are generated one at a time in autoregressive fashion, making latency proportional to output length. SoT breaks this bottleneck by observing that many answers are naturally decomposable into independent points that can be expanded concurrently. This mirrors how humans often outline their thoughts before elaborating.
SoT operates in two stages:

1. **Skeleton stage:** the model is prompted to produce a short outline of the answer as a list of concise points.
2. **Point-expansion stage:** each skeleton point is expanded concurrently, via parallel API calls or batched decoding.

The final answer concatenates all expanded points, optionally with a refinement pass for fluency.
For standard sequential decoding, latency scales linearly with output length:
$$T_{\text{seq}}(N) \approx N \cdot t$$
where $N$ is the number of output tokens and $t$ is per-token generation time. With SoT and $K$ parallel expansion points:
$$T_{\text{SoT}} \approx T_{\text{skeleton}} + \max_{i \in \{1,\ldots,K\}} T_{\text{point}_i}$$
For balanced point lengths, this reduces latency to approximately $T_{\text{seq}} / K$, i.e., a near-$K$-fold speedup. In practice, the skeleton is short (typically under 100 tokens), so $T_{\text{skeleton}}$ is negligible relative to expansion time.
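Plugging representative numbers into these formulas makes the model concrete. All values below (per-token latency, answer length, point count) are illustrative assumptions, not measurements from the paper; the idealized model also ignores per-request overhead, which is one reason measured speedups are lower.

```python
# Illustrative latency model for sequential decoding vs. SoT.
# All numbers are hypothetical assumptions for the sake of the calculation.
t = 0.05              # per-token generation time in seconds (assumed 50 ms)
N = 600               # total answer length in tokens (assumed)
K = 5                 # number of skeleton points (assumed)
skeleton_tokens = 60  # skeleton length, under the ~100-token observation

# Sequential: latency scales linearly with output length.
T_seq = N * t

# SoT: skeleton first, then the slowest parallel expansion dominates.
point_lengths = [N // K] * K  # balanced points, 120 tokens each
T_sot = skeleton_tokens * t + max(point_lengths) * t

speedup = T_seq / T_sot
print(f"T_seq = {T_seq:.1f}s, T_SoT = {T_sot:.1f}s, speedup = {speedup:.2f}x")
```

Even with the skeleton overhead included, balanced points yield a speedup close to the idealized $T_{\text{seq}} / K$; imbalanced point lengths pull the ratio down, since the longest expansion sets the critical path.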
SoT includes a router that dynamically selects expansion prompts based on point type, which improves expansion quality by avoiding a generic one-size-fits-all prompt.
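The routing idea can be sketched with simple keyword heuristics. The template names and matching rules below are hypothetical illustrations, not the paper's actual router:

```python
# Sketch of routing a skeleton point to a type-specific expansion prompt.
# Template names and keyword heuristics are illustrative assumptions.
EXPANSION_TEMPLATES = {
    "definition": "Define and briefly explain: {point}",
    "procedure": "Describe how to carry out this step in 2-3 sentences: {point}",
    "generic": "Expand on this point with 2-3 detailed sentences: {point}",
}

def route_expansion_prompt(point: str) -> str:
    """Pick an expansion template based on surface cues in the point text."""
    lowered = point.lower()
    if any(k in lowered for k in ("definition", "what is", "meaning")):
        template = EXPANSION_TEMPLATES["definition"]
    elif any(k in lowered for k in ("step", "install", "configure", "run")):
        template = EXPANSION_TEMPLATES["procedure"]
    else:
        template = EXPANSION_TEMPLATES["generic"]
    return template.format(point=point)
```

A production router would more plausibly use the LLM itself (or a small classifier) to label point types, but the dispatch structure is the same: map each point to the prompt best suited to expanding it.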
```python
import asyncio

import openai


async def skeleton_of_thought(query, client):
    # Stage 1: Generate skeleton
    skeleton_prompt = (
        f"Question: {query}\n"
        "Give a skeleton outline of your answer with 3-7 key points. "
        "Each point should be a concise phrase, one per line. "
        "Do not elaborate -- just the skeleton."
    )
    skeleton_resp = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": skeleton_prompt}],
    )
    points = [
        p.strip()
        for p in skeleton_resp.choices[0].message.content.split("\n")
        if p.strip()
    ]

    # Stage 2: Expand points in parallel
    async def expand_point(point, idx):
        expand_prompt = (
            f"Question: {query}\n"
            f"Expand on this specific point with 2-3 detailed sentences:\n"
            f"Point {idx + 1}: {point}"
        )
        resp = await client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": expand_prompt}],
        )
        return resp.choices[0].message.content

    expansions = await asyncio.gather(
        *[expand_point(p, i) for i, p in enumerate(points)]
    )

    # Combine
    return "\n\n".join(
        f"**{point}**\n{expansion}"
        for point, expansion in zip(points, expansions)
    )
```
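The pipeline can be exercised offline by passing in a stub client that mimics the response shape of the OpenAI SDK. The `FakeClient` below is a hypothetical test double, not part of the `openai` library; it returns canned replies in order, which is enough to drive both SoT stages without network access.

```python
# Hypothetical stub client mimicking the openai.AsyncOpenAI response shape
# (resp.choices[0].message.content) so the SoT pipeline can run offline.
import asyncio
from types import SimpleNamespace


class FakeClient:
    """Returns canned completions in order, OpenAI-response-shaped."""

    def __init__(self, replies):
        self._replies = iter(replies)
        self.chat = SimpleNamespace(
            completions=SimpleNamespace(create=self._create)
        )

    async def _create(self, model, messages):
        text = next(self._replies)
        message = SimpleNamespace(content=text)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


async def demo():
    client = FakeClient([
        "Point A\nPoint B",   # skeleton reply
        "Expansion of A.",    # point 1 expansion
        "Expansion of B.",    # point 2 expansion
    ])
    resp = await client.chat.completions.create(model="gpt-4", messages=[])
    return resp.choices[0].message.content


skeleton_text = asyncio.run(demo())
```

Passing `FakeClient([...])` as the `client` argument to `skeleton_of_thought` runs the full skeleton-then-expand flow deterministically, which is convenient for unit tests.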
Evaluated on the Topical-200 benchmark across 12 LLMs:
| Model | Speedup | Notes |
|---|---|---|
| GPT-4 | 2.0x | Parallel API calls |
| Vicuna-13B | 2.4x | Batched GPU decoding |
| LLaMA-65B | 2.3x | Larger models benefit more |
| Average (12 models) | 1.9-2.39x | Consistent across architectures |
Quality assessment (1800 human-evaluated answers):