====== Skeleton-of-Thought (SoT) ======

**Skeleton-of-Thought (SoT)** is a decoding acceleration method introduced by Ning et al. in 2023 that reduces LLM generation latency by up to 2.39x through a two-stage process: first generating a concise answer skeleton, then expanding each point in parallel. The method requires no model modifications or specialized hardware -- it works as a pure prompting and orchestration strategy.

===== Overview =====

Standard LLM inference is inherently sequential: tokens are generated one at a time in autoregressive fashion, making latency proportional to output length. SoT breaks this bottleneck by observing that many answers are naturally decomposable into independent points that can be expanded concurrently. This mirrors how humans often outline their thoughts before elaborating.

===== Two-Stage Process =====

SoT operates in two stages:

  - **Stage 1 -- Skeleton Generation**: The LLM generates a concise outline of 3-10 key points covering the main aspects of the answer, without details or filler. This is a short generation that completes quickly.
  - **Stage 2 -- Parallel Point Expansion**: Each skeleton point is expanded independently and concurrently. For API-based models (e.g., GPT-4), this uses parallel API calls. For open-source models (e.g., LLaMA), this uses batched decoding on the GPU.

The final answer concatenates all expanded points, optionally with a refinement pass for fluency.

===== Latency Analysis =====

For standard sequential decoding, latency scales linearly with output length:

$$T_{\text{seq}}(N) \approx N \cdot t$$

where $N$ is the number of output tokens and $t$ is the per-token generation time. With SoT and $K$ parallel expansion points:

$$T_{\text{SoT}} \approx T_{\text{skeleton}} + \max_{i \in \{1,\ldots,K\}} T_{\text{point}_i}$$

For balanced point lengths, total latency drops to approximately $T_{\text{seq}} / K$, i.e., a near-$K$-fold speedup.
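As a worked illustration of this latency model, the sketch below computes the idealized speedup; the token counts are illustrative assumptions, not figures from the paper:

```python
def sot_speedup(n_tokens, k_points, skeleton_tokens, t_per_token=1.0):
    """Estimate SoT speedup under the idealized latency model:
    balanced point lengths and fully parallel expansion."""
    t_seq = n_tokens * t_per_token
    # Each of the K points carries ~N/K tokens, so the longest
    # expansion (the stage-2 bound) takes ~(N/K) * t.
    t_sot = (skeleton_tokens + n_tokens / k_points) * t_per_token
    return t_seq / t_sot

# Illustrative: 600-token answer, 5 balanced points, 60-token skeleton
# => 600 / (60 + 120) = 3.33x in the ideal case
print(round(sot_speedup(600, 5, 60), 2))
```

Real speedups are lower than this ideal because point lengths are unbalanced and API or batching overhead is nonzero.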
In practice, the skeleton is short (typically under 100 tokens), making $T_{\text{skeleton}}$ negligible relative to expansion time.

===== The Router Module =====

SoT includes a **router** that dynamically selects expansion prompts based on point type:

  * Classifies each skeleton point into categories (e.g., "description", "example", "reason") via a lightweight LLM call
  * Selects tailored expansion prompts from a predefined library
  * Adds minimal overhead (~5% of total time)

This improves expansion quality by avoiding generic one-size-fits-all prompts.

===== Code Example =====

A minimal async implementation of the two stages, assuming an ''AsyncOpenAI'' client is passed in:

<code python>
import asyncio

from openai import AsyncOpenAI  # client = AsyncOpenAI()

async def skeleton_of_thought(query, client):
    # Stage 1: Generate skeleton
    skeleton_prompt = (
        f"Question: {query}\n"
        "Give a skeleton outline of your answer with 3-7 key points. "
        "Each point should be a concise phrase, one per line. "
        "Do not elaborate -- just the skeleton."
    )
    skeleton_resp = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": skeleton_prompt}],
    )
    points = [
        p.strip()
        for p in skeleton_resp.choices[0].message.content.split("\n")
        if p.strip()
    ]

    # Stage 2: Expand points in parallel
    async def expand_point(point, idx):
        expand_prompt = (
            f"Question: {query}\n"
            f"Expand on this specific point with 2-3 detailed sentences:\n"
            f"Point {idx + 1}: {point}"
        )
        resp = await client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": expand_prompt}],
        )
        return resp.choices[0].message.content

    expansions = await asyncio.gather(
        *[expand_point(p, i) for i, p in enumerate(points)]
    )

    # Combine expanded points into the final answer
    return "\n\n".join(
        f"**{point}**\n{expansion}"
        for point, expansion in zip(points, expansions)
    )
</code>

===== Experimental Results =====

Evaluated on the Topical-200 benchmark across 12 LLMs:

^ Model ^ Speedup ^ Notes ^
| GPT-4 | 2.0x | Parallel API calls |
| Vicuna-13B | 2.4x | Batched GPU decoding |
| LLaMA-65B | 2.3x | Larger models benefit more |
| Average (12 models) | 1.9-2.39x | Consistent across architectures |
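The router module described earlier can be sketched as a classify-then-select step. The categories and prompt templates below are illustrative assumptions, and a keyword heuristic stands in for the paper's lightweight LLM classifier:

```python
# Hypothetical prompt library keyed by point category; SoT's real router
# selects from a predefined library after an LLM classification call.
EXPANSION_PROMPTS = {
    "example": "Give a concrete example illustrating this point:\n{point}",
    "reason": "Explain the reasoning behind this point in 2-3 sentences:\n{point}",
    "description": "Describe this point in 2-3 detailed sentences:\n{point}",
}

def route_expansion_prompt(point):
    """Pick a tailored expansion prompt for a skeleton point.

    Keyword matching here is a stand-in for the router's LLM classifier.
    """
    lowered = point.lower()
    if "example" in lowered or "e.g." in lowered:
        category = "example"
    elif "reason" in lowered or "because" in lowered:
        category = "reason"
    else:
        category = "description"
    return EXPANSION_PROMPTS[category].format(point=point)

print(route_expansion_prompt("Reasons for low latency"))
```

Swapping the heuristic for a small classification call preserves the design: per-point prompt selection at a few percent of total latency.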
**Quality assessment** (1800 human-evaluated answers):

  * SoT wins 60% of comparisons on coherence and immersion
  * Superior comprehensiveness due to multi-aspect skeleton planning
  * Higher diversity (lower Self-BLEU scores)
  * Comparable perplexity to sequential baselines

===== Limitations =====

  * **Dependency ignorance**: Assumes skeleton points are independent. Fails on sequential tasks (e.g., multi-step math where later steps depend on earlier results).
  * **Coherence gaps**: Approximately 40% of cases show slightly worse fluency than sequential generation.
  * **Prompt sensitivity**: Effectiveness depends on decomposable question structure; non-list-like answers benefit less.
  * **Not universal**: Questions requiring deep sequential reasoning (e.g., proofs) are better served by standard decoding.

===== References =====

  * [[https://arxiv.org/abs/2307.15337|Ning et al., "Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding", arXiv:2307.15337 (2023)]]
  * [[https://www.microsoft.com/en-us/research/blog/skeleton-of-thought-parallel-decoding-speeds-up-and-improves-llm-output/|Microsoft Research Blog]]

===== See Also =====

  * [[step_back_prompting|Step-Back Prompting]]
  * [[least_to_most_prompting|Least-to-Most Prompting]]
  * [[tool_learning_foundation_models|Tool Learning with Foundation Models]]