Skeleton-of-Thought (SoT)

Skeleton-of-Thought (SoT) is a decoding acceleration method introduced by Ning et al. in 2023 that speeds up LLM generation by up to 2.39x through a two-stage process: first generating a concise answer skeleton, then expanding each point in parallel. The method requires no model modifications or specialized hardware – it works as a pure prompting and orchestration strategy.

Overview

Standard LLM inference is inherently sequential: tokens are generated one at a time in autoregressive fashion, making latency proportional to output length. SoT breaks this bottleneck by observing that many answers are naturally decomposable into independent points that can be expanded concurrently. This mirrors how humans often outline their thoughts before elaborating.

Two-Stage Process

SoT operates in two stages:

  1. Stage 1 – Skeleton Generation: The LLM generates a concise outline of 3-10 key points covering the main aspects of the answer, without details or filler. This is a short generation that completes quickly.
  2. Stage 2 – Parallel Point Expansion: Each skeleton point is expanded independently and concurrently. For API-based models (e.g., GPT-4), this uses parallel API calls. For open-source models (e.g., LLaMA), this uses batched decoding on the GPU.

The final answer concatenates all expanded points, optionally with a refinement pass for fluency.
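Between the two stages, the skeleton text must be split into individual points. A minimal parser for this step might look like the following (a sketch, assuming the model emits one point per line, possibly with "1." or "-" list markers):

```python
import re

def parse_skeleton(skeleton: str) -> list[str]:
    """Split skeleton text into points, stripping blank lines and list markers."""
    points = []
    for line in skeleton.splitlines():
        line = line.strip()
        if not line:
            continue
        # Remove a leading "1." / "2)" numbering or a "-" / "*" bullet marker.
        line = re.sub(r"^(\d+[.)]|[-*])\s*", "", line)
        if line:
            points.append(line)
    return points
```

For example, `parse_skeleton("1. Definition\n2. History")` yields `["Definition", "History"]`, ready for parallel expansion.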

Latency Analysis

For standard sequential decoding, latency scales linearly with output length:

$$T_{\text{seq}}(N) \approx N \cdot t$$

where $N$ is the number of output tokens and $t$ is per-token generation time. With SoT and $K$ parallel expansion points:

$$T_{\text{SoT}} \approx T_{\text{skeleton}} + \max_{i \in \{1,\ldots,K\}} T_{\text{point}_i}$$

For balanced point lengths, SoT latency approaches $T_{\text{seq}} / K$, i.e., a speedup of roughly $K$. In practice, the skeleton is short (typically under 100 tokens), making $T_{\text{skeleton}}$ negligible relative to expansion time.
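To make the latency arithmetic concrete, here is a back-of-the-envelope sketch; the token counts and per-token time are illustrative assumptions, not measurements from the paper:

```python
def sot_speedup(point_tokens, skeleton_tokens=50, per_token_s=0.02):
    """Estimate SoT speedup over sequential decoding.

    point_tokens: expanded length (in tokens) of each skeleton point.
    Sequential latency ~ total tokens * t; SoT latency ~ skeleton + longest point.
    """
    t_seq = sum(point_tokens) * per_token_s
    t_sot = (skeleton_tokens + max(point_tokens)) * per_token_s
    return t_seq / t_sot

# Five balanced points of 100 tokens each with a 50-token skeleton:
# 500 / (50 + 100) ~ 3.3x. Imbalanced points shrink the gain, since
# latency is bounded by the longest expansion.
```

Note that a single very long point dominates $\max_i T_{\text{point}_i}$ and erodes most of the benefit, which is why balanced skeletons matter.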

The Router Module

The extended variant, SoT-R, adds a router module that decides per question whether SoT is suitable at all: questions that hinge on sequential, step-by-step reasoning (e.g., math or coding) fall back to normal decoding, while decomposable questions go through the two-stage pipeline. The paper implements the router either by prompting an LLM or with a small fine-tuned RoBERTa classifier.

This gating preserves answer quality on questions that parallel expansion would handle poorly.
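As a rough stand-in for a trained router, a keyword heuristic can flag questions that need strictly sequential reasoning and should bypass SoT. This is purely illustrative; the keyword list below is an assumption, not from the paper:

```python
# Questions needing step-by-step reasoning are poor SoT candidates.
SEQUENTIAL_HINTS = ("solve", "calculate", "prove", "step by step", "debug")

def route(question: str) -> str:
    """Return 'sot' for decomposable questions, 'sequential' otherwise."""
    q = question.lower()
    if any(hint in q for hint in SEQUENTIAL_HINTS):
        return "sequential"
    return "sot"
```

A real router would use an LLM prompt or a trained classifier rather than substring matching, but the control flow is the same: pick the decoding path before Stage 1 runs.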

Code Example

import asyncio
import openai
 
async def skeleton_of_thought(query, client):
    # `client` must be an openai.AsyncOpenAI instance so that the
    # chat.completions.create calls below return awaitables.
    # Stage 1: Generate skeleton
    skeleton_prompt = (
        f"Question: {query}\n"
        "Give a skeleton outline of your answer with 3-7 key points. "
        "Each point should be a concise phrase, one per line. "
        "Do not elaborate -- just the skeleton."
    )
    skeleton_resp = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": skeleton_prompt}]
    )
    points = [
        p.strip() for p in skeleton_resp.choices[0].message.content.split("\n")
        if p.strip()
    ]
 
    # Stage 2: Expand points in parallel
    async def expand_point(point, idx):
        expand_prompt = (
            f"Question: {query}\n"
            f"Expand on this specific point with 2-3 detailed sentences:\n"
            f"Point {idx + 1}: {point}"
        )
        resp = await client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": expand_prompt}]
        )
        return resp.choices[0].message.content
 
    expansions = await asyncio.gather(
        *[expand_point(p, i) for i, p in enumerate(points)]
    )
 
    # Combine
    return "\n\n".join(
        f"**{point}**\n{expansion}"
        for point, expansion in zip(points, expansions)
    )
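The speedup in Stage 2 comes entirely from overlapping the per-point calls with asyncio.gather. A self-contained sketch with simulated API latency (the 0.1 s delays are invented for illustration) shows that the waits overlap rather than add up:

```python
import asyncio
import time

async def fake_expand(point: str, delay: float) -> str:
    """Stand-in for one expansion API call; sleeps to simulate latency."""
    await asyncio.sleep(delay)
    return f"{point}: expanded"

async def run_demo():
    points = ["definition", "history", "applications", "limitations"]
    start = time.perf_counter()
    # gather() schedules all four "calls" concurrently on the event loop.
    expansions = await asyncio.gather(*(fake_expand(p, 0.1) for p in points))
    return expansions, time.perf_counter() - start

expansions, elapsed = asyncio.run(run_demo())
# Four 0.1 s calls finish in roughly 0.1 s total, not 0.4 s.
```

The same property holds for real API calls: total Stage 2 latency tracks the slowest single expansion, matching the $\max_i$ term in the latency analysis above.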

Experimental Results

Evaluated on the Topical-200 benchmark across 12 LLMs:

  Model                  Speedup      Notes
  GPT-4                  2.0x         Parallel API calls
  Vicuna-13B             2.4x         Batched GPU decoding
  LLaMA-65B              2.3x         Larger models benefit more
  Average (12 models)    1.9-2.39x    Consistent across architectures

Quality assessment (1800 human-evaluated answers):

Limitations

SoT assumes the points of an answer are mutually independent. Questions that require sequential, step-by-step reasoning -- such as math, coding, or multi-step logic -- are poor fits and can see degraded answer quality. Parallel expansion also increases total token consumption, since each expansion call repeats the question as context, raising cost relative to a single sequential generation. Finally, concatenated expansions can read as list-like and lack cross-point coherence unless a refinement pass is applied.

References

Ning, X., Lin, Z., Zhou, Z., Wang, Z., Yang, H., & Wang, Y. (2023). Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding. arXiv:2307.15337.
