====== Autoregressive (AR) Decoding ======

**Autoregressive (AR) decoding** refers to the sequential token generation approach used in traditional language models, where each output token is generated one at a time, conditioned on all previously generated tokens. This decoding strategy is the foundational mechanism of modern large language models, enabling them to produce coherent, contextually appropriate text through a step-by-step generation process.

===== Overview and Definition =====

Autoregressive decoding operates on the principle of conditional probability: the model predicts the next token given all preceding tokens in the sequence. Mathematically, this follows the chain rule of probability: P(x₁, x₂, ..., xₙ) = ∏ᵢ P(xᵢ | x₁, x₂, ..., xᵢ₋₁). At each timestep, the model processes the current context and produces a probability distribution over the vocabulary, from which a token is sampled or selected. The process continues sequentially until a stop token is generated or a maximum length is reached (([[https://arxiv.org/abs/1706.03762|Vaswani et al. - Attention Is All You Need (2017)]])).

The term "autoregressive" stems from signal processing and statistics, where autoregressive models predict future values from past observations. In the context of language models, this translates to generating future tokens based on previously generated text, creating a natural left-to-right generation pattern that mirrors how humans read and write.

===== Technical Implementation =====

The implementation of AR decoding in transformer-based models involves several key components. The model maintains a **context window** containing all previously generated tokens, which grows with each generation step. At each timestep t, the model:

  - Encodes the current sequence [x₁, x₂, ..., xₜ₋₁] through the transformer stack
  - Applies attention mechanisms that compute relationships between all tokens
  - Outputs logits at the final position, representing unnormalized probabilities over the vocabulary for the next token xₜ
  - Applies temperature scaling and a sampling technique (top-k, top-p, or greedy selection) to select xₜ
  - Appends xₜ to the sequence and repeats

This sequential processing creates a strict dependency: token generation at step t requires the complete output of step t-1. Modern implementations use **key-value caching** (KV cache) to avoid redundant computation: previously computed attention keys and values are stored and reused, reducing the cost of the n-th generation step from O(n²) to O(n) (([[https://arxiv.org/abs/1911.02150|Shazeer - Fast Transformer Decoding: One Write-Head is All You Need (2019)]])).

Temperature and sampling parameters significantly affect output quality. Lower temperatures (approaching 0) make the model more deterministic and focused on high-probability tokens, while higher temperatures increase diversity and creativity. Nucleus sampling (top-p) and top-k sampling help prevent poor token selections while maintaining stochasticity.

===== Advantages and Applications =====

Autoregressive decoding offers several fundamental advantages. **Full-history conditioning** ensures that each generated token depends on the complete preceding context, allowing the model to maintain coherence across long sequences. This makes AR decoding particularly effective for tasks requiring logical continuity, such as multi-step reasoning, code generation, and narrative writing.

AR decoding has become the standard in production language models including the GPT series, Claude, and Gemini variants, demonstrating robust performance across diverse applications: question answering, dialogue systems, summarization, translation, and creative writing (([[https://arxiv.org/abs/2005.14165|Brown et al. - Language Models are Few-Shot Learners (2020)]])).

The approach's flexibility in sampling strategies enables diverse use cases.
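As a concrete illustration of the generation loop and these sampling strategies, the following is a minimal sketch. The `toy_model` stand-in (a hypothetical deterministic logits function, not a trained transformer) and all function names are illustrative assumptions, not part of any real library:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, rng=None):
    """Select the next token id from unnormalized logits.

    temperature == 0 means greedy argmax selection; top_k restricts
    sampling to the k highest-probability tokens.
    """
    rng = rng or np.random.default_rng()
    if temperature == 0.0:            # greedy decoding
        return int(np.argmax(logits))
    logits = logits / temperature     # temperature scaling
    if top_k is not None:             # mask everything outside the top k
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

def autoregressive_decode(model_logits, prompt, eos_id, max_new_tokens=20,
                          temperature=0.0, top_k=None, seed=0):
    """Generate tokens one at a time, conditioning on the full history."""
    rng = np.random.default_rng(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = model_logits(tokens)    # forward pass over full context
        nxt = sample_next_token(logits, temperature, top_k, rng)
        tokens.append(nxt)               # append and repeat
        if nxt == eos_id:                # stop token ends generation
            break
    return tokens

# Toy stand-in for a trained model: vocabulary {0..4}, token 4 = EOS.
# It prefers (last_token + 1) mod 4, then EOS once the sequence is long.
def toy_model(tokens):
    logits = np.full(5, -10.0)
    logits[(tokens[-1] + 1) % 4] = 2.0
    logits[4] = 3.0 if len(tokens) >= 5 else -10.0
    return logits

print(autoregressive_decode(toy_model, prompt=[0], eos_id=4))
# → [0, 1, 2, 3, 0, 4]
```

Note how the same loop serves both regimes: `temperature=0.0` gives reproducible greedy output, while a positive temperature with `top_k` yields stochastic but constrained sampling.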
Deterministic greedy decoding suits applications requiring reproducibility, while diverse sampling improves creative tasks. This adaptability has made AR decoding the default choice for commercial deployments.

===== Limitations and Sequential Bottleneck =====

The primary limitation of AR decoding is the **sequential bottleneck**: tokens must be generated one at a time, making inference latency proportional to output length. Generating 100 tokens requires 100 forward passes through the model, each incurring the full per-step overhead. This creates latency problems for real-time applications and limits throughput in high-concurrency scenarios (([[https://arxiv.org/abs/2211.17192|Leviathan et al. - Fast Inference from Transformers via Speculative Decoding (2023)]])).

The constraint becomes particularly acute in long-context scenarios, where attention computation scales quadratically with sequence length. For applications requiring outputs of a thousand tokens or more, inference time can become prohibitive even with modern acceleration techniques. Additional limitations include:

  * **Exposure bias**: models trained with teacher forcing (always conditioning on ground truth) may perform suboptimally at inference time, when they must condition on their own generated tokens
  * **Greedy decoding suboptimality**: token-level greedy selection does not guarantee globally optimal sequences
  * **Context window limitations**: finite context windows restrict the model's ability to maintain coherence over extremely long documents

===== Recent Research Directions =====

Recent work addresses AR decoding limitations through alternative and complementary approaches. **Speculative decoding** uses a smaller draft model to propose multiple tokens, which the larger model then verifies in parallel, achieving 2-3x speedups while maintaining output quality (([[https://arxiv.org/abs/2211.17192|Leviathan et al. - Fast Inference from Transformers via Speculative Decoding (2023)]])). **Non-autoregressive and parallel decoding** methods generate multiple tokens simultaneously, though they typically require specialized training and involve quality trade-offs. **Mixture-of-Depths** and token-pruning approaches reduce redundant computation by allowing models to skip processing for less critical tokens.

===== See Also =====

  * [[sequential_vs_parallel_generation|Sequential vs Parallel Token Generation]]
  * [[dllms_vs_autoregressive_methods|dLLMs vs Autoregressive Methods]]
  * [[rag_in_ai|Retrieval-Augmented Generation (RAG) in AI]]

===== References =====