====== Markovian RSA ======

**Markovian RSA** is a test-time inference optimization technique designed to reduce computational requirements during model inference while maintaining performance quality. The method was introduced as a key component of Zyphra's ZAYA1-8B model, which keeps fewer than 1 billion parameters active despite its 8-billion-parameter architecture (([[https://news.smol.ai/issues/26-05-07-not-much/|AI News - Zyphra ZAYA1-8B Model Overview (2026)]])).

===== Overview and Purpose =====

Markovian RSA addresses a fundamental challenge in large language model deployment: the computational cost of inference scales linearly with model size, creating bottlenecks for real-time applications and resource-constrained environments. Rather than reducing model capacity or accepting higher latency, Markovian RSA operates at test time to route computation intelligently, activating only the parameters needed for a given inference step (([[https://news.smol.ai/issues/26-05-07-not-much/|AI News - Zyphra ZAYA1-8B Model Overview (2026)]])). This approach preserves the benefits of larger models, including better contextual understanding and reasoning capability, while achieving efficiency comparable to substantially smaller models.

===== Technical Framework =====

The method employs reinforcement learning at scale to determine which parameters should be active for specific input tokens and computational steps. The "Markovian" designation indicates that activation decisions depend primarily on the current state rather than on the entire sequence history, reducing memory overhead and enabling efficient streaming inference. By keeping fewer than 1 billion of the model's 8 billion parameters active, the technique achieves roughly 87.5% parameter sparsity during typical inference workloads.
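The cited article does not document ZAYA1-8B's actual routing mechanics, but the state-only activation decision can be illustrated with a minimal sketch. All names, shapes, and the top-k selection over hypothetical parameter blocks below are assumptions for illustration:

```python
import numpy as np

def markovian_route(hidden_state, router_weights, k=8):
    """Select which parameter blocks to activate for the CURRENT step.
    Per the Markovian assumption, the decision conditions only on the
    current hidden state, not on the sequence history. (The block
    granularity and top-k rule are illustrative assumptions.)"""
    logits = router_weights @ hidden_state        # one score per block
    mask = np.zeros(len(logits), dtype=bool)
    mask[np.argsort(logits)[-k:]] = True          # activate the top-k blocks
    return mask

rng = np.random.default_rng(0)
num_blocks, d_model = 64, 128
mask = markovian_route(rng.standard_normal(d_model),
                       rng.standard_normal((num_blocks, d_model)))
sparsity = 1.0 - mask.mean()   # 1 - 8/64 = 0.875, mirroring ~1B of 8B active
```

Because the mask depends only on the current hidden state, each decoding step can route independently, which is what makes streaming inference cheap under this framing.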
The approach combines parameter routing mechanisms with RL-based optimization to learn which subsets of the model are most valuable for different input patterns. This differs from traditional mixture-of-experts architectures in that the routing policies are learned and dynamic, optimized through reinforcement learning rather than fixed expert-selection mechanisms. The technique operates within a Markovian decision framework in which each token-generation step makes activation choices based on local context (([[https://news.smol.ai/issues/26-05-07-not-much/|AI News - Zyphra ZAYA1-8B Model Overview (2026)]])).

===== Implementation and Applications =====

[[zyphra|Zyphra]]'s implementation in ZAYA1-8B demonstrates practical viability for production inference systems. The model uses Markovian RSA to balance inference latency, memory consumption, and output quality across diverse tasks. Applications include real-time conversational AI, streaming content generation, and edge deployment scenarios where compute resources are limited. The technique enables serving larger models on constrained hardware while keeping response times competitive with smaller, purpose-built models.

The RL-based training of activation policies requires substantial computational resources during model development, but the inference-time benefits justify this upfront investment. The approach appears particularly effective for tasks where different input patterns activate different reasoning pathways, allowing the model to allocate compute adaptively rather than uniformly.

===== Performance Characteristics =====

Under the Markovian RSA approach, inference throughput improves substantially compared to standard dense inference on the full 8-billion-parameter model. Latency reduction comes from two sources: fewer active parameters reduce the computation per token, and the Markovian framework avoids routing overhead that grows with sequence length.
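The source gives no details of the reward design used to train these routing policies, so the following toy sketch stands in for the large-scale RL the article mentions. A REINFORCE-style update on a linear-softmax policy (all block counts, rewards, and hyperparameters are invented for illustration) learns to prefer the routing choice whose quality-minus-compute reward is highest:

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_probs(theta, state):
    """Softmax routing policy over candidate parameter blocks,
    conditioned only on the current state (the Markovian assumption)."""
    logits = theta @ state
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def reinforce_step(theta, state, reward_fn, lr=0.1):
    """One REINFORCE update: sample a block to activate, score it with a
    reward trading quality against compute cost, and shift probability
    toward higher-reward choices. A toy stand-in, not ZAYA1's procedure."""
    probs = policy_probs(theta, state)
    action = rng.choice(len(probs), p=probs)
    # gradient of log pi(action | state) for a linear-softmax policy
    grad = -probs[:, None] * state[None, :]
    grad[action] += state
    return theta + lr * reward_fn(action) * grad

# Toy reward: block 2 yields high quality; each block index adds compute cost.
reward_fn = lambda a: (1.0 if a == 2 else 0.2) - 0.05 * a
theta, state = np.zeros((4, 8)), rng.standard_normal(8)
for _ in range(300):
    theta = reinforce_step(theta, state, reward_fn)
learned = policy_probs(theta, state)   # concentrates on block 2
```

The quality-minus-cost reward shape is what lets such a policy spend compute only where it pays off, which is the adaptive-allocation behavior the section describes.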
Memory bandwidth requirements likewise decrease in proportion to the active parameter count, enabling deployment on lower-tier hardware accelerators (([[https://news.smol.ai/issues/26-05-07-not-much/|AI News - Zyphra ZAYA1-8B Model Overview (2026)]])). The technique maintains model quality by using RL to learn which parameters are most critical for accurate predictions. Rather than relying on static sparsity patterns, the learned routing policies adapt to input content, preserving performance on complex reasoning tasks while exploiting efficiency on simpler patterns.

===== Challenges and Limitations =====

The primary limitation of Markovian RSA is the complexity of training the routing policies through large-scale reinforcement learning. The approach requires careful reward design to balance inference efficiency against output quality, and the learned policies may not generalize well to input distributions substantially different from the training data. Additionally, because token generation is sequential, routing decisions can accumulate errors across long sequences, potentially degrading quality for extended outputs (([[https://news.smol.ai/issues/26-05-07-not-much/|AI News - Zyphra ZAYA1-8B Model Overview (2026)]])). Batch inference performance may also vary depending on whether different inputs trigger similar activation patterns, affecting effective hardware utilization, and careful implementation is needed to avoid dynamic shape changes that create inefficiencies in typical inference frameworks.

===== Related Concepts =====

Markovian RSA builds on established techniques in model compression and conditional computation. It relates to //mixture of experts// architectures that selectively activate different model components, //neural architecture search// methods that optimize which components to use, and //reinforcement learning from human feedback// (RLHF) insofar as RL provides the optimization signal for learning routing policies.
The approach also connects to broader work in //sparse model inference// and //dynamic neural networks// that adapt computation based on input characteristics.

===== See Also =====

  * [[inference_latency_optimization|Inference Latency Optimization]]
  * [[zaya1_8b|ZAYA1-8B]]
  * [[test_time_scaling|Test-Time Scaling]]

===== References =====