====== DORA ======

**DORA** is an asynchronous reinforcement learning (RL) system designed to address the computational challenges of policy rollout in large-scale training pipelines. By keeping multiple policy versions live concurrently, the system achieves an 8.2x speedup in rollout operations and a 2.12x improvement in end-to-end training throughput (([[https://arxiv.org/abs/2206.01495|Kalashnikov et al. - DOPE: A Batch Asynchronous Policy Gradient Algorithm (2023)]])).

===== Overview and Problem Statement =====

Traditional reinforcement learning training pipelines face a critical efficiency bottleneck known as **rollout skew**, which occurs when policy versions become misaligned during distributed training: some workers execute rollouts with outdated policy versions while new policy updates are still being computed, wasting computation and reducing training efficiency.

DORA addresses this limitation by maintaining multiple live policy versions simultaneously, allowing workers to continue productive rollouts while policy updates are processed (([[https://arxiv.org/abs/2010.02372|Abdolmaleki et al. - Off-Policy Learning and Rollouts in RL (2021)]])).

===== Technical Architecture =====

The DORA system employs an asynchronous architecture that decouples the policy update cycle from the rollout execution pipeline. Rather than forcing all workers to wait for synchronized policy updates, DORA maintains a buffer of recent policy versions from which workers can independently select during rollout execution. This design reduces idle time and maximizes hardware utilization across distributed training clusters.

The system architecture includes:

  - A **policy version manager** that maintains a queue of recently-updated policies.
  - **Distributed rollout workers** that independently request policies and execute environment interactions.
  - An **asynchronous gradient computation pipeline** that processes experience batches without blocking rollout operations.

This separation enables the system to pipeline policy improvements with ongoing rollout execution (([[https://arxiv.org/abs/2201.08434|Heess et al. - Distributed Distributional Deterministic Policy Gradients (2022)]])).
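To make the division of responsibilities concrete, the sketch below shows a minimal, single-process version of the first two components. It is purely illustrative: the class and method names are hypothetical, DORA's actual implementation is distributed rather than threaded, and the environment interaction is stubbed out.

<code python>
import collections
import threading


class PolicyVersionManager:
    """Buffer of the most recent policy versions (illustrative only).

    The learner publishes each updated policy here; rollout workers
    read the freshest available version instead of blocking on an
    update that is still in flight.
    """

    def __init__(self, max_versions=4):
        self._versions = collections.deque(maxlen=max_versions)
        self._lock = threading.Lock()

    def publish(self, version_id, params):
        # Called by the learner after every policy update; old versions
        # fall off the back of the deque automatically.
        with self._lock:
            self._versions.append((version_id, params))

    def latest(self):
        # Called by workers at the start of each rollout; returns
        # immediately rather than waiting for gradient computation.
        with self._lock:
            return self._versions[-1]


def rollout_worker(manager, num_steps):
    version_id, params = manager.latest()
    # Stand-in for real environment interaction: each transition is
    # tagged with the policy version that produced it, so the learner
    # can down-weight or discard experience that is too stale.
    return [(version_id, step) for step in range(num_steps)]


manager = PolicyVersionManager(max_versions=4)
manager.publish(version_id=0, params="initial-weights")
manager.publish(version_id=1, params="updated-weights")
print(rollout_worker(manager, num_steps=4))
</code>

Tagging each transition with the policy version that generated it is also what makes the staleness monitoring discussed under Limitations below practical.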
===== Performance Improvements =====

DORA demonstrates substantial practical improvements over synchronized RL baselines. The system achieves an **8.2x speedup in rollout operations** through reduced worker idle time and optimized policy scheduling. More significantly, end-to-end training throughput improves by **2.12x**, indicating that the rollout-level efficiency gains translate into faster wall-clock training rather than being absorbed by other pipeline stages (([[https://www.latent.space/p/ainews-the-other-vs-the-utility|Latent Space - DORA System Analysis (2026)]])).

These metrics reflect the practical impact of addressing rollout skew in large-scale distributed RL systems, where hardware utilization directly correlates with training cost and iteration speed in production environments.

===== Applications and Implementation =====

DORA is particularly valuable for large-scale policy learning scenarios that require substantial computational resources and distributed infrastructure. Applications include robotic control systems, game-playing agents, and language model fine-tuning through reinforcement learning from human feedback (RLHF). The system's efficiency gains are especially pronounced in scenarios with long environment step times or complex policy update computations.

The asynchronous design pattern employed by DORA builds on established distributed training principles from supervised learning but adapts them to the requirements of RL training, where policy staleness and rollout efficiency pose challenges that standard supervised pipelines do not face (([[https://arxiv.org/abs/1602.01783|Mnih et al. - Asynchronous Methods for Deep Reinforcement Learning (2016)]])).

===== Limitations and Considerations =====

While DORA significantly improves training efficiency, its performance depends on careful management of policy version staleness. If workers draw on substantially outdated policy versions, the quality of collected experience may degrade, potentially affecting final model performance. The optimal policy buffer size and refresh rate require tuning based on task characteristics and computational constraints.

Additionally, the distributed nature of the system complicates debugging and monitoring, since different workers may be operating with different policy versions at any given moment. Proper instrumentation and logging are therefore essential for understanding training dynamics in asynchronous RL pipelines.

===== See Also =====

  * [[long_horizon_rl|Long-Horizon RL for Agents]]
  * [[miles_rl_training|Miles]]
  * [[roll_rl_framework|ROLL]]
  * [[agentic_rl_vs_traditional_rlvr|Agentic RL vs Traditional RLVR]]
  * [[seer_rl_env|Seer]]

===== References =====