Load balancing is a critical infrastructure component for distributing incoming requests across multiple servers or pods in a cluster. Two primary approaches—round-robin and power-of-two choices—represent fundamentally different strategies for request routing, with significant implications for system performance under high load. Understanding the tradeoffs between these algorithms is essential for designing inference platforms and services that maintain consistent latency and throughput at scale.
Round-robin load balancing is the simplest and most widely deployed load balancing algorithm. It distributes incoming requests sequentially across available backend servers or pods in a cyclic manner, assigning the next request to the next server in the list regardless of current server state 1).
The algorithm operates with O(1) time complexity and minimal computational overhead, making it extremely efficient to implement. Round-robin requires no monitoring of individual server load or response times—only a simple counter that cycles through the available pool of servers. This simplicity made round-robin the default choice in many systems, including Kubernetes' native service load balancing 2).
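The cycling behavior described above can be sketched in a few lines of Python. This is a minimal illustration only; the class name and the pod names are placeholders, not part of any real system:

```python
import itertools

class RoundRobinBalancer:
    """Cycle through backends in order, ignoring their current load."""
    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        # O(1): just advance the cyclic counter.
        return next(self._cycle)

lb = RoundRobinBalancer(["pod-a", "pod-b", "pod-c"])
print([lb.pick() for _ in range(5)])
# ['pod-a', 'pod-b', 'pod-c', 'pod-a', 'pod-b']
```

Note that the balancer never inspects the backends themselves; the only state is the position in the cycle, which is what makes the algorithm both cheap and blind to load.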
However, round-robin's stateless nature creates a critical limitation: it assumes all backend servers have equal capacity and are equally available. In practice, server processing times vary due to cache effects, CPU scheduling, network latency, and request complexity variations. When some servers become slower than others—either temporarily or due to uneven workload patterns—round-robin continues distributing requests equally, creating hotspots where slower servers accumulate requests while faster servers remain underutilized.
Power-of-two-choices load balancing addresses round-robin's limitations by introducing minimal state awareness while maintaining computational efficiency. For each incoming request, the algorithm samples two candidate backend servers at random from the available pool and routes the request to the server with fewer currently active requests 3).
This approach combines the speed of round-robin with load-aware routing. By comparing only two randomly selected servers rather than evaluating all servers, the algorithm maintains O(1) complexity while providing significant load balancing improvements. The mathematical principle underlying power-of-two choices relies on the “power of two random choices”—a probabilistic result showing that random selection of two items and choosing the better one dramatically outperforms random selection alone.
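The "power of two random choices" result is straightforward to reproduce empirically with a balls-into-bins experiment. The sketch below is a hypothetical simulation (function name, parameters, and seed are illustrative): each ball samples some number of bins uniformly at random and lands in the least loaded of its samples. With one choice the maximum load grows like log n / log log n; with two choices it drops to roughly log log n.

```python
import random

def max_load(n_bins, n_balls, choices, seed=1):
    """Throw n_balls into n_bins; each ball samples `choices` bins uniformly
    at random and lands in the least-loaded one. Returns the fullest bin."""
    rng = random.Random(seed)
    bins = [0] * n_bins
    for _ in range(n_balls):
        candidates = [rng.randrange(n_bins) for _ in range(choices)]
        target = min(candidates, key=lambda i: bins[i])
        bins[target] += 1
    return max(bins)

print(max_load(10_000, 10_000, choices=1))  # purely random: noticeably larger
print(max_load(10_000, 10_000, choices=2))  # two choices: dramatically smaller
```

Running this with a fixed seed shows the second sample alone collapsing the worst-case imbalance, which is the whole leverage of the technique.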
The practical implementation requires tracking active request counts per pod or server, which can be maintained efficiently through simple increment/decrement operations at request initiation and completion. This state overhead remains negligible compared to the load balancing improvements achieved.
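The increment/decrement bookkeeping described above can be sketched as follows. This is an illustrative implementation, not a production one; the class name, pod names, and the acquire/release protocol are assumptions made for the example:

```python
import random

class PowerOfTwoBalancer:
    """Power-of-two-choices routing over a fixed backend pool, tracking
    in-flight requests with simple increment/decrement counters."""
    def __init__(self, backends, rng=None):
        self.active = {b: 0 for b in backends}  # active requests per backend
        self._backends = list(backends)
        self._rng = rng or random.Random()

    def acquire(self):
        # Sample two candidates (with replacement) and route to the one
        # with fewer in-flight requests; ties go to the first sample.
        a = self._rng.choice(self._backends)
        b = self._rng.choice(self._backends)
        chosen = a if self.active[a] <= self.active[b] else b
        self.active[chosen] += 1
        return chosen

    def release(self, backend):
        # Call when the request completes to keep counts accurate.
        self.active[backend] -= 1

lb = PowerOfTwoBalancer(["pod-a", "pod-b", "pod-c"], rng=random.Random(7))
backend = lb.acquire()   # route a request
lb.release(backend)      # ...and mark it finished
```

The only state beyond round-robin's counter is the `active` map, updated in O(1) at request start and completion, which matches the negligible overhead described above.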
The performance differences between these algorithms become pronounced at high query-per-second (QPS) rates. Kubernetes' default round-robin load balancing creates uneven request distribution and persistent hotspots when QPS exceeds 200,000 requests per second 4).
At these scales, tail latency—measured at the 99th percentile (P99)—becomes the dominant performance metric. Round-robin's hotspots cause request queuing on slower servers, spiking tail latencies well above target thresholds. Power-of-two choices prevents these hotspots by dynamically routing requests away from overloaded servers, maintaining more uniform queue depths across the server pool.
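The hotspot effect is visible even in a toy discrete-time queue simulation. The model below is illustrative (ten servers, one draining at half the rate of the others; arrival rate and seed are arbitrary assumptions): round-robin keeps feeding the slow server equally and its backlog grows without bound, while power-of-two choices steers traffic away once its queue deepens.

```python
import random

def simulate(choose, ticks=1000, n=10, arrivals=15, seed=42):
    """Discrete-time queue model: server 0 drains 1 request per tick,
    the other n-1 servers drain 2. Returns the deepest queue at the end."""
    rng = random.Random(seed)
    queues = [0] * n
    state = {"rr": 0}
    for _ in range(ticks):
        for _ in range(arrivals):
            queues[choose(queues, rng, state)] += 1
        for s in range(n):
            queues[s] = max(0, queues[s] - (1 if s == 0 else 2))
    return max(queues)

def round_robin(queues, rng, state):
    i = state["rr"] % len(queues)
    state["rr"] += 1
    return i

def power_of_two(queues, rng, state):
    a, b = rng.randrange(len(queues)), rng.randrange(len(queues))
    return a if queues[a] <= queues[b] else b

print(simulate(round_robin))   # slow server's backlog keeps growing
print(simulate(power_of_two))  # backlog stays small: traffic shifts away
```

The round-robin run ends with a backlog hundreds deep on the slow server; the power-of-two run keeps all queues shallow, which is exactly the uniform-queue-depth behavior that protects P99 latency.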
Production testing at Superhuman and Databricks demonstrated that the power-of-two-choices algorithm is necessary to meet sub-second P99 latency targets in high-throughput inference environments 5).
Selecting between these algorithms depends on deployment scale and latency requirements. Round-robin remains appropriate for moderate-traffic systems (under 50K QPS) where tail latency is less critical and implementation simplicity is valued. The algorithm also performs adequately when backend servers have homogeneous performance characteristics and request processing times are consistent.
Power-of-two choices becomes essential for high-throughput inference platforms, real-time bidding systems, and microservice architectures handling 200K+ QPS. The minimal additional overhead—mainly tracking active request counts and two random selections—is negligible compared to performance gains.
Hybrid approaches also exist. Some systems implement power-of-two choices for initial routing decisions but fall back to round-robin for within-server thread scheduling, balancing sophistication with implementation complexity.
Power-of-two choices assumes accurate, real-time knowledge of server load. In distributed systems with network delays, load information may become stale, reducing algorithm effectiveness. More sophisticated algorithms like weighted least connections or response-time aware routing address these limitations but introduce additional complexity.
Recent work has explored dynamic power-of-k choices, where the number of samples varies based on current system load, and multi-level hierarchical load balancing for geographically distributed systems. These approaches extend the power-of-two principle to additional dimensions while maintaining computational efficiency.