====== Tensor Sequence Parallelism (TSP) ======

**Tensor Sequence Parallelism (TSP)** is a distributed training and inference technique developed by Zyphra that optimizes GPU memory utilization for large language model workloads. TSP combines tensor parallelism and sequence parallelism so that per-GPU peak memory is lower than when the two techniques are stacked one after the other, enabling more efficient scaling across large GPU clusters.(([[https://www.latent.space/p/ainews-the-other-vs-the-utility|Latent Space (2026)]]))

===== Overview and Motivation =====

Distributed training of [[large_language_models|large language models]] requires parallelism strategies that manage computational load and memory constraints across many GPUs. Traditional approaches employ either **tensor parallelism**, which distributes model weights across GPUs, or **sequence parallelism**, which distributes sequence tokens across devices.(([[https://arxiv.org/abs/2205.05198|Li et al. - Sequence Parallelism: Make Transformer Distributed Training Friendly (2022)]]))

TSP is an integrated approach that exploits both parallelism dimensions at once. It reduces peak memory per GPU compared to stacking tensor parallelism and sequence parallelism sequentially, which would incur separate memory overhead for each technique. The saved memory can be spent on larger effective batch sizes, longer context lengths, or deployment on smaller GPU clusters, while maintaining comparable throughput.(([[https://www.latent.space/p/ainews-the-other-vs-the-utility|Latent Space - AI News Coverage (2026)]]))

===== Technical Architecture =====

TSP interleaves tensor-parallel and sequence-parallel operations within the computational graph rather than treating them as sequential stages. In standard tensor parallelism, model parameters are split across devices in a way that requires all-reduce communication in the forward and backward pass of each layer. Sequence parallelism shards activations along the sequence dimension, which introduces its own gather and scatter communication patterns.

The integrated TSP approach coordinates these two parallelism dimensions to minimize intermediate activation storage and communication overhead. Instead of maintaining full activations for both parallelism schemes, TSP reduces redundant buffering by carefully ordering and combining the distributed operations. This design allows the system to avoid storing large intermediate states that would be necessary if tensor parallelism and sequence parallelism were applied in series.(([[https://arxiv.org/abs/2310.00113|Korthikanti et al. - Reducing Activation Recomputation in Large Transformer Models (2023)]]))
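To make this communication pattern concrete, the following is a minimal, single-process NumPy sketch of the conventional tensor-parallel plus sequence-parallel MLP block described by Korthikanti et al., with GPUs simulated as list entries and collectives as plain Python functions. The dimensions and helper names are illustrative assumptions; this shows the baseline pattern that a folded scheme coordinates, not Zyphra's actual TSP implementation.

<code python>
# Single-process simulation of a tensor-parallel + sequence-parallel MLP block.
# "Devices" are list entries; collectives are ordinary functions. Illustrative only.
import numpy as np

TP = 4                          # tensor/sequence-parallel degree (simulated devices)
SEQ, HIDDEN, FFN = 8, 16, 64    # toy dimensions (SEQ and FFN divisible by TP)
rng = np.random.default_rng(0)

def all_gather_seq(shards):
    """All-gather sequence shards: every device reconstructs the full sequence."""
    full = np.concatenate(shards, axis=0)
    return [full.copy() for _ in shards]

def reduce_scatter_seq(partials):
    """Sum partial outputs across devices, then re-shard along the sequence dim."""
    summed = np.sum(partials, axis=0)
    return list(np.split(summed, len(partials), axis=0))

# Full weights, sharded the tensor-parallel way:
W1 = rng.standard_normal((HIDDEN, FFN))     # column-parallel: split output columns
W2 = rng.standard_normal((FFN, HIDDEN))     # row-parallel: split input rows
W1_shards = np.split(W1, TP, axis=1)
W2_shards = np.split(W2, TP, axis=0)

# Activations enter the block sharded along the sequence dimension.
x = rng.standard_normal((SEQ, HIDDEN))
x_shards = list(np.split(x, TP, axis=0))    # each device holds SEQ/TP tokens

# 1) all-gather the sequence shards, 2) column-parallel matmul + ReLU on each
# device's slice of the FFN dimension, 3) row-parallel matmul yields a partial
# sum of the output, 4) reduce-scatter returns to sequence-sharded activations.
gathered = all_gather_seq(x_shards)
partials = [np.maximum(g @ w1, 0.0) @ w2
            for g, w1, w2 in zip(gathered, W1_shards, W2_shards)]
y_shards = reduce_scatter_seq(partials)

# The sharded pipeline matches the unsharded reference computation.
y_ref = np.maximum(x @ W1, 0.0) @ W2
assert np.allclose(np.concatenate(y_shards, axis=0), y_ref)
print("sequence-sharded output matches the unsharded reference")
</code>

In this layout the all-gather/reduce-scatter pair takes the place of the all-reduce used in plain tensor parallelism, so activations outside the block remain sharded along the sequence dimension; the full-size buffers that appear when tensor and sequence parallelism are simply stacked are the kind of redundant overhead a folded scheme like TSP is described as avoiding.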
===== Performance Characteristics =====

Zyphra's implementation of TSP demonstrated strong performance at scale: **173 million tokens per second** of throughput on a cluster of 1,024 MI300X GPUs while maintaining a context window of 128K tokens. Sustaining that rate across a cluster of this size indicates effective hardware utilization and shows that TSP can support very large models with extended context lengths in practice.

Zyphra characterizes TSP as "folded Tensor and Sequence Parallelism," emphasizing the integrated nature of the approach.(([[https://www.latent.space/p/ainews-the-other-vs-the-utility|Latent Space (2026)]])) The memory efficiency gains translate into several practical benefits: reduced per-GPU memory pressure, larger batch sizes during training, longer sequence contexts during inference, and deployment on clusters with smaller per-GPU memory capacity. These characteristics make TSP valuable both for training frontier-scale language models and for serving high-throughput inference workloads.(([[https://www.latent.space/p/ainews-the-other-vs-the-utility|Latent Space - Zyphra Technical Implementation (2026)]]))

===== Applications and Use Cases =====

TSP is primarily designed for training and inference of large language models at scale. It addresses bottlenecks encountered when training models with hundreds of billions of parameters across distributed clusters. By reducing per-GPU memory requirements, TSP enables more flexible cluster configurations and can lower the infrastructure costs associated with large-scale AI workloads.

The 128K context window suggests TSP is designed to support modern long-context language models, which require distributed sequence handling to fit within GPU memory limits. This makes TSP relevant for deploying models that must process extended documents, code repositories, or multi-turn conversation histories.

===== Comparison with Related Techniques =====

TSP differs from earlier parallelism approaches by integrating tensor and sequence parallelism within a unified computational framework, whereas prior work in distributed training typically applied these techniques sequentially or in isolation.(([[https://arxiv.org/abs/2205.05198|Li et al. - Sequence Parallelism: Make Transformer Distributed Training Friendly (2022)]])) Other related approaches include pipeline parallelism, which divides the model into stages placed on different devices, and data parallelism, which replicates the model across devices and splits the batch.

The key observation behind TSP is that tensor and sequence parallelism can be coordinated more tightly than sequential composition allows, avoiding the redundant memory overhead that results from combining both approaches independently.

===== Limitations and Considerations =====

TSP requires careful implementation and coordination across the distributed system. The technique depends on specific hardware configurations and communication patterns, which can limit flexibility in cluster design, and achieving the reported performance figures requires tuning for specific GPU types and network configurations.

Memory savings depend on the model architecture and sequence characteristics; models with markedly different hidden and sequence dimensions may see varying benefits from TSP compared to alternative parallelism strategies. The communication savings likewise depend on efficient collective operations across the GPU cluster, requiring well-optimized communication libraries such as NCCL or RCCL.
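To give a rough sense of the memory and bandwidth magnitudes behind these considerations, the sketch below estimates the per-GPU footprint of a single bf16 activation tensor at a 128K-token context, replicated versus sequence-sharded, together with the standard ring-algorithm traffic of the collectives involved. The hidden size, batch size, and parallel degree are hypothetical values chosen for illustration, not figures from any Zyphra model.

<code python>
# Order-of-magnitude estimate only; model dimensions are hypothetical.
seq_len  = 128 * 1024     # 128K-token context, as in the reported configuration
batch    = 1
hidden   = 8192           # hypothetical hidden size
p        = 8              # tensor/sequence-parallel degree within one group
bytes_el = 2              # bf16

full_tensor_bytes = seq_len * batch * hidden * bytes_el

# Per-GPU footprint of that activation tensor, replicated vs. sequence-sharded.
replicated_per_gpu  = full_tensor_bytes
seq_sharded_per_gpu = full_tensor_bytes / p

# Per-GPU traffic for one collective over that tensor (ring-algorithm costs):
# an all-gather plus reduce-scatter pair moves about as much data as one
# all-reduce, so sequence sharding shrinks resident activations without
# adding bandwidth cost.
all_reduce_bytes     = 2 * (p - 1) / p * full_tensor_bytes
all_gather_bytes     = (p - 1) / p * full_tensor_bytes
reduce_scatter_bytes = (p - 1) / p * full_tensor_bytes

gib = 1024 ** 3
print(f"activation tensor, replicated:       {replicated_per_gpu / gib:.2f} GiB per GPU")
print(f"activation tensor, sequence-sharded: {seq_sharded_per_gpu / gib:.2f} GiB per GPU")
print(f"all-reduce traffic:                  {all_reduce_bytes / gib:.2f} GiB per GPU")
print(f"all-gather + reduce-scatter traffic: {(all_gather_bytes + reduce_scatter_bytes) / gib:.2f} GiB per GPU")
</code>

Whether such savings translate into end-to-end gains still depends on the interconnect and on how well the collectives overlap with compute, which is why the library and topology tuning noted above matters.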
===== Current Status and Future Development =====

TSP represents active research and development in distributed AI systems, with Zyphra demonstrating the technique at production scale in 2026. As language models continue to grow in scale and in context-window requirements, parallelism innovations like TSP are likely to become increasingly important for practical model deployment. Future work may extend TSP to heterogeneous GPU clusters or optimize it further for the architectural patterns of newer transformer variants.

===== See Also =====

  * [[gpu_parallelization|GPU Parallelization]]
  * [[multi_token_prediction|Multi-Token Prediction (MTP)]]
  * [[sglang|SGLang]]

===== References =====