====== Text Generation Inference ======

**Text Generation Inference (TGI)** is an open-source, high-performance inference server developed by [[hugging_face|Hugging Face]] for deploying and serving large language models in production. Built as a hybrid Rust/Python implementation, it features tensor parallelism, continuous batching, token streaming, and Flash Attention integration.((source [[https://github.com/huggingface/text-generation-inference|TGI GitHub Repository]]))

===== Architecture =====

TGI uses a multi-component architecture consisting of a launcher, server, and client tooling, exposing both REST and gRPC APIs:((source [[https://oneuptime.com/blog/post/2026-02-09-huggingface-tgi-kubernetes/view|TGI on Kubernetes - OneUptime]]))

  * **Launcher** -- handles model downloading, sharding, and process management
  * **Server** -- Rust-based HTTP server with Python inference backend
  * **Router** -- distributes incoming requests across model shards
  * **Tensor parallelism** -- distributes model layers across multiple GPUs for large models

Supported models include Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, StableLM, Flan-T5, and others from the Hugging Face Hub.

===== Flash Attention =====

TGI incorporates Flash Attention v2 for faster inference on supported architectures.((source [[https://github.com/huggingface/text-generation-inference|TGI GitHub Repository]])) Flash Attention reduces memory access overhead by:

  * Computing attention in tiles that fit in GPU SRAM
  * Avoiding materialization of the full attention matrix
  * Reducing memory complexity from O(n^2) to O(n) for sequence length n

This optimization applies to models like BLOOM, T5, GPT-NeoX, StarCoder, and Llama.
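The tiling idea above can be illustrated with a simplified, single-head NumPy sketch of the online-softmax trick (an illustration only, not TGI's actual CUDA kernels; the function names, block size, and shapes are invented for the example). It walks over K/V in tiles and keeps only per-query running statistics, so the full n x n score matrix is never materialized:

```python
import numpy as np

def naive_attention(q, k, v):
    """Reference implementation: materializes the full n x n score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, block=16):
    """Flash-style attention sketch: iterate over K/V tiles, keeping only a
    per-query running max (m), softmax normalizer (l), and output accumulator
    (o). Working memory is O(n * d) instead of O(n^2)."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full(n, -np.inf)   # running row-max of the scores
    l = np.zeros(n)           # running softmax normalizer
    o = np.zeros((n, d))      # running (unnormalized) output
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T * scale                  # scores for this tile only
        m_new = np.maximum(m, s.max(axis=-1))
        correction = np.exp(m - m_new)        # rescale previously accumulated stats
        p = np.exp(s - m_new[:, None])
        l = l * correction + p.sum(axis=-1)
        o = o * correction[:, None] + p @ vb
        m = m_new
    return o / l[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 8)) for _ in range(3))
assert np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v))
```

The per-tile rescaling by ''correction'' is what lets the running softmax stay numerically exact while old tiles are discarded; the real Flash Attention kernels apply the same recurrence inside GPU SRAM.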
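The REST API exposed by the server component described in the Architecture section accepts JSON generation requests on the ''/generate'' route. A minimal stdlib-only client sketch, assuming a TGI instance reachable at ''localhost:8080'' (the URL, helper names, and parameter choices here are illustrative assumptions):

```python
import json
import urllib.request

TGI_URL = "http://localhost:8080"  # assumed host:port mapping for your deployment

def build_payload(prompt: str, max_new_tokens: int = 64, temperature: float = 0.7) -> bytes:
    """Encode a generation request in TGI's JSON format:
    an "inputs" string plus a "parameters" object."""
    return json.dumps({
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }).encode("utf-8")

def generate(prompt: str, **params) -> str:
    """POST to /generate and return the generated text (blocking, non-streaming)."""
    req = urllib.request.Request(
        f"{TGI_URL}/generate",
        data=build_payload(prompt, **params),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["generated_text"]

# Example (requires a running server):
#   print(generate("What is tensor parallelism?", max_new_tokens=32))
```

For token streaming, the same payload can be sent to ''/generate_stream'' instead, which returns server-sent events rather than a single JSON response.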
===== Quantization Support =====

TGI natively supports multiple quantization methods to reduce memory footprint:((source [[https://oneuptime.com/blog/post/2026-02-09-huggingface-tgi-kubernetes/view|TGI on Kubernetes - OneUptime]]))

  * **GPTQ** -- post-training quantization with configurable bit widths
  * **AWQ** -- activation-aware weight quantization
  * **GGML** -- llama.cpp-compatible quantized formats
  * **bitsandbytes** -- dynamic quantization (4-bit, 8-bit)

Quantization is toggled via environment variables (''QUANTIZE=gptq'' or ''QUANTIZE=awq'') or command-line flags.

===== Production Deployment =====

TGI is designed for production environments with multiple deployment patterns:((source [[https://oneuptime.com/blog/post/2026-02-09-huggingface-tgi-kubernetes/view|TGI on Kubernetes - OneUptime]]))

**Docker:**

  * Image: ''ghcr.io/huggingface/text-generation-inference:2.0''
  * GPU passthrough with ''%%--gpus all%%''
  * Volume mounts for model caching (''/cache'')
  * Shared memory mounting (''/dev/shm'')

**Kubernetes:**

  * Replicas with GPU node selectors
  * PersistentVolumeClaims for model caching
  * Liveness/readiness probes (startup delay for model download)
  * Environment variables for tuning: ''MAX_CONCURRENT_REQUESTS'', ''MAX_BATCH_TOTAL_TOKENS'', ''MAX_BATCH_PREFILL_TOKENS''

**AWS SageMaker:**

  * Hugging Face Deep Learning Containers with TGI backend
  * Automatic health checks and monitoring
  * ''SM_NUM_GPUS'' for multi-GPU configuration

===== Performance Features =====

  * **Continuous batching** -- dynamically groups requests for maximum GPU utilization
  * **Token streaming** -- real-time response streaming for low-latency chat and RAG applications
  * **Paged attention** -- efficient KV cache management
  * **Prometheus metrics** -- built-in observability for throughput and latency monitoring
  * **Tracing** -- request-level tracing for debugging

===== TGI vs vLLM =====

^ Feature ^ TGI ^ vLLM ^
| Batching | Continuous/dynamic | PagedAttention + continuous |
| Quantization | GPTQ, AWQ, GGML, bitsandbytes | AWQ, GPTQ, SqueezeLLM |
| Attention | Flash Attention v2 | Custom CUDA kernels |
| Deployment | Kubernetes, SageMaker, HF ecosystem | Standalone, Kubernetes |
| Ecosystem | Native HF Hub integration | Framework-agnostic |
| Best For | HF-native production deployments | Raw throughput optimization |

===== See Also =====

  * [[vllm|vLLM]]
  * [[hugging_face|Hugging Face]]
  * [[llama_cpp|llama.cpp]]
  * [[ollama|Ollama]]

===== References =====