Text Generation Inference (TGI) is an open-source, high-performance inference server developed by Hugging Face for deploying and serving large language models in production. Built as a hybrid Rust/Python implementation, it features tensor parallelism, continuous batching, token streaming, and Flash Attention integration.
TGI uses a multi-component architecture consisting of a launcher, server, and client tooling, and it exposes both REST and gRPC APIs.
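The REST API's primary text-generation route is `/generate`, which accepts a JSON body with an `inputs` prompt and a `parameters` object. The sketch below builds such a request body; the URL is an assumption for a local deployment (adjust host and port to your launcher configuration), and the sampling values are illustrative.

```python
import json

# Assumed local endpoint; the container serves on port 80 by default,
# commonly mapped to 8080 on the host.
TGI_URL = "http://localhost:8080/generate"

def build_generate_request(prompt: str, max_new_tokens: int = 64) -> dict:
    """Build the JSON body TGI's /generate route expects."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": 0.7,  # illustrative sampling setting
        },
    }

body = build_generate_request("What is tensor parallelism?")
print(json.dumps(body, indent=2))
```

POSTing this body to `/generate` returns the completion in one response, while the companion `/generate_stream` route streams tokens back as server-sent events.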
Supported models include Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, StableLM, Flan-T5, and others from the Hugging Face Hub.
TGI incorporates Flash Attention v2 for faster inference on supported architectures. Flash Attention reduces memory-access overhead by computing attention in tiles that fit in on-chip SRAM, avoiding materialization of the full attention-score matrix in GPU memory. This optimization applies to models such as BLOOM, T5, GPT-NeoX, StarCoder, and Llama.
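A back-of-the-envelope sketch shows why avoiding that materialization matters. The head count and sequence length below are illustrative (a Llama-7B-like configuration), not TGI measurements:

```python
def attn_matrix_bytes(seq_len: int, n_heads: int, bytes_per_el: int = 2) -> int:
    """Memory for the seq_len x seq_len attention-score matrices (one per head),
    which standard attention materializes in GPU memory (fp16 = 2 bytes)."""
    return n_heads * seq_len * seq_len * bytes_per_el

# Full materialization at a 4k context with 32 heads:
full = attn_matrix_bytes(seq_len=4096, n_heads=32)
print(f"full score matrices: {full / 2**30:.1f} GiB")  # 1.0 GiB

# Flash Attention instead streams over tiles small enough to stay in SRAM;
# the 128-token tile size here is illustrative.
tiled = attn_matrix_bytes(seq_len=128, n_heads=32)
print(f"per-tile working set: {tiled / 2**20:.1f} MiB")  # 1.0 MiB
```

The quadratic full matrix never has to leave fast on-chip memory, which is where the bandwidth savings come from.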
TGI natively supports multiple quantization methods to reduce memory footprint, including GPTQ, AWQ, GGML, and bitsandbytes. Quantization is toggled via environment variables (QUANTIZE=gptq or QUANTIZE=awq) or the equivalent command-line flags.
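The memory savings are easy to estimate. This rough sketch counts only weight storage (ignoring activations and the KV cache), and the 7B parameter count is illustrative:

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-storage footprint in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

n = 7e9  # a 7B-parameter model
print(f"fp16      : {weight_gib(n, 16):.1f} GiB")  # ~13.0 GiB
print(f"int8      : {weight_gib(n, 8):.1f} GiB")   # ~6.5 GiB
print(f"4-bit GPTQ: {weight_gib(n, 4):.1f} GiB")   # ~3.3 GiB
```

A 4-bit scheme like GPTQ or AWQ thus brings a 7B model from two GPUs' worth of fp16 weights down to a size that fits comfortably on a single consumer card, at some cost in accuracy.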
TGI is designed for production environments with multiple deployment patterns:
Docker:

- Official image: `ghcr.io/huggingface/text-generation-inference:2.0`, run with `--gpus all`
- Mount a model cache volume and increase shared memory (`/dev/shm`)

Kubernetes:

- Tune concurrency and batching via `MAX_CONCURRENT_REQUESTS`, `MAX_BATCH_TOTAL_TOKENS`, and `MAX_BATCH_PREFILL_TOKENS`

AWS SageMaker:

- Set `SM_NUM_GPUS` for multi-GPU configuration

TGI vs. vLLM:

| Feature | TGI | vLLM |
|---|---|---|
| Batching | Continuous/dynamic | PagedAttention + continuous |
| Quantization | GPTQ, AWQ, GGML, bitsandbytes | AWQ, GPTQ, SqueezeLLM |
| Attention | Flash Attention v2 | Custom CUDA kernels |
| Deployment | Kubernetes, SageMaker, HF ecosystem | Standalone, Kubernetes |
| Ecosystem | Native HF Hub integration | Framework-agnostic |
| Best For | HF-native production deployments | Raw throughput optimization |
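The batching limits from the Kubernetes notes above can be illustrated with a toy admission check in the spirit of continuous batching. The function, the budget value, and the worst-case accounting below are an illustrative sketch, not TGI's internal scheduler:

```python
MAX_BATCH_TOTAL_TOKENS = 16384  # illustrative token budget (prompt + generated)

def can_admit(running: list[tuple[int, int]],
              new_prompt_len: int, new_max_new: int) -> bool:
    """Admit a new request only if the worst-case token footprint of the
    whole batch stays under the budget, mirroring the role of
    MAX_BATCH_TOTAL_TOKENS."""
    # Each running request may grow to prompt_len + max_new_tokens.
    used = sum(p + g for p, g in running)
    return used + new_prompt_len + new_max_new <= MAX_BATCH_TOTAL_TOKENS

batch = [(1024, 512), (2048, 512)]  # (prompt_len, max_new_tokens) pairs
print(can_admit(batch, 4096, 1024))   # True: fits within the budget
print(can_admit(batch, 12000, 2048))  # False: would exceed it
```

Requests that cannot be admitted wait in a queue; because decoding frees slots one finished request at a time, a continuous-batching server re-runs this check whenever capacity opens up rather than waiting for the whole batch to drain.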