Text Generation Inference

Text Generation Inference (TGI) is an open-source, high-performance inference server developed by Hugging Face for deploying and serving large language models in production. Built as a hybrid Rust/Python implementation, it features tensor parallelism, continuous batching, token streaming, and Flash Attention integration.1)

Architecture

TGI uses a multi-component architecture: a launcher process supervises the model shards, a Rust-based router exposes the client-facing REST API and schedules batches, and Python model servers execute the forward passes, with gRPC used for router-to-shard communication.2)

Supported models include Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, StableLM, Flan-T5, and others from the Hugging Face Hub.
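
The router's REST API can be exercised with nothing more than the standard library. The sketch below targets TGI's documented /generate endpoint; the URL, port, and generation parameters are assumptions for a locally running server.

```python
import json

TGI_URL = "http://localhost:8080/generate"  # assumption: default local deployment

def build_generate_request(prompt: str, max_new_tokens: int = 64) -> dict:
    """Build the JSON body that TGI's /generate endpoint expects."""
    return {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "temperature": 0.7},
    }

payload = build_generate_request("What is continuous batching?")

if __name__ == "__main__":
    # Requires a running TGI instance; uses only the standard library.
    from urllib.request import Request, urlopen
    req = Request(TGI_URL, data=json.dumps(payload).encode(),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        print(json.loads(resp.read())["generated_text"])
```

A streaming variant of the same request goes to /generate_stream and returns server-sent events, one token per event.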

Flash Attention

TGI incorporates Flash Attention v2 for faster inference on supported architectures.3) Flash Attention reduces memory access overhead by computing attention in small tiles that fit in on-chip SRAM, avoiding materialization of the full attention-score matrix in GPU high-bandwidth memory, and fusing the softmax and matrix-multiply steps into a single kernel.

This optimization applies to models like BLOOM, T5, GPT-NeoX, StarCoder, and Llama.
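
The core idea, processing keys and values block by block with a running ("online") softmax so the full score matrix is never stored, can be illustrated in NumPy. This is an educational sketch of the algorithm's arithmetic, not the fused CUDA kernel TGI actually uses.

```python
import numpy as np

def attention_naive(Q, K, V):
    """Reference attention: materializes the full score matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def attention_tiled(Q, K, V, block=16):
    """Same result, but K/V are consumed in tiles with an online softmax."""
    d = Q.shape[-1]
    O = np.zeros((Q.shape[0], V.shape[-1]))   # running (unnormalized) output
    m = np.full((Q.shape[0], 1), -np.inf)     # running row maximum
    l = np.zeros((Q.shape[0], 1))             # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d)             # scores for this tile only
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        P = np.exp(S - m_new)
        scale = np.exp(m - m_new)             # rescale previous partial sums
        l = l * scale + P.sum(axis=-1, keepdims=True)
        O = O * scale + P @ Vb
        m = m_new
    return O / l

# Small demo: both paths agree.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 4)) for _ in range(3))
assert np.allclose(attention_naive(Q, K, V), attention_tiled(Q, K, V, block=3))
```

The peak memory of the tiled loop scales with the tile size rather than the sequence length, which is the source of the speedup on memory-bound hardware.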

Quantization Support

TGI natively supports multiple quantization methods to reduce memory footprint, including GPTQ, AWQ, GGML, and bitsandbytes.4)

Quantization is toggled via the launcher's command-line flags (e.g. --quantize gptq) or the corresponding environment variables (QUANTIZE=gptq or QUANTIZE=awq).
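
The memory savings are easy to estimate from first principles: weight memory is roughly parameters × bits-per-parameter. A back-of-the-envelope sketch (weights only; the KV cache and activations add more):

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone (excludes KV cache, activations)."""
    return n_params * bits_per_param / 8 / 1e9

# Illustrative figures for a 7B-parameter model:
fp16 = weight_memory_gb(7e9, 16)  # 14.0 GB at half precision
int4 = weight_memory_gb(7e9, 4)   # 3.5 GB at 4-bit (GPTQ/AWQ-style)
```

Four-bit quantization thus cuts weight memory by roughly 4x versus fp16, which is what lets larger models fit on a single GPU.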

Production Deployment

TGI is designed for production environments with multiple deployment patterns:5)

Docker:
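
A single-GPU launch follows the pattern from the TGI documentation; the image tag and model id below are assumptions to adapt:

```shell
# Sketch of a single-GPU Docker launch; the container serves on port 80 internally.
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id HuggingFaceH4/zephyr-7b-beta
```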

Kubernetes:
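
A minimal Deployment manifest might look like the sketch below; the names, image tag, and model id are assumptions, and a production setup would add probes, a Service, and persistent volumes for the model cache:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi
spec:
  replicas: 1
  selector:
    matchLabels: {app: tgi}
  template:
    metadata:
      labels: {app: tgi}
    spec:
      containers:
        - name: tgi
          image: ghcr.io/huggingface/text-generation-inference:latest
          args: ["--model-id", "HuggingFaceH4/zephyr-7b-beta"]
          ports:
            - containerPort: 80
          resources:
            limits:
              nvidia.com/gpu: 1
```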

AWS SageMaker:
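
Deployment on SageMaker goes through the SageMaker Python SDK's Hugging Face LLM container. The sketch below hedges on specifics: the IAM role and instance type are placeholders, and running it requires AWS credentials.

```python
def tgi_env(model_id: str, num_gpus: int = 1) -> dict:
    """Environment variables the TGI SageMaker container reads."""
    return {"HF_MODEL_ID": model_id, "SM_NUM_GPUS": str(num_gpus)}

env = tgi_env("HuggingFaceH4/zephyr-7b-beta")

if __name__ == "__main__":
    # Requires the sagemaker package and AWS credentials.
    from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
    model = HuggingFaceModel(
        image_uri=get_huggingface_llm_image_uri("huggingface"),
        env=env,
        role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    )
    predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")
```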

Performance Features

TGI vs vLLM

Feature        TGI                                   vLLM
Batching       Continuous/dynamic                    PagedAttention + continuous
Quantization   GPTQ, AWQ, GGML, bitsandbytes         AWQ, GPTQ, SqueezeLLM
Attention      Flash Attention v2                    Custom CUDA kernels
Deployment     Kubernetes, SageMaker, HF ecosystem   Standalone, Kubernetes
Ecosystem      Native HF Hub integration             Framework-agnostic
Best For       HF-native production deployments      Raw throughput optimization

See Also

References