How to Self-Host an LLM

Self-hosting gives you full control over your LLM infrastructure – data stays on your network, per-token costs drop at scale, and you can customize models freely. This guide covers hardware selection, inference engines, deployment, and cost trade-offs.

Hardware Requirements

LLM inference is memory-bandwidth bound, not compute bound. Prioritize VRAM capacity and bandwidth over raw FLOPS.

VRAM by Model Size

Model Size | VRAM (4-bit Quantized) | VRAM (FP16) | System RAM
-----------|------------------------|-------------|-----------
7B         | 8-12 GB                | 14-16 GB    | 32 GB
13B        | 16 GB                  | 26-30 GB    | 32 GB
34B        | 20-24 GB               | 68 GB       | 64 GB
70B        | 24-40 GB               | 140 GB      | 64-128 GB

A 7B parameter model at 4-bit quantization runs comfortably on a 12GB GPU. A 70B model needs 24-40GB depending on quantization level, or can be split across multiple GPUs.
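
The scaling behind these numbers can be sketched with a rough rule of thumb: VRAM in GB ≈ parameters (billions) × bits per weight / 8, plus roughly 20% for runtime overhead. The `vram_gb` helper below is a hypothetical name for illustration; this is a weights-only estimate, and real requirements (as in the table above) climb further as context length and KV cache grow.

```shell
# Rough weights-only VRAM estimate (GB): params (billions) x bits / 8,
# plus ~20% runtime overhead. KV cache at long contexts adds more.
vram_gb() {
  awk -v p="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", p * bits / 8 * 1.2 }'
}
vram_gb 7 4    # 7B at 4-bit  -> 4.2
vram_gb 70 16  # 70B at FP16  -> 168.0
```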

GPU Selection

GPU      | VRAM     | Bandwidth  | Price Range  | Best For
---------|----------|------------|--------------|--------------------------------
RTX 4080 | 16 GB    | 717 GB/s   | $700-900     | Budget: 7B models
RTX 4090 | 24 GB    | 1,008 GB/s | $1,100-1,200 | Sweet spot: 7-34B models
RTX 5090 | 32 GB    | 1,792 GB/s | $2,000+      | Enthusiast: up to 70B quantized
A100     | 40-80 GB | 2,039 GB/s | Cloud only   | Production multi-user serving
H100     | 80 GB    | 3,350 GB/s | Enterprise   | High-throughput production

The RTX 4090 is the consumer sweet spot – 24GB VRAM at high bandwidth for around $1,100 used. For production workloads requiring concurrency, A100 or H100 instances via cloud providers are the standard choice.

Inference Engines

Engine                          | Best For                     | Key Features                                        | GPU Support
--------------------------------|------------------------------|-----------------------------------------------------|---------------------------
Ollama                          | Local development, beginners | One-command setup, auto-quantization                | NVIDIA, AMD, Apple Silicon
vLLM                            | Production serving           | Tensor parallelism, PagedAttention, high throughput | NVIDIA multi-GPU
TGI (Text Generation Inference) | Enterprise serving           | High concurrency, Hugging Face integration          | NVIDIA
llama.cpp                       | Maximum efficiency           | CPU+GPU hybrid, GGUF quantization, low overhead     | All platforms
LocalAI                         | Docker-first API serving     | OpenAI-compatible API, model-agnostic               | GPU passthrough

Ollama is the best starting point for local experimentation. vLLM is the production standard for multi-user serving with its PagedAttention memory management. llama.cpp gives the best performance per watt for single-user inference.

Quantization

Quantization reduces model precision to shrink VRAM requirements with minimal quality loss:

Format | Used By           | Description
-------|-------------------|---------------------------------------------------------
GGUF   | Ollama, llama.cpp | Community standard for CPU/GPU hybrid inference
GPTQ   | vLLM, TGI         | GPU-optimized post-training quantization
AWQ    | vLLM              | Activation-aware quantization, preserves important weights

The Q4_K_M quantization level offers the best balance of quality and size for most use cases. More aggressive quantization (Q2, Q3) saves VRAM but degrades output quality noticeably.
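
The size trade-off is easy to sketch numerically. For a 7B model, weights-only size scales linearly with bit width; this ignores the per-block scales and metadata that K-quant formats carry, so real GGUF files run somewhat larger than these figures:

```shell
# Weights-only size of a 7B model at common quantization widths.
# Real GGUF files are slightly larger due to per-block scale metadata.
for bits in 2 3 4 8 16; do
  awk -v b="$bits" 'BEGIN { printf "Q%d: %.1f GB\n", b, 7e9 * b / 8 / 1e9 }'
done
```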

Docker Deployment

Ollama via Docker

docker run -d --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

# Pull and run a model
docker exec -it ollama ollama run llama3:8b
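
Once the container is running and the model is pulled, the service can be exercised over HTTP. The request below shows the shape of Ollama's generate endpoint (it listens on port 11434 by default); the prompt is just an example:

```shell
# Generate a completion via Ollama's HTTP API on the default port.
# Assumes llama3:8b was already pulled as shown above.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```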

vLLM via Docker

docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai \
  --model meta-llama/Meta-Llama-3-70B \
  --quantization awq \
  --tensor-parallel-size 2

Note: --quantization awq expects an AWQ-quantized checkpoint (for example, a community *-AWQ repository on Hugging Face), and gated meta-llama models require a Hugging Face access token available in the container (e.g. -e HF_TOKEN=...).
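
vLLM exposes an OpenAI-compatible API on the published port. The request below shows the shape of a chat completion call; the "model" field must match whatever name was passed to --model when the server started:

```shell
# Chat completion against vLLM's OpenAI-compatible endpoint.
# The "model" value must match the --model flag used at launch.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-70B",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```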

LocalAI via Docker

docker run -p 8080:8080 --gpus all \
  localai/localai:latest-aio-cuda12

Prerequisites for GPU passthrough: NVIDIA drivers and the NVIDIA Container Toolkit installed on the host (the CUDA toolkit itself ships inside the container images).
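
A quick way to confirm passthrough is wired up correctly is to run nvidia-smi inside a throwaway CUDA container; the image tag here is illustrative, and any recent CUDA base image works:

```shell
# If GPU passthrough is configured, this prints the host's GPU table.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```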

Cost Comparison

Option                   | Monthly Cost                | Pros                         | Cons
-------------------------|-----------------------------|------------------------------|---------------------------
RTX 4090 (home)          | ~$10 electricity            | Full privacy, lowest latency | Upfront $1,800, noise/heat
Cloud GPU (Hetzner 4080) | ~$160/mo                    | No hardware management       | Datacenter latency
API (GPT-4o equivalent)  | $300-500/mo at moderate use | Zero infrastructure          | Data leaves your network

Self-hosting breaks even versus API costs within 3-6 months for heavy usage (>1M tokens/day). The primary advantage beyond cost is data sovereignty – your prompts and responses never leave your network.
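
That break-even claim is easy to sanity-check from the figures above (~$1,800 of hardware, ~$10/mo electricity, $300-500/mo of equivalent API spend). The `breakeven` helper is a hypothetical name for illustration:

```shell
# Months to break even: hardware cost / (monthly API spend - electricity)
breakeven() {
  awk -v hw="$1" -v api="$2" -v elec="$3" 'BEGIN { printf "%.1f\n", hw / (api - elec) }'
}
breakeven 1800 400 10  # -> 4.6 months at $400/mo API spend
breakeven 1800 300 10  # -> 6.2 months at the low end
```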

Model Selection

Model         | Parameters    | Quality            | Notes
--------------|---------------|--------------------|---------------------------------------
Llama 3.1     | 8B / 70B      | Near GPT-4 at 70B  | Best general-purpose open model
Qwen 2.5      | 7B / 72B      | Strong multilingual| Excellent for non-English
Mixtral 8x22B | 141B (sparse) | GPT-4 class        | Mixture of experts, efficient inference
DeepSeek V3   | 671B (sparse) | Frontier class     | Requires multi-GPU

