Self-hosting gives you full control over your LLM infrastructure – data stays on your network, per-token costs drop at scale, and you can customize models freely. This guide covers hardware selection, inference engines, deployment, and cost trade-offs.
LLM inference is memory-bandwidth bound, not compute bound. Prioritize VRAM capacity and bandwidth over raw FLOPS.
| Model Size | VRAM (4-bit Quantized) | VRAM (FP16) | System RAM |
|---|---|---|---|
| 7B | 8-12 GB | 14-16 GB | 32 GB |
| 13B | 16 GB | 26-30 GB | 32 GB |
| 34B | 20-24 GB | 68 GB | 64 GB |
| 70B | 24-40 GB | 140 GB | 64-128 GB |
A 7B-parameter model at 4-bit quantization runs comfortably on a 12 GB GPU. A 70B model needs 24-40 GB depending on quantization level, or it can be split across multiple GPUs.
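The table's figures can be sanity-checked from the parameter count: weights take roughly parameters × bytes-per-weight, plus overhead for the KV cache and activations. A minimal sketch (the ~20% overhead factor is an assumption; actual overhead grows with context length):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 0.2) -> float:
    """Rough VRAM estimate: weight bytes plus ~20% for KV cache/activations."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1e9  # decimal GB

# 7B at 4-bit: ~3.5 GB of weights, ~4.2 GB with overhead -> fits a 12 GB card
print(round(estimate_vram_gb(7, 4), 1))
# 70B at FP16, weights only: 140 GB -> matches the table
print(round(estimate_vram_gb(70, 16, overhead=0.0)))
```

This explains why the table's numbers sit above the raw weight size: runtime needs headroom for context and the serving framework itself.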
| GPU | VRAM | Bandwidth | Price Range | Best For |
|---|---|---|---|---|
| RTX 4080 | 16 GB | 717 GB/s | $700-900 | Budget: 7B models |
| RTX 4090 | 24 GB | 1,008 GB/s | $1,100-1,200 | Sweet spot: 7-34B models |
| RTX 5090 | 32 GB | 1,792 GB/s | $2,000+ | Enthusiast: up to 70B quantized |
| A100 | 40-80 GB | 2,039 GB/s | Cloud only | Production multi-user serving |
| H100 | 80 GB | 3,350 GB/s | Enterprise | High-throughput production |
The RTX 4090 is the consumer sweet spot: 24 GB of VRAM at high bandwidth for around $1,100 on the used market. For production workloads requiring concurrency, A100 or H100 instances via cloud providers are the standard choice.
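Because decode is memory-bandwidth bound, bandwidth directly caps generation speed: each generated token must read every weight from VRAM once, so the theoretical ceiling is bandwidth divided by model size in memory. A rough sketch (real throughput lands well below this ceiling due to overlap inefficiencies and KV-cache reads):

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    """Single-stream decode ceiling: every token streams all weights once."""
    return bandwidth_gb_s / model_gb

# RTX 4090 (1,008 GB/s) running a 7B model at 4-bit (~4 GB of weights)
print(round(max_tokens_per_sec(1008, 4)))   # theoretical ceiling per stream
# Same card with a 24 GB quantized 34B model barely fits and slows sharply
print(round(max_tokens_per_sec(1008, 20)))
```

This is why the table leads with bandwidth rather than FLOPS: doubling compute without doubling bandwidth barely moves single-user token throughput.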
| Engine | Best For | Key Features | GPU Support |
|---|---|---|---|
| Ollama | Local development, beginners | One-command setup, auto-quantization | NVIDIA, AMD, Apple Silicon |
| vLLM | Production serving | Tensor parallelism, PagedAttention, high throughput | NVIDIA multi-GPU |
| TGI (Text Generation Inference) | Enterprise serving | High concurrency, Hugging Face integration | NVIDIA |
| llama.cpp | Maximum efficiency | CPU+GPU hybrid, GGUF quantization, low overhead | All platforms |
| LocalAI | Docker-first API serving | OpenAI-compatible API, model-agnostic | GPU passthrough |
Ollama is the best starting point for local experimentation. vLLM is the production standard for multi-user serving thanks to its PagedAttention memory management. llama.cpp gives the best performance per watt for single-user inference.
Quantization reduces model precision to shrink VRAM requirements with minimal quality loss:
| Format | Used By | Description |
|---|---|---|
| GGUF | Ollama, llama.cpp | Community standard for CPU/GPU hybrid inference |
| GPTQ | vLLM, TGI | GPU-optimized post-training quantization |
| AWQ | vLLM | Activation-aware quantization, preserves important weights |
The Q4_K_M quantization level offers the best balance of quality and size for most use cases. More aggressive quantization (Q2, Q3) saves VRAM but degrades output quality noticeably.
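The size impact of each level follows directly from its effective bits per weight. A sketch using approximate bits-per-weight figures (the exact values vary by model architecture; these are ballpark numbers, not the precise GGUF spec):

```python
# Approximate effective bits per weight for common GGUF levels (assumed values)
BPW = {"Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_K_M": 4.8, "Q6_K": 6.6, "F16": 16.0}

def gguf_size_gb(params_billion: float, level: str) -> float:
    """File size in decimal GB for a model at a given quantization level."""
    return params_billion * 1e9 * BPW[level] / 8 / 1e9

for level in BPW:
    print(f"7B at {level}: {gguf_size_gb(7, level):.1f} GB")
```

For a 7B model, Q4_K_M lands around 4 GB versus 14 GB at F16 — roughly a 3.3× reduction for a modest quality cost, which is why it is the common default.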
```shell
docker run -d --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

# Pull and run a model
docker exec -it ollama ollama run llama3:8b
```
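Once the container is up, Ollama answers on port 11434. A minimal Python sketch against its `/api/generate` endpoint, using only the standard library (the prompt text is illustrative; the model tag matches the one pulled above):

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> bytes:
    """Assemble the JSON body Ollama's /api/generate endpoint expects."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def ask_ollama(model: str, prompt: str) -> str:
    """Send one non-streaming completion request to a local Ollama instance."""
    req = request.Request(OLLAMA_URL, data=build_request(model, prompt),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires the container above to be running):
# print(ask_ollama("llama3:8b", "Why is LLM inference memory-bound?"))
```

Setting `"stream": False` returns one JSON object instead of a stream of chunks, which keeps client code simple for scripting.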
```shell
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai \
  --model meta-llama/Llama-3-70B \
  --quantization awq \
  --tensor-parallel-size 2
```
```shell
docker run -p 8080:8080 --gpus all \
  localai/localai:latest-aio-cuda12
```
Prerequisites for GPU passthrough: NVIDIA drivers, the CUDA toolkit, and the NVIDIA Container Toolkit installed on the host.
| Option | Monthly Cost | Pros | Cons |
|---|---|---|---|
| RTX 4090 (home) | ~$10 electricity | Full privacy, lowest latency | Upfront $1,800, noise/heat |
| Cloud GPU (Hetzner 4080) | ~$160/mo | No hardware management | Datacenter latency |
| API (GPT-4o equivalent) | $300-500/mo at moderate use | Zero infrastructure | Data leaves your network |
Self-hosting breaks even versus API costs within 3-6 months for heavy usage (>1M tokens/day). The primary advantage beyond cost is data sovereignty: your prompts and responses never leave your network.
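The break-even arithmetic is straightforward: divide the upfront hardware cost by the monthly savings over API spend. A sketch using the figures from the table above (the $400/mo API spend is an assumed mid-point of the stated range):

```python
def breakeven_months(hardware_cost: float, monthly_self_cost: float,
                     monthly_api_cost: float) -> float:
    """Months until cumulative API spend would have exceeded hardware + power."""
    savings = monthly_api_cost - monthly_self_cost
    if savings <= 0:
        return float("inf")  # self-hosting never pays off at this usage level
    return hardware_cost / savings

# RTX 4090 rig (~$1,800 upfront, ~$10/mo power) vs ~$400/mo API spend
print(round(breakeven_months(1800, 10, 400), 1))  # within the 3-6 month window
```

The same function also shows the flip side: at light usage, where monthly API spend stays below running costs, the payback period diverges and the API remains cheaper.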
| Model | Parameters | Quality | Notes |
|---|---|---|---|
| Llama 3.1 | 8B / 70B | Near GPT-4 at 70B | Best general-purpose open model |
| Qwen 2.5 | 7B / 72B | Strong multilingual | Excellent for non-English |
| Mixtral 8x22B | 141B (sparse) | GPT-4 class | Mixture of experts, efficient inference |
| DeepSeek V3 | 671B (sparse) | Frontier class | Requires multi-GPU |