====== How to Self-Host an LLM ======

Self-hosting gives you full control over your LLM infrastructure -- data stays on your network, per-token costs drop at scale, and you can customize models freely. This guide covers hardware selection, inference engines, deployment, and cost trade-offs.

===== Hardware Requirements =====

LLM inference is **memory-bandwidth bound**, not compute bound. Prioritize VRAM capacity and bandwidth over raw FLOPS.

=== VRAM by Model Size ===

^ Model Size ^ VRAM (4-bit Quantized) ^ VRAM (FP16) ^ System RAM ^
| 7B | 8-12 GB | 14-16 GB | 32 GB |
| 13B | 16 GB | 26-30 GB | 32 GB |
| 34B | 20-24 GB | 68 GB | 64 GB |
| 70B | 24-40 GB | 140 GB | 64-128 GB |

A 7B-parameter model at 4-bit quantization runs comfortably on a 12 GB GPU. A 70B model needs 24-40 GB depending on quantization level, or can be split across multiple GPUs. ((Source: [[https://www.kunalganglani.com/blog/running-local-llms-2026-hardware-setup-guide/|Kunal Ganglani - Local LLMs Hardware Guide]]))

=== GPU Selection ===

^ GPU ^ VRAM ^ Bandwidth ^ Price Range ^ Best For ^
| RTX 4080 | 16 GB | 717 GB/s | $700-900 | Budget: 7B models |
| RTX 4090 | 24 GB | 1,008 GB/s | $1,100-1,200 | Sweet spot: 7-34B models |
| RTX 5090 | 32 GB | 1,792 GB/s | $2,000+ | Enthusiast: up to 70B quantized |
| A100 | 40-80 GB | 2,039 GB/s | Cloud only | Production multi-user serving |
| H100 | 80 GB | 3,350 GB/s | Enterprise | High-throughput production |

The RTX 4090 is the consumer sweet spot -- 24 GB of VRAM at high bandwidth for around $1,100 used. For production workloads requiring concurrency, A100 or H100 instances via cloud providers are the standard choice.
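The VRAM figures above follow from a simple rule of thumb: weight memory is roughly parameters x bits-per-weight / 8, plus headroom for the KV cache and activations. A minimal sketch (the 20% overhead factor is an illustrative assumption, not a measured value):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights plus ~20% for KV cache and activations."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits = 1 GB
    return weight_gb * overhead

# A 7B model at 4-bit: ~4.2 GB -- fits a 12 GB GPU with room to spare
print(f"7B @ 4-bit:  {estimate_vram_gb(7, 4):.1f} GB")
# A 70B model at FP16: ~168 GB -- needs multiple A100/H100-class GPUs
print(f"70B @ FP16: {estimate_vram_gb(70, 16):.1f} GB")
```

The table's ranges are higher than this weight-only estimate because long contexts grow the KV cache well past 20%; treat the overhead factor as a floor, not a ceiling.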
((Source: [[https://createaiagent.net/self-hosted-llm/|CreateAIAgent - Self-Hosted LLM]]))

===== Inference Engines =====

^ Engine ^ Best For ^ Key Features ^ GPU Support ^
| Ollama | Local development, beginners | One-command setup, auto-quantization | NVIDIA, AMD, Apple Silicon |
| vLLM | Production serving | Tensor parallelism, PagedAttention, high throughput | NVIDIA multi-GPU |
| TGI (Text Generation Inference) | Enterprise serving | High concurrency, Hugging Face integration | NVIDIA |
| llama.cpp | Maximum efficiency | CPU+GPU hybrid, GGUF quantization, low overhead | All platforms |
| LocalAI | Docker-first API serving | OpenAI-compatible API, model-agnostic | GPU passthrough |

**Ollama** is the best starting point for local experimentation. **vLLM** is the production standard for multi-user serving with its PagedAttention memory management. **llama.cpp** gives the best performance per watt for single-user inference. ((Source: [[https://blog.premai.io/self-hosted-llm-guide-setup-tools-cost-comparison-2026/|PremAI Self-Hosted LLM Guide]]))

===== Quantization =====

Quantization reduces model precision to shrink VRAM requirements with minimal quality loss:

^ Format ^ Used By ^ Description ^
| GGUF | Ollama, llama.cpp | Community standard for CPU/GPU hybrid inference |
| GPTQ | vLLM, TGI | GPU-optimized post-training quantization |
| AWQ | vLLM | Activation-aware quantization, preserves important weights |

The ''Q4_K_M'' quantization level offers the best balance of quality and size for most use cases. More aggressive quantization (Q2, Q3) saves VRAM but degrades output quality noticeably.
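To see what each GGUF level means in practice, you can estimate on-disk size from effective bits per weight. The bits-per-weight figures below are rough ballparks assumed for illustration (actual sizes vary by model architecture and quant implementation):

```python
# Approximate effective bits per weight for common GGUF quant levels
# (ballpark assumptions for illustration; real files differ slightly)
BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.8,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

def gguf_size_gb(params_billion: float, quant: str) -> float:
    """Approximate size of a quantized GGUF model file."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

for quant in BITS_PER_WEIGHT:
    print(f"70B {quant:7s} ~ {gguf_size_gb(70, quant):5.1f} GB")
```

Running this shows why a 24 GB card needs Q2/Q3 quantization for a 70B model, while ''Q4_K_M'' variants of 7-13B models fit comfortably.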
((Source: [[https://www.kunalganglani.com/blog/running-local-llms-2026-hardware-setup-guide/|Kunal Ganglani - Local LLMs Hardware Guide]]))

===== Docker Deployment =====

=== Ollama via Docker ===

<code bash>
docker run -d --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

# Pull and run a model
docker exec -it ollama ollama run llama3:8b
</code>

=== vLLM via Docker ===

<code bash>
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai \
  --model meta-llama/Llama-3-70B \
  --quantization awq \
  --tensor-parallel-size 2
</code>

=== LocalAI via Docker ===

<code bash>
docker run -p 8080:8080 --gpus all \
  localai/localai:latest-aio-cuda12
</code>

Prerequisites for GPU passthrough: NVIDIA drivers, the CUDA toolkit, and the NVIDIA Container Toolkit installed on the host. ((Source: [[https://blog.premai.io/self-hosted-llm-guide-setup-tools-cost-comparison-2026/|PremAI Self-Hosted LLM Guide]]))

===== Cost Comparison =====

^ Option ^ Monthly Cost ^ Pros ^ Cons ^
| RTX 4090 (home) | ~$10 electricity | Full privacy, lowest latency | Upfront $1,800, noise/heat |
| Cloud GPU (Hetzner 4080) | ~$160/mo | No hardware management | Datacenter latency |
| API (GPT-4o equivalent) | $300-500/mo at moderate use | Zero infrastructure | Data leaves your network |

Self-hosting breaks even versus API costs within 3-6 months for heavy usage (>1M tokens/day). The primary advantage beyond cost is data sovereignty -- your prompts and responses never leave your network.
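The 3-6 month break-even claim is easy to sanity-check with the figures from the cost table. A quick sketch (plug in your own hardware and usage numbers):

```python
def breakeven_months(hardware_cost: float, self_host_monthly: float,
                     api_monthly: float) -> float:
    """Months until self-hosting hardware pays for itself versus API billing."""
    monthly_savings = api_monthly - self_host_monthly
    if monthly_savings <= 0:
        return float("inf")  # the API stays cheaper at this usage level
    return hardware_cost / monthly_savings

# Figures from the table: $1,800 build, ~$10/mo electricity,
# $400/mo midpoint of the API spend range -- lands inside the 3-6 month window
print(f"{breakeven_months(1800, 10, 400):.1f} months")
```

At light usage (say $50/mo of API calls) the same formula gives a break-even of several years, which is why self-hosting is pitched at heavy users.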
((Source: [[https://createaiagent.net/self-hosted-llm/|CreateAIAgent - Self-Hosted LLM]]))

===== Performance Optimization =====

  * **Use NVMe SSDs** -- model loading from NVMe is 5-10x faster than SATA
  * **Maximize GPU memory bandwidth** -- choose GPUs with higher GB/s over higher TFLOPS
  * **Layer offloading** -- split model layers between GPU and CPU RAM when VRAM is tight
  * **Batch requests** -- vLLM's continuous batching serves multiple users efficiently
  * **Tensor parallelism** -- split large models across multiple GPUs for production throughput

=== Recommended Models ===

^ Model ^ Parameters ^ Quality ^ Notes ^
| Llama 3.1 | 8B / 70B | Near GPT-4 at 70B | Best general-purpose open model |
| Qwen 2.5 | 7B / 72B | Strong multilingual | Excellent for non-English |
| Mixtral 8x22B | 141B (sparse) | GPT-4 class | Mixture of experts, efficient inference |
| DeepSeek V3 | 671B (sparse) | Frontier class | Requires multi-GPU |

((Source: [[https://www.kunalganglani.com/blog/running-local-llms-2026-hardware-setup-guide/|Kunal Ganglani - Local LLMs Hardware Guide]]))

===== See Also =====

  * [[how_to_use_ollama|How to Use Ollama]]
  * [[how_to_fine_tune_an_llm|How to Fine-Tune an LLM]]
  * [[how_to_build_a_chatbot|How to Build a Chatbot]]

===== References =====