====== How to Self-Host an LLM ======

Self-hosting gives you full control over your LLM infrastructure -- data stays on your network, per-token costs drop at scale, and you can customize models freely. This guide covers hardware selection, inference engines, deployment, and cost trade-offs.

===== Hardware Requirements =====

LLM inference is **memory-bandwidth bound**, not compute bound. Prioritize VRAM capacity and bandwidth over raw FLOPS.

=== VRAM by Model Size ===

^ Model Size ^ VRAM (4-bit Quantized) ^ VRAM (FP16) ^ System RAM ^
| 7B | 8-12 GB | 14-16 GB | 32 GB |
| 13B | 16 GB | 26-30 GB | 32 GB |
| 34B | 20-24 GB | 68 GB | 64 GB |
| 70B | 24-40 GB | 140 GB | 64-128 GB |

A 7B-parameter model at 4-bit quantization runs comfortably on a 12 GB GPU. A 70B model needs 24-40 GB depending on quantization level, or can be split across multiple GPUs. ((Source: [[https://www.kunalganglani.com/blog/running-local-llms-2026-hardware-setup-guide/|Kunal Ganglani - Local LLMs Hardware Guide]]))

=== GPU Selection ===

^ GPU ^ VRAM ^ Bandwidth ^ Price Range ^ Best For ^
| RTX 4080 | 16 GB | 717 GB/s | $700-900 | Budget: 7B models |
| RTX 4090 | 24 GB | 1,008 GB/s | $1,100-1,200 | Sweet spot: 7-34B models |
| RTX 5090 | 32 GB | 1,792 GB/s | $2,000+ | Enthusiast: up to 70B quantized |
| A100 | 40-80 GB | 2,039 GB/s | Cloud only | Production multi-user serving |
| H100 | 80 GB | 3,350 GB/s | Enterprise | High-throughput production |

The RTX 4090 is the consumer sweet spot -- 24 GB of VRAM at high bandwidth for around $1,100 used. For production workloads requiring concurrency, A100 or H100 instances via cloud providers are the standard choice.
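The VRAM figures above follow from a simple rule of thumb: weight memory is roughly parameters x bits-per-weight / 8, plus headroom for the KV cache and activations. A minimal sketch (the 20% overhead factor is an illustrative assumption, not a measured value):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights plus ~20% for KV cache and activations."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits = 1 GB
    return weight_gb * overhead

# A 7B model at 4-bit: ~4.2 GB -- fits a 12 GB GPU with room to spare
print(f"7B @ 4-bit:  {estimate_vram_gb(7, 4):.1f} GB")
# A 70B model at FP16: ~168 GB -- needs multiple A100/H100-class GPUs
print(f"70B @ FP16: {estimate_vram_gb(70, 16):.1f} GB")
```

The table's ranges are higher than this weight-only estimate because long contexts grow the KV cache well past 20%; treat the overhead factor as a floor, not a ceiling.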
((Source: [[https://createaiagent.net/self-hosted-llm/|CreateAIAgent - Self-Hosted LLM]]))

===== Inference Engines =====

^ Engine ^ Best For ^ Key Features ^ GPU Support ^
| Ollama | Local development, beginners | One-command setup, auto-quantization | NVIDIA, AMD, Apple Silicon |
| vLLM | Production serving | Tensor parallelism, PagedAttention, high throughput | NVIDIA multi-GPU |
| TGI (Text Generation Inference) | Enterprise serving | High concurrency, Hugging Face integration | NVIDIA |
| llama.cpp | Maximum efficiency | CPU+GPU hybrid, GGUF quantization, low overhead | All platforms |
| LocalAI | Docker-first API serving | OpenAI-compatible API, model-agnostic | GPU passthrough |

**Ollama** is the best starting point for local experimentation. **vLLM** is the production standard for multi-user serving with its PagedAttention memory management. **llama.cpp** gives the best performance per watt for single-user inference. ((Source: [[https://blog.premai.io/self-hosted-llm-guide-setup-tools-cost-comparison-2026/|PremAI Self-Hosted LLM Guide]]))

===== Quantization =====

Quantization reduces model precision to shrink VRAM requirements with minimal quality loss:

^ Format ^ Used By ^ Description ^
| GGUF | Ollama, llama.cpp | Community standard for CPU/GPU hybrid inference |
| GPTQ | vLLM, TGI | GPU-optimized post-training quantization |
| AWQ | vLLM | Activation-aware quantization, preserves important weights |

The ''Q4_K_M'' quantization level offers the best balance of quality and size for most use cases. More aggressive quantization (Q2, Q3) saves VRAM but degrades output quality noticeably.
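To see what each GGUF level means in practice, you can estimate on-disk size from effective bits per weight. The bits-per-weight figures below are rough ballparks assumed for illustration (actual sizes vary by model architecture and quant implementation):

```python
# Approximate effective bits per weight for common GGUF quant levels
# (ballpark assumptions for illustration; real files differ slightly)
BITS_PER_WEIGHT = {
    "Q2_K": 2.6,
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.8,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

def gguf_size_gb(params_billion: float, quant: str) -> float:
    """Approximate size of a quantized GGUF model file."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

for quant in BITS_PER_WEIGHT:
    print(f"70B {quant:7s} ~ {gguf_size_gb(70, quant):5.1f} GB")
```

Running this shows why a 24 GB card needs Q2/Q3 quantization for a 70B model, while ''Q4_K_M'' variants of 7-13B models fit comfortably.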
((Source: [[https://www.kunalganglani.com/blog/running-local-llms-2026-hardware-setup-guide/|Kunal Ganglani - Local LLMs Hardware Guide]]))

===== Docker Deployment =====

=== Ollama via Docker ===

<code bash>
docker run -d --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

# Pull and run a model
docker exec -it ollama ollama run llama3:8b
</code>

=== vLLM via Docker ===

<code bash>
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai \
  --model meta-llama/Llama-3-70B \
  --quantization awq \
  --tensor-parallel-size 2
</code>

=== LocalAI via Docker ===

<code bash>
docker run -p 8080:8080 --gpus all \
  localai/localai:latest-aio-cuda12
</code>

Prerequisites for GPU passthrough: NVIDIA drivers, the CUDA toolkit, and the NVIDIA Container Toolkit installed on the host. ((Source: [[https://blog.premai.io/self-hosted-llm-guide-setup-tools-cost-comparison-2026/|PremAI Self-Hosted LLM Guide]]))

===== Cost Comparison =====

^ Option ^ Monthly Cost ^ Pros ^ Cons ^
| RTX 4090 (home) | ~$10 electricity | Full privacy, lowest latency | Upfront $1,800, noise/heat |
| Cloud GPU (Hetzner 4080) | ~$160/mo | No hardware management | Datacenter latency |
| API (GPT-4o equivalent) | $300-500/mo at moderate use | Zero infrastructure | Data leaves your network |

Self-hosting breaks even versus API costs within 3-6 months for heavy usage (>1M tokens/day). The primary advantage beyond cost is data sovereignty -- your prompts and responses never leave your network.
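The 3-6 month break-even claim is easy to sanity-check with the figures from the cost table. A quick sketch (plug in your own hardware and usage numbers):

```python
def breakeven_months(hardware_cost: float, self_host_monthly: float,
                     api_monthly: float) -> float:
    """Months until self-hosting hardware pays for itself versus API billing."""
    monthly_savings = api_monthly - self_host_monthly
    if monthly_savings <= 0:
        return float("inf")  # the API stays cheaper at this usage level
    return hardware_cost / monthly_savings

# Figures from the table: $1,800 build, ~$10/mo electricity,
# $400/mo midpoint of the API spend range -- lands inside the 3-6 month window
print(f"{breakeven_months(1800, 10, 400):.1f} months")
```

At light usage (say $50/mo of API calls) the same formula gives a break-even of several years, which is why self-hosting is pitched at heavy users.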
((Source: [[https://createaiagent.net/self-hosted-llm/|CreateAIAgent - Self-Hosted LLM]]))

===== Performance Optimization =====

  * **Use NVMe SSDs** -- model loading from NVMe is 5-10x faster than SATA
  * **Maximize GPU memory bandwidth** -- choose GPUs with higher GB/s over higher TFLOPS
  * **Layer offloading** -- split model layers between GPU and CPU RAM when VRAM is tight
  * **Batch requests** -- vLLM's continuous batching serves multiple users efficiently
  * **Tensor parallelism** -- split large models across multiple GPUs for production throughput

=== Recommended Models ===

^ Model ^ Parameters ^ Quality ^ Notes ^
| Llama 3.1 | 8B / 70B | Near GPT-4 at 70B | Best general-purpose open model |
| Qwen 2.5 | 7B / 72B | Strong multilingual | Excellent for non-English |
| Mixtral 8x22B | 141B (sparse) | GPT-4 class | Mixture of experts, efficient inference |
| DeepSeek V3 | 671B (sparse) | Frontier class | Requires multi-GPU |

((Source: [[https://www.kunalganglani.com/blog/running-local-llms-2026-hardware-setup-guide/|Kunal Ganglani - Local LLMs Hardware Guide]]))

===== See Also =====

  * [[how_to_use_ollama|How to Use Ollama]]
  * [[how_to_fine_tune_an_llm|How to Fine-Tune an LLM]]
  * [[how_to_build_a_chatbot|How to Build a Chatbot]]

===== References =====