How to Self-Host an LLM

Self-hosting gives you full control over your LLM infrastructure – data stays on your network, per-token costs drop at scale, and you can customize models freely. This guide covers hardware selection, inference engines, deployment, and cost trade-offs.

Hardware Requirements

LLM inference is memory-bandwidth bound, not compute bound. Prioritize VRAM capacity and bandwidth over raw FLOPS.

VRAM by Model Size

Model Size | VRAM (4-bit Quantized) | VRAM (FP16) | System RAM
-----------|------------------------|-------------|-----------
7B         | 8-12 GB                | 14-16 GB    | 32 GB
13B        | 16 GB                  | 26-30 GB    | 32 GB
34B        | 20-24 GB               | 68 GB       | 64 GB
70B        | 24-40 GB               | 140 GB      | 64-128 GB

A 7B parameter model at 4-bit quantization runs comfortably on a 12GB GPU. A 70B model needs 24-40GB depending on quantization level, or can be split across multiple GPUs.
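
The scaling behind these numbers can be sketched with a rough rule of thumb: VRAM in GB ≈ parameters (billions) × bits per weight / 8, plus roughly 20% for runtime overhead. The `vram_gb` helper below is a hypothetical name for illustration; this is a weights-only estimate, and real requirements (as in the table above) climb further as context length and KV cache grow.

```shell
# Rough weights-only VRAM estimate (GB): params (billions) x bits / 8,
# plus ~20% runtime overhead. KV cache at long contexts adds more.
vram_gb() {
  awk -v p="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", p * bits / 8 * 1.2 }'
}
vram_gb 7 4    # 7B at 4-bit  -> 4.2
vram_gb 70 16  # 70B at FP16  -> 168.0
```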

GPU Selection

GPU      | VRAM     | Bandwidth  | Price Range  | Best For
---------|----------|------------|--------------|--------------------------------
RTX 4080 | 16 GB    | 717 GB/s   | $700-900     | Budget: 7B models
RTX 4090 | 24 GB    | 1,008 GB/s | $1,100-1,200 | Sweet spot: 7-34B models
RTX 5090 | 32 GB    | 1,792 GB/s | $2,000+      | Enthusiast: up to 70B quantized
A100     | 40-80 GB | 2,039 GB/s | Cloud only   | Production multi-user serving
H100     | 80 GB    | 3,350 GB/s | Enterprise   | High-throughput production

The RTX 4090 is the consumer sweet spot – 24GB VRAM at high bandwidth for around $1,100 used. For production workloads requiring concurrency, A100 or H100 instances via cloud providers are the standard choice.

Inference Engines

Engine                          | Best For                     | Key Features                                        | GPU Support
--------------------------------|------------------------------|-----------------------------------------------------|---------------------------
Ollama                          | Local development, beginners | One-command setup, auto-quantization                | NVIDIA, AMD, Apple Silicon
vLLM                            | Production serving           | Tensor parallelism, PagedAttention, high throughput | NVIDIA multi-GPU
TGI (Text Generation Inference) | Enterprise serving           | High concurrency, Hugging Face integration          | NVIDIA
llama.cpp                       | Maximum efficiency           | CPU+GPU hybrid, GGUF quantization, low overhead     | All platforms
LocalAI                         | Docker-first API serving     | OpenAI-compatible API, model-agnostic               | GPU passthrough

Ollama is the best starting point for local experimentation. vLLM is the production standard for multi-user serving with its PagedAttention memory management. llama.cpp gives the best performance per watt for single-user inference.

Quantization

Quantization reduces model precision to shrink VRAM requirements with minimal quality loss:

Format | Used By           | Description
-------|-------------------|---------------------------------------------------------
GGUF   | Ollama, llama.cpp | Community standard for CPU/GPU hybrid inference
GPTQ   | vLLM, TGI         | GPU-optimized post-training quantization
AWQ    | vLLM              | Activation-aware quantization, preserves important weights

The Q4_K_M quantization level offers the best balance of quality and size for most use cases. More aggressive quantization (Q2, Q3) saves VRAM but degrades output quality noticeably.
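
The size trade-off is easy to sketch numerically. For a 7B model, weights-only size scales linearly with bit width; this ignores the per-block scales and metadata that K-quant formats carry, so real GGUF files run somewhat larger than these figures:

```shell
# Weights-only size of a 7B model at common quantization widths.
# Real GGUF files are slightly larger due to per-block scale metadata.
for bits in 2 3 4 8 16; do
  awk -v b="$bits" 'BEGIN { printf "Q%d: %.1f GB\n", b, 7e9 * b / 8 / 1e9 }'
done
```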

Docker Deployment

Ollama via Docker

docker run -d --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

# Pull and run a model
docker exec -it ollama ollama run llama3:8b
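
Once the container is running and the model is pulled, the service can be exercised over HTTP. The request below shows the shape of Ollama's generate endpoint (it listens on port 11434 by default); the prompt is just an example:

```shell
# Generate a completion via Ollama's HTTP API on the default port.
# Assumes llama3:8b was already pulled as shown above.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```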

vLLM via Docker

docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai \
  --model meta-llama/Meta-Llama-3-70B \
  --quantization awq \
  --tensor-parallel-size 2

Note: --quantization awq expects an AWQ-quantized checkpoint (for example, a community *-AWQ repository on Hugging Face), and gated meta-llama models require a Hugging Face access token available in the container (e.g. -e HF_TOKEN=...).
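
vLLM exposes an OpenAI-compatible API on the published port. The request below shows the shape of a chat completion call; the "model" field must match whatever name was passed to --model when the server started:

```shell
# Chat completion against vLLM's OpenAI-compatible endpoint.
# The "model" value must match the --model flag used at launch.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-70B",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```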

LocalAI via Docker

docker run -p 8080:8080 --gpus all \
  localai/localai:latest-aio-cuda12

Prerequisites for GPU passthrough: NVIDIA drivers and the NVIDIA Container Toolkit installed on the host (the CUDA toolkit itself ships inside the container images).
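
A quick way to confirm passthrough is wired up correctly is to run nvidia-smi inside a throwaway CUDA container; the image tag here is illustrative, and any recent CUDA base image works:

```shell
# If GPU passthrough is configured, this prints the host's GPU table.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```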

Cost Comparison

Option                   | Monthly Cost                | Pros                         | Cons
-------------------------|-----------------------------|------------------------------|---------------------------
RTX 4090 (home)          | ~$10 electricity            | Full privacy, lowest latency | Upfront $1,800, noise/heat
Cloud GPU (Hetzner 4080) | ~$160/mo                    | No hardware management       | Datacenter latency
API (GPT-4o equivalent)  | $300-500/mo at moderate use | Zero infrastructure          | Data leaves your network

Self-hosting breaks even versus API costs within 3-6 months for heavy usage (>1M tokens/day). The primary advantage beyond cost is data sovereignty – your prompts and responses never leave your network.
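
That break-even claim is easy to sanity-check from the figures above (~$1,800 of hardware, ~$10/mo electricity, $300-500/mo of equivalent API spend). The `breakeven` helper is a hypothetical name for illustration:

```shell
# Months to break even: hardware cost / (monthly API spend - electricity)
breakeven() {
  awk -v hw="$1" -v api="$2" -v elec="$3" 'BEGIN { printf "%.1f\n", hw / (api - elec) }'
}
breakeven 1800 400 10  # -> 4.6 months at $400/mo API spend
breakeven 1800 300 10  # -> 6.2 months at the low end
```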

Model Selection

Model         | Parameters    | Quality            | Notes
--------------|---------------|--------------------|---------------------------------------
Llama 3.1     | 8B / 70B      | Near GPT-4 at 70B  | Best general-purpose open model
Qwen 2.5      | 7B / 72B      | Strong multilingual| Excellent for non-English
Mixtral 8x22B | 141B (sparse) | GPT-4 class        | Mixture of experts, efficient inference
DeepSeek V3   | 671B (sparse) | Frontier class     | Requires multi-GPU

