Self-hosting gives you full control over your LLM infrastructure – data stays on your network, per-token costs drop at scale, and you can customize models freely. This guide covers hardware selection, inference engines, deployment, and cost trade-offs.
LLM inference is memory-bandwidth bound, not compute bound. Prioritize VRAM capacity and bandwidth over raw FLOPS.
| Model Size | VRAM (4-bit Quantized) | VRAM (FP16) | System RAM |
|---|---|---|---|
| 7B | 8-12 GB | 14-16 GB | 32 GB |
| 13B | 16 GB | 26-30 GB | 32 GB |
| 34B | 20-24 GB | 68 GB | 64 GB |
| 70B | 24-40 GB | 140 GB | 64-128 GB |
A 7B-parameter model at 4-bit quantization runs comfortably on a 12 GB GPU. A 70B model needs 24-40 GB depending on quantization level, or it can be split across multiple GPUs.
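The table's figures can be sanity-checked from the parameter count: weights take roughly parameters × bytes-per-weight, plus overhead for the KV cache and activations. A minimal sketch (the ~20% overhead factor is an assumption; actual overhead grows with context length):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 0.2) -> float:
    """Rough VRAM estimate: weight bytes plus ~20% for KV cache/activations."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1e9  # decimal GB

# 7B at 4-bit: ~3.5 GB of weights, ~4.2 GB with overhead -> fits a 12 GB card
print(round(estimate_vram_gb(7, 4), 1))
# 70B at FP16, weights only: 140 GB -> matches the table
print(round(estimate_vram_gb(70, 16, overhead=0.0)))
```

This explains why the table's numbers sit above the raw weight size: runtime needs headroom for context and the serving framework itself.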
| GPU | VRAM | Bandwidth | Price Range | Best For |
|---|---|---|---|---|
| RTX 4080 | 16 GB | 717 GB/s | $700-900 | Budget: 7B models |
| RTX 4090 | 24 GB | 1,008 GB/s | $1,100-1,200 | Sweet spot: 7-34B models |
| RTX 5090 | 32 GB | 1,792 GB/s | $2,000+ | Enthusiast: up to 70B quantized |
| A100 | 40-80 GB | 2,039 GB/s | Cloud only | Production multi-user serving |
| H100 | 80 GB | 3,350 GB/s | Enterprise | High-throughput production |
The RTX 4090 is the consumer sweet spot: 24 GB of VRAM at high bandwidth for around $1,100 on the used market. For production workloads requiring concurrency, A100 or H100 instances via cloud providers are the standard choice.
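Because decode is memory-bandwidth bound, bandwidth directly caps generation speed: each generated token must read every weight from VRAM once, so the theoretical ceiling is bandwidth divided by model size in memory. A rough sketch (real throughput lands well below this ceiling due to overlap inefficiencies and KV-cache reads):

```python
def max_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    """Single-stream decode ceiling: every token streams all weights once."""
    return bandwidth_gb_s / model_gb

# RTX 4090 (1,008 GB/s) running a 7B model at 4-bit (~4 GB of weights)
print(round(max_tokens_per_sec(1008, 4)))   # theoretical ceiling per stream
# Same card with a 24 GB quantized 34B model barely fits and slows sharply
print(round(max_tokens_per_sec(1008, 20)))
```

This is why the table leads with bandwidth rather than FLOPS: doubling compute without doubling bandwidth barely moves single-user token throughput.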
| Engine | Best For | Key Features | GPU Support |
|---|---|---|---|
| Ollama | Local development, beginners | One-command setup, auto-quantization | NVIDIA, AMD, Apple Silicon |
| vLLM | Production serving | Tensor parallelism, PagedAttention, high throughput | NVIDIA multi-GPU |
| TGI (Text Generation Inference) | Enterprise serving | High concurrency, Hugging Face integration | NVIDIA |
| llama.cpp | Maximum efficiency | CPU+GPU hybrid, GGUF quantization, low overhead | All platforms |
| LocalAI | Docker-first API serving | OpenAI-compatible API, model-agnostic | GPU passthrough |
Ollama is the best starting point for local experimentation. vLLM is the production standard for multi-user serving thanks to its PagedAttention memory management. llama.cpp gives the best performance per watt for single-user inference.
Quantization reduces model precision to shrink VRAM requirements with minimal quality loss:
| Format | Used By | Description |
|---|---|---|
| GGUF | Ollama, llama.cpp | Community standard for CPU/GPU hybrid inference |
| GPTQ | vLLM, TGI | GPU-optimized post-training quantization |
| AWQ | vLLM | Activation-aware quantization, preserves important weights |
The Q4_K_M quantization level offers the best balance of quality and size for most use cases. More aggressive quantization (Q2, Q3) saves VRAM but degrades output quality noticeably.
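The size impact of each level follows directly from its effective bits per weight. A sketch using approximate bits-per-weight figures (the exact values vary by model architecture; these are ballpark numbers, not the precise GGUF spec):

```python
# Approximate effective bits per weight for common GGUF levels (assumed values)
BPW = {"Q2_K": 2.6, "Q3_K_M": 3.9, "Q4_K_M": 4.8, "Q6_K": 6.6, "F16": 16.0}

def gguf_size_gb(params_billion: float, level: str) -> float:
    """File size in decimal GB for a model at a given quantization level."""
    return params_billion * 1e9 * BPW[level] / 8 / 1e9

for level in BPW:
    print(f"7B at {level}: {gguf_size_gb(7, level):.1f} GB")
```

For a 7B model, Q4_K_M lands around 4 GB versus 14 GB at F16 — roughly a 3.3× reduction for a modest quality cost, which is why it is the common default.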
```shell
docker run -d --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

# Pull and run a model
docker exec -it ollama ollama run llama3:8b
```
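Once the container is up, Ollama answers on port 11434. A minimal Python sketch against its `/api/generate` endpoint, using only the standard library (the prompt text is illustrative; the model tag matches the one pulled above):

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(model: str, prompt: str) -> bytes:
    """Assemble the JSON body Ollama's /api/generate endpoint expects."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def ask_ollama(model: str, prompt: str) -> str:
    """Send one non-streaming completion request to a local Ollama instance."""
    req = request.Request(OLLAMA_URL, data=build_request(model, prompt),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires the container above to be running):
# print(ask_ollama("llama3:8b", "Why is LLM inference memory-bound?"))
```

Setting `"stream": False` returns one JSON object instead of a stream of chunks, which keeps client code simple for scripting.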
```shell
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai \
  --model meta-llama/Llama-3-70B \
  --quantization awq \
  --tensor-parallel-size 2
```
```shell
docker run -p 8080:8080 --gpus all \
  localai/localai:latest-aio-cuda12
```
Prerequisites for GPU passthrough: NVIDIA drivers, the CUDA toolkit, and the NVIDIA Container Toolkit installed on the host.
| Option | Monthly Cost | Pros | Cons |
|---|---|---|---|
| RTX 4090 (home) | ~$10 electricity | Full privacy, lowest latency | Upfront $1,800, noise/heat |
| Cloud GPU (Hetzner 4080) | ~$160/mo | No hardware management | Datacenter latency |
| API (GPT-4o equivalent) | $300-500/mo at moderate use | Zero infrastructure | Data leaves your network |
Self-hosting breaks even versus API costs within 3-6 months for heavy usage (>1M tokens/day). The primary advantage beyond cost is data sovereignty: your prompts and responses never leave your network.
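The break-even arithmetic is straightforward: divide the upfront hardware cost by the monthly savings over API spend. A sketch using the figures from the table above (the $400/mo API spend is an assumed mid-point of the stated range):

```python
def breakeven_months(hardware_cost: float, monthly_self_cost: float,
                     monthly_api_cost: float) -> float:
    """Months until cumulative API spend would have exceeded hardware + power."""
    savings = monthly_api_cost - monthly_self_cost
    if savings <= 0:
        return float("inf")  # self-hosting never pays off at this usage level
    return hardware_cost / savings

# RTX 4090 rig (~$1,800 upfront, ~$10/mo power) vs ~$400/mo API spend
print(round(breakeven_months(1800, 10, 400), 1))  # within the 3-6 month window
```

The same function also shows the flip side: at light usage, where monthly API spend stays below running costs, the payback period diverges and the API remains cheaper.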
| Model | Parameters | Quality | Notes |
|---|---|---|---|
| Llama 3.1 | 8B / 70B | Near GPT-4 at 70B | Best general-purpose open model |
| Qwen 2.5 | 7B / 72B | Strong multilingual | Excellent for non-English |
| Mixtral 8x22B | 141B (sparse) | GPT-4 class | Mixture of experts, efficient inference |
| DeepSeek V3 | 671B (sparse) | Frontier class | Requires multi-GPU |