llama.cpp is a lightweight C/C++ inference engine for running large language models locally with minimal dependencies. Built on the GGML tensor library, it prioritizes CPU-first efficiency, broad hardware support, and raw performance across consumer devices.1)
The inference pipeline covers tokenization, a forward pass through GGML computation graphs built from the GGUF model, sampling from the next-token probability distribution, and detokenization. Key capabilities include KV cache management, grammar-constrained generation, speculative decoding (2-3x throughput gains), and multimodal support for vision-language models such as LLaVA and Moondream.2)
The engine requires no Python dependencies or runtime bloat, making it suitable for embedded and edge deployments.
GGUF (GPT-Generated Unified Format) is a compact binary format that bundles model weights, tokenizer, quantization metadata, and configuration into a single self-describing file.3)
Quantization compresses models to 1.5-8 bits per weight, enabling 7B+ parameter LLMs on 4-8GB RAM:4)
| Quantization | Bits | Use Case | ~RAM for 7B Model |
|---|---|---|---|
| Q4_K_M | ~4 | Speed/size balance | ~4GB |
| Q5_K_M | ~5 | Improved accuracy | ~5GB |
| Q8_0 | 8 | Near-FP16 quality | ~7GB |
The K-quant variants (suffixes _S, _M, _L for small/medium/large) group weights into super-blocks with per-block scales and mix higher- and lower-precision quantization across tensors, improving accuracy at a given file size. Hybrid CPU+GPU offloading allows larger models to partially reside in VRAM.
llama.cpp supports hybrid CPU+GPU execution across multiple backends:5)
- CUDA: built with -DGGML_CUDA=ON; supports architectures through Blackwell (sm_121)
Layers can be split between CPU and GPU at load time to fit VRAM limits using the -ngl (--n-gpu-layers) flag.
llama-server provides a production-ready HTTP server for streaming inference:6)
- --hf-repo flag for pulling models directly from the Hugging Face Hub