llama.cpp is a lightweight C/C++ inference engine for running large language models locally with minimal dependencies. Built on the GGML tensor library, it prioritizes CPU-first efficiency, broad hardware support, and raw performance across consumer devices.1)
The inference pipeline covers tokenization, a forward pass through GGML computation graphs built from the GGUF model, sampling from the next-token probability distribution, and detokenization. Key capabilities include KV cache management, grammar-constrained generation, speculative decoding (2-3x throughput gains), and multimodal support for vision-language models such as LLaVA and Moondream.2)
The engine requires no Python dependencies or runtime bloat, making it suitable for embedded and edge deployments.
GGUF (GPT-Generated Unified Format) is a compact binary format that bundles model weights, tokenizer, quantization metadata, and configuration into a single self-describing file.3)
Quantization compresses models to 1.5-8 bits per weight, enabling 7B+ parameter LLMs on 4-8GB RAM:4)
| Quantization | Bits | Use Case | ~RAM for 7B Model |
|---|---|---|---|
| Q4_K_M | ~4 | Speed/size balance | ~4GB |
| Q5_K_M | ~5 | Improved accuracy | ~5GB |
| Q8_0 | 8 | Near-FP16 quality | ~7GB |
The K-quant variants (suffixes _S, _M, _L for small/medium/large) group weights into super-blocks with per-block scales and mix higher- and lower-precision quantization across tensors, improving accuracy at a given file size. Hybrid CPU+GPU offloading allows larger models to partially reside in VRAM.
llama.cpp supports hybrid CPU+GPU execution across multiple backends:5)
- CUDA: built with -DGGML_CUDA=ON; supports architectures through Blackwell (sm_121)
Layers can be split between CPU and GPU at load time to fit VRAM limits using the -ngl (--n-gpu-layers) flag.
llama-server provides a production-ready HTTP server for streaming inference:6)
- --hf-repo flag for pulling models directly from the Hugging Face Hub