llama.cpp

llama.cpp is a lightweight C/C++ inference engine for running large language models locally with minimal dependencies. Built on the GGML tensor library, it prioritizes CPU-first efficiency, broad hardware support, and raw performance across consumer devices.1)

Architecture

The inference pipeline comprises tokenization, a forward pass over the GGML computation graph, sampling from the next-token probability distribution, and detokenization. Key capabilities include KV-cache management, grammar-constrained generation, speculative decoding (2-3x throughput gains), and multimodal support for vision-language models such as LLaVA and Moondream.2)
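The four stages above can be sketched with a toy vocabulary. This is a conceptual illustration only, not the real llama.cpp API: the vocabulary, the stand-in "model", and all function names are invented for demonstration.

```python
# Conceptual sketch of the four pipeline stages (toy vocabulary; the
# "forward" function is a stand-in for the real GGML graph evaluation).

VOCAB = {"<eos>": 0, "hello": 1, "world": 2}
INV = {i: t for t, i in VOCAB.items()}

def tokenize(text):                 # stage 1: text -> token ids
    return [VOCAB[w] for w in text.split()]

def forward(tokens):                # stage 2: stand-in for the forward pass
    # A real forward pass returns logits over the vocabulary; here we
    # return a fixed distribution depending on the last token.
    return [0.1, 0.2, 0.7] if tokens[-1] == 1 else [0.9, 0.05, 0.05]

def sample(logits):                 # stage 3: greedy next-token choice
    return max(range(len(logits)), key=logits.__getitem__)

def detokenize(tokens):             # stage 4: token ids -> text
    return " ".join(INV[t] for t in tokens)

tokens = tokenize("hello")
while True:
    nxt = sample(forward(tokens))
    if nxt == VOCAB["<eos>"]:       # generation stops at end-of-sequence
        break
    tokens.append(nxt)
print(detokenize(tokens))           # -> hello world
```

A real engine repeats the forward/sample loop with a KV cache so each step only processes the newest token rather than the whole sequence.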

The engine requires no Python dependencies or runtime bloat, making it suitable for embedded and edge deployments.

GGUF Model Format

GGUF (GPT-Generated Unified Format) is a compact binary format that bundles model weights, tokenizer, quantization metadata, and configuration into a single self-contained file.3)
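A GGUF file starts with a small fixed-size little-endian header: a 4-byte magic, a uint32 format version, a uint64 tensor count, and a uint64 metadata key/value count. The sketch below reads just that header from a synthetic buffer; everything after it (the metadata entries and tensor data) is omitted, and the field values packed here are invented for illustration.

```python
import struct

# Minimal sketch of the fixed GGUF header (magic, version, tensor count,
# metadata kv count). The variable-length metadata and tensor sections
# that follow the header are not parsed here.

def read_gguf_header(buf: bytes) -> dict:
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", buf, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Synthetic 24-byte header standing in for the start of a real model file.
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(read_gguf_header(header))   # -> {'version': 3, 'tensors': 291, 'metadata_kv': 24}
```

Because everything a loader needs sits behind this one header, a model can be distributed and memory-mapped as a single file.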

Quantization Methods

Quantization compresses models to 1.5-8 bits per weight, enabling 7B+ parameter LLMs to run in 4-8 GB of RAM:4)

Quantization | Bits | Use Case           | ~RAM for 7B model
Q4_K_M       | ~4   | Speed/size balance | ~4 GB
Q5_K_M       | ~5   | Improved accuracy  | ~5 GB
Q8_0         | 8    | Near-FP16 quality  | ~7 GB

The K-quant variants (K_S, K_M, K_L) use super-block quantization with per-block scale factors; the suffixes denote small, medium, and large mixes of quantization types across the model's tensors. Hybrid CPU+GPU offloading allows larger models to partially reside in VRAM.
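The RAM column above is essentially parameter count times bits per weight. A back-of-the-envelope estimate, using approximate effective bits-per-weight figures (slightly above the nominal bit width because of per-block scales; the exact values depend on the tensor mix and real files add tokenizer and metadata overhead):

```python
# Rough weight-memory estimate: n_params * bits_per_weight / 8 bytes.
# The bpw figures below are approximations, not measured file sizes.

def weight_gib(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 2**30

for name, bpw in [("Q4_K_M", 4.5), ("Q5_K_M", 5.5), ("Q8_0", 8.5)]:
    print(f"{name}: ~{weight_gib(7e9, bpw):.1f} GiB for a 7B model")
```

The results land near the table's ~4/~5/~7 GB figures, which is why a 7B model at 4-bit quantization fits comfortably on an 8 GB machine.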

Metal and CUDA GPU Acceleration

llama.cpp supports hybrid CPU+GPU execution across multiple backends, including CUDA for NVIDIA GPUs, Metal for Apple Silicon, and Vulkan for cross-vendor GPU support, alongside plain CPU execution.5)

Layers can be split between CPU and GPU to fit within VRAM limits using the -ngl (--n-gpu-layers) flag, which sets how many model layers are offloaded to the GPU.
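Choosing a value for -ngl is simple division: free VRAM over estimated per-layer size. The sketch below illustrates that arithmetic; the 32-layer count and ~110 MiB per-layer figure are illustrative guesses for a 7B 4-bit model, not measurements.

```python
# Hedged sketch of the arithmetic behind picking -ngl: how many of the
# model's layers fit in the VRAM we are willing to use?

def pick_ngl(n_layers: int, layer_mib: float, free_vram_mib: float) -> int:
    fit = int(free_vram_mib // layer_mib)
    return min(fit, n_layers)     # never request more layers than exist

# Illustrative numbers: 32 layers, ~110 MiB each, 2 GiB of free VRAM.
ngl = pick_ngl(n_layers=32, layer_mib=110.0, free_vram_mib=2048.0)
print(f"-ngl {ngl}")              # remaining layers stay on the CPU
```

In practice the KV cache and activation buffers also consume VRAM, so leaving headroom below the theoretical maximum avoids out-of-memory failures mid-generation.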

Server Mode

llama-server provides a production-ready HTTP server for streaming inference, exposing a native /completion endpoint alongside an OpenAI-compatible API.6)
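A request to the native /completion endpoint is a small JSON body. The sketch below only constructs that body; the prompt text is invented, and the server's README documents the full parameter list beyond the common fields shown here.

```python
import json

# Build a request body for llama-server's /completion endpoint.
payload = {
    "prompt": "Explain GGUF in one sentence.",
    "n_predict": 64,     # cap on the number of generated tokens
    "stream": True,      # stream partial results instead of one response
}
body = json.dumps(payload)
print(body)
# Sent with any HTTP client, e.g.:
#   curl http://localhost:8080/completion -d '<body>'
```

Because the interface is plain HTTP with JSON, any language's standard HTTP client can drive local inference without language-specific bindings.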

Ecosystem

See Also

References