====== llama.cpp ======

**llama.cpp** is a lightweight C/C++ inference engine for running large language models locally with minimal dependencies. Built on the GGML tensor library, it prioritizes CPU-first efficiency, broad hardware support, and raw performance across consumer devices.((source [[https://github.com/ggml-org/llama.cpp|llama.cpp GitHub Repository]]))

===== Architecture =====

The inference pipeline includes tokenization, a forward pass through GGML computation graphs, sampling over next-token probabilities, and detokenization. Key capabilities include KV cache management, grammar-constrained generation, speculative decoding (2-3x throughput gains), and multimodal support for vision-language models like LLaVA and Moondream.((source [[https://www.sandgarden.com/learn/llama-cpp|llama.cpp Overview - Sandgarden]]))

The engine requires no Python dependencies or runtime bloat, making it suitable for embedded and edge deployments.

===== GGUF Model Format =====

**GGUF** (GPT-Generated Unified Format) is a compact binary format that bundles model weights, tokenizer, quantization metadata, and configuration into a single file.((source [[https://pyimagesearch.com/2024/08/26/llama-cpp-the-ultimate-guide-to-efficient-llm-inference-and-applications/|llama.cpp Ultimate Guide - PyImageSearch]]))

Key properties:

  * **Memory mapping** -- enables rapid loading via mmap, ideal for edge devices where I/O is a bottleneck
  * **Self-contained** -- all metadata embedded, no external config files needed
  * **Architecture-agnostic** -- supports models beyond LLaMA, including custom tokenizers
  * **Instant startup** -- fast cold-start times compared to framework-based loading

===== Quantization Methods =====

Quantization compresses models to 1.5-8 bits per weight, enabling 7B+ parameter LLMs to run in 4-8GB of RAM:((source [[https://www.decodesfuture.com/articles/llama-cpp-gguf-quantization-guide-2026|GGUF Quantization Guide 2026]]))

^ Quantization ^ Bits ^ Use Case ^ ~RAM for 7B Model ^
| Q4_K_M | ~4 | Speed/size balance | ~4GB |
| Q5_K_M | ~5 | Improved accuracy | ~5GB |
| Q8_0 | 8 | Near-FP16 quality | ~7GB |

The K-quant variants (K_S, K_M, K_L) use super-block quantization with per-block scales and minimums for better accuracy at a given size. Hybrid CPU+GPU offloading allows larger models to partially reside in VRAM.

===== Metal and CUDA GPU Acceleration =====

llama.cpp supports hybrid CPU+GPU execution across multiple backends:((source [[https://www.sandgarden.com/learn/llama-cpp|llama.cpp - Sandgarden]]))

  * **CUDA (NVIDIA)** -- enabled with ''-DGGML_CUDA=ON'', supports architectures through Blackwell (sm_121)
  * **Metal (Apple Silicon)** -- native integration for M-series chips via Apple's Metal API
  * **Vulkan** -- cross-platform GPU backend, highlighted at Vulkanised 2026
  * **SYCL (Intel)** -- oneAPI backend for Intel GPUs
  * **ROCm (AMD)** -- AMD GPU support
  * **OpenCL** -- legacy cross-platform support

Layers can be split between CPU and GPU at load time to fit VRAM limits using the ''-ngl'' flag.

===== Server Mode =====

**llama-server** provides a production-ready HTTP server for streaming inference:((source [[https://www.sandgarden.com/learn/llama-cpp|llama.cpp - Sandgarden]]))

  * Configurable threading and context size
  * Real-time token streaming for latency-sensitive applications
  * Sampling parameter control (temperature, top-k, top-p)
  * Chat and completion endpoints
  * Embedding generation support

===== Ecosystem =====

  * **C API** -- stable interface for custom integrations and bindings
  * **Community bindings** -- Python (llama-cpp-python), Go, Rust, Node.js, and more
  * **Model conversion** -- scripts for converting Hugging Face models to GGUF
  * **Direct HF downloads** -- ''%%--hf-repo%%'' flag for pulling models from the Hugging Face Hub
  * **Speculative decoding** -- draft-model-assisted generation for faster throughput
  * **WebAssembly** -- browser-based inference support

===== See Also =====

  * [[ollama|Ollama]]
  * [[vllm|vLLM]]
  * [[text_generation_inference|Text Generation Inference]]
  * [[hugging_face|Hugging Face]]

===== References =====
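
===== Example: Estimating Quantized Model Size =====

The RAM figures in the quantization table can be sanity-checked with back-of-the-envelope arithmetic. A minimal sketch, assuming size is roughly parameters times effective bits per weight divided by 8 (the effective bits-per-weight values used here, 4.5 and 8.5, are approximations chosen for illustration -- each quantization type adds per-block scale overhead on top of its nominal width, and runtime use also needs extra memory for the KV cache):

```python
def quantized_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate on-disk/in-RAM model size in GiB for a given
    effective quantization width (ignores KV cache and activations)."""
    return n_params * bits_per_weight / 8 / 1024**3

# Hypothetical effective bits-per-weight, including per-block scale overhead
for name, bpw in [("Q4_K_M", 4.5), ("Q5_K_M", 5.5), ("Q8_0", 8.5)]:
    print(f"{name}: ~{quantized_size_gib(7e9, bpw):.1f} GiB for a 7B model")
```

The results land close to the table's ~4GB, ~5GB, and ~7GB estimates, which is why 7B models fit comfortably on machines with 8GB of RAM.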