====== llama.cpp ======

**llama.cpp** is a lightweight C/C++ inference engine for running large language models locally with minimal dependencies. Built on the GGML tensor library, it prioritizes CPU-first efficiency, broad hardware support, and raw performance across consumer devices.((source [[https://github.com/ggml-org/llama.cpp|llama.cpp GitHub Repository]]))

===== Architecture =====

The inference pipeline includes tokenization, a forward pass through GGML computation graphs, sampling over next-token probabilities, and detokenization. Key capabilities include KV cache management, grammar-constrained generation, speculative decoding (2-3x throughput gains), and multimodal support for vision-language models like LLaVA and Moondream.((source [[https://www.sandgarden.com/learn/llama-cpp|llama.cpp Overview - Sandgarden]]))

The engine requires no Python dependencies or runtime bloat, making it suitable for embedded and edge deployments.

===== GGUF Model Format =====

**GGUF** (GPT-Generated Unified Format) is a compact binary format that bundles model weights, tokenizer, quantization metadata, and configuration into a single file.((source [[https://pyimagesearch.com/2024/08/26/llama-cpp-the-ultimate-guide-to-efficient-llm-inference-and-applications/|llama.cpp Ultimate Guide - PyImageSearch]]))

Key properties:

  * **Memory mapping** -- enables rapid loading via mmap, ideal for edge devices where I/O is a bottleneck
  * **Self-contained** -- all metadata embedded, no external config files needed
  * **Architecture-agnostic** -- supports models beyond LLaMA, including custom tokenizers
  * **Instant startup** -- fast cold-start times compared to framework-based loading

===== Quantization Methods =====

Quantization compresses models to 1.5-8 bits per weight, enabling 7B+ parameter LLMs to run in 4-8GB of RAM:((source [[https://www.decodesfuture.com/articles/llama-cpp-gguf-quantization-guide-2026|GGUF Quantization Guide 2026]]))

^ Quantization ^ Bits ^ Use Case ^ ~RAM for 7B Model ^
| Q4_K_M | ~4 | Speed/size balance | ~4GB |
| Q5_K_M | ~5 | Improved accuracy | ~5GB |
| Q8_0 | 8 | Near-FP16 quality | ~7GB |

The K-quant variants (K_S, K_M, K_L) use super-block quantization with per-block scales and minimums for better accuracy at a given size. Hybrid CPU+GPU offloading allows larger models to partially reside in VRAM.

===== Metal and CUDA GPU Acceleration =====

llama.cpp supports hybrid CPU+GPU execution across multiple backends:((source [[https://www.sandgarden.com/learn/llama-cpp|llama.cpp - Sandgarden]]))

  * **CUDA (NVIDIA)** -- enabled with ''-DGGML_CUDA=ON'', supports architectures through Blackwell (sm_121)
  * **Metal (Apple Silicon)** -- native integration for M-series chips via Apple's Metal API
  * **Vulkan** -- cross-platform GPU backend, highlighted at Vulkanised 2026
  * **SYCL (Intel)** -- oneAPI backend for Intel GPUs
  * **ROCm (AMD)** -- AMD GPU support
  * **OpenCL** -- legacy cross-platform support

Layers can be split between CPU and GPU at load time to fit VRAM limits using the ''-ngl'' flag.

===== Server Mode =====

**llama-server** provides a production-ready HTTP server for streaming inference:((source [[https://www.sandgarden.com/learn/llama-cpp|llama.cpp - Sandgarden]]))

  * Configurable threading and context size
  * Real-time token streaming for latency-sensitive applications
  * Sampling parameter control (temperature, top-k, top-p)
  * Chat and completion endpoints
  * Embedding generation support

===== Ecosystem =====

  * **C API** -- stable interface for custom integrations and bindings
  * **Community bindings** -- Python (llama-cpp-python), Go, Rust, Node.js, and more
  * **Model conversion** -- scripts for converting Hugging Face models to GGUF
  * **Direct HF downloads** -- ''%%--hf-repo%%'' flag for pulling models from the Hugging Face Hub
  * **Speculative decoding** -- draft-model-assisted generation for faster throughput
  * **WebAssembly** -- browser-based inference support

===== See Also =====

  * [[ollama|Ollama]]
  * [[vllm|vLLM]]
  * [[text_generation_inference|Text Generation Inference]]
  * [[hugging_face|Hugging Face]]

===== References =====
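
===== Example: Estimating Quantized Model Size =====

The RAM figures in the quantization table can be sanity-checked with back-of-the-envelope arithmetic. A minimal sketch, assuming size is roughly parameters times effective bits per weight divided by 8 (the effective bits-per-weight values used here, 4.5 and 8.5, are approximations chosen for illustration -- each quantization type adds per-block scale overhead on top of its nominal width, and runtime use also needs extra memory for the KV cache):

```python
def quantized_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate on-disk/in-RAM model size in GiB for a given
    effective quantization width (ignores KV cache and activations)."""
    return n_params * bits_per_weight / 8 / 1024**3

# Hypothetical effective bits-per-weight, including per-block scale overhead
for name, bpw in [("Q4_K_M", 4.5), ("Q5_K_M", 5.5), ("Q8_0", 8.5)]:
    print(f"{name}: ~{quantized_size_gib(7e9, bpw):.1f} GiB for a 7B model")
```

The results land close to the table's ~4GB, ~5GB, and ~7GB estimates, which is why 7B models fit comfortably on machines with 8GB of RAM.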