====== Ollama ======

**Ollama** is an open-source tool for running large language models locally on consumer hardware. Built as a Go-based HTTP server on top of the [[llama_cpp|llama.cpp]] runtime, it simplifies model management, inference, and serving by bundling models with their configurations into single packages for easy deployment.((source [[https://github.com/ollama/ollama|Ollama GitHub Repository]]))

===== How It Works =====

Ollama abstracts the complexities of local LLM inference behind a simple CLI and API. Internally, it handles model downloads, loading into RAM or VRAM, quantization, and inference through the llama.cpp backend.((source [[https://daily.dev/blog/running-llms-locally-ollama-llama-cpp-self-hosted-ai-developers|Running LLMs Locally - daily.dev]])) The Go server manages concurrent requests and keeps models loaded in memory with configurable timeouts. The runtime offloads computation to the GPU when one is available, falling back to the CPU and spilling over from VRAM to system RAM for models too large to fit.

Key architectural features:

  * **Model packaging** -- bundles weights, tokenizer, and configuration into a single downloadable unit
  * **Quantization** -- serves pre-quantized model variants so larger models fit on smaller GPUs or CPUs
  * **Multi-model orchestration** -- manages multiple loaded models concurrently
  * **Memory management** -- configurable model keep-alive timeouts (five minutes by default)

===== Supported Models =====

Ollama supports a wide range of open-source LLMs out of the box:((source [[https://ollama.com/library|Ollama Model Library]]))

  * **Llama** -- Meta's Llama 2, Llama 3, and variants
  * **Mistral** -- Mistral 7B, Mixtral
  * **Gemma** -- Google's Gemma models
  * **Phi** -- Microsoft's Phi series
  * **Qwen** -- Alibaba's Qwen models
  * **DeepSeek** -- DeepSeek Coder and chat models
  * **CodeLlama** -- code-specialized Llama variants

Models are available in multiple quantization levels, and custom models can be created via Modelfiles.
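As an illustration of the Modelfile format, the following sketch defines a custom model on top of a pulled base model; the base model, parameter values, and system prompt are illustrative, not prescriptive:

```
# Modelfile -- illustrative sketch; base model and parameter values are arbitrary
FROM llama3
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM """You are a concise technical assistant."""
```

Such a file is built into a runnable model with ''ollama create my-assistant -f Modelfile'' and used with ''ollama run my-assistant''.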
===== REST API =====

Ollama exposes a native REST API on port 11434 (configurable via ''OLLAMA_HOST''), along with OpenAI-compatible endpoints for seamless integration:((source [[https://github.com/ollama/ollama/blob/main/docs/api.md|Ollama API Documentation]]))

  * ''/api/generate'' -- single-turn completions with streaming support
  * ''/api/chat'' -- conversational interactions
  * ''/api/pull'' -- download models from the registry
  * ''/api/list'' -- list locally available models
  * ''/api/show'' -- model details and metadata
  * ''/api/copy'' -- duplicate models
  * ''/api/delete'' -- remove models
  * ''/api/embeddings'' -- generate vector embeddings

The OpenAI-compatible endpoints allow drop-in replacement for applications built against the OpenAI API format.

===== GPU Acceleration =====

Ollama leverages llama.cpp for hardware acceleration across multiple platforms:((source [[https://www.bentoml.com/blog/running-local-llms-with-ollama-3-levels-from-local-to-distributed-inference|Running Local LLMs with Ollama - BentoML]]))

  * **NVIDIA CUDA** -- automatic detection and full GPU utilization, including Docker GPU passthrough
  * **AMD ROCm** -- supported through the llama.cpp backend
  * **Apple Metal** -- native optimization for M-series chips on macOS

Performance scales with available VRAM. Models that fit entirely in VRAM can achieve per-token latencies below 100 ms, while spillover to system RAM reduces throughput significantly.
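A minimal sketch of calling the native ''/api/generate'' endpoint from Python, using only the standard library; the model name and prompt are illustrative, and the host assumes the default port with no ''OLLAMA_HOST'' override:

```python
import json
from urllib import request

OLLAMA_HOST = "http://localhost:11434"  # Ollama's default; adjust if OLLAMA_HOST is set

def build_generate_request(model, prompt, stream=False):
    """Build an HTTP request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode("utf-8")
    return request.Request(
        f"{OLLAMA_HOST}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model, prompt):
    """Send a non-streaming completion request; requires a running Ollama server."""
    with request.urlopen(build_generate_request(model, prompt)) as resp:
        # With "stream": false, the server returns one JSON object whose
        # "response" field holds the full completion text.
        return json.loads(resp.read())["response"]
```

With a server running and the model pulled, ''generate("llama3", "Why is the sky blue?")'' returns the completion as a string; setting ''"stream": true'' instead yields newline-delimited JSON chunks.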
===== Docker Support =====

Ollama provides full Docker integration via the ''ollama/ollama:latest'' image:((source [[https://oneuptime.com/blog/post/2026-01-25-ollama-local-llm-development/view|Ollama Local LLM Development]]))

  * Volume mounts for model persistence (''/root/.ollama'')
  * NVIDIA GPU passthrough via Docker Compose device reservations
  * Pairs with UIs such as Open WebUI for a complete local AI environment
  * Exposes port 11434 (API) alongside optional UI ports

===== Ollama vs llama.cpp =====

^ Aspect ^ Ollama ^ llama.cpp ^
| Ease of Use | Beginner-friendly CLI/API; one-command install | Advanced; requires compilation |
| Performance | High via llama.cpp backend; abstracts tuning | Maximum control and efficiency |
| API | Native + OpenAI-compatible server | Native server mode; no built-in OpenAI compatibility |
| Customization | Model management and orchestration | Fine-grained inference parameters |
| Best For | Quick local development and workflows | Performance-critical and embedded applications |

===== See Also =====

  * [[llama_cpp|llama.cpp]]
  * [[vllm|vLLM]]
  * [[hugging_face|Hugging Face]]
  * [[text_generation_inference|Text Generation Inference]]

===== References =====