Ollama is an open-source tool for running large language models locally on consumer hardware. Built as a Go-based HTTP server on top of the llama.cpp runtime, it simplifies model management, inference, and serving by bundling models with their configurations into single packages for easy deployment.1)
Ollama abstracts the complexities of local LLM inference into a simple CLI and API. Internally, it handles model downloads, loading into RAM or VRAM, quantization, and inference through the llama.cpp backend.2) The Go server manages concurrent requests and keeps models loaded in memory with configurable timeouts. The runtime intelligently offloads computations to GPU when available, falling back to CPU with spillover from VRAM to system RAM for larger models.
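As a sketch of how a client interacts with this behavior, the snippet below builds a `/api/generate` request and parses the newline-delimited JSON chunks the server streams back. The model name and the `keep_alive` value are illustrative, and no server is contacted here; simulated chunks stand in for a live response.

```python
import json

# Build a request body for POST /api/generate. The keep_alive field
# controls how long the server keeps the model loaded in memory after
# this request; "5m" is an illustrative value.
payload = {
    "model": "llama3",  # assumed locally pulled model
    "prompt": "Why is the sky blue?",
    "stream": True,
    "keep_alive": "5m",
}
body = json.dumps(payload)

# The server streams newline-delimited JSON objects; the final chunk
# carries "done": true. These chunks are simulated for illustration.
raw_stream = (
    '{"response": "The sky ", "done": false}\n'
    '{"response": "is blue.", "done": true}\n'
)

text = "".join(
    json.loads(line)["response"]
    for line in raw_stream.splitlines() if line
)
print(text)  # → The sky is blue.
```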
Key architectural features:

- Go-based HTTP server that handles concurrent requests and model lifecycle (load, keep-alive, unload)
- llama.cpp backend for quantized inference
- Automatic GPU offload with fallback to CPU and spillover from VRAM to system RAM
- Self-contained model packages (weights plus configuration) defined via Modelfiles
Ollama supports a wide range of open-source LLMs out of the box, including the Llama, Mistral, Gemma, Phi, and Qwen families.3) Models are available in multiple quantization levels, and custom models can be created via Modelfiles.
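A minimal Modelfile might look like the following; the base model name, parameter value, and system prompt are illustrative choices, not prescribed values:

```
FROM llama3
PARAMETER temperature 0.7
SYSTEM You are a concise technical assistant.
```

The custom model is then built with `ollama create my-assistant -f Modelfile` and started with `ollama run my-assistant`.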
Ollama exposes a native REST API on port 11434 (configurable via OLLAMA_HOST), with OpenAI-compatible endpoints for seamless integration:4)
- `/api/generate` – single-turn completions with streaming support
- `/api/chat` – multi-turn conversational interactions
- `/api/pull` – download models from a registry
- `/api/tags` – list locally available models
- `/api/show` – model details and metadata
- `/api/copy` – duplicate models
- `/api/delete` – remove models
- `/api/embeddings` – generate vector embeddings

The OpenAI-compatible endpoints (under `/v1`) allow drop-in replacement for applications built against the OpenAI API format.
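To illustrate the drop-in compatibility, the sketch below prepares a standard OpenAI-style chat request aimed at the local server's `/v1` endpoint. The request is only constructed, not sent, and the model name is an assumption:

```python
import json
import urllib.request

# OpenAI-style chat payload aimed at Ollama's /v1 compatibility layer.
payload = {
    "model": "llama3",  # assumed local model name
    "messages": [{"role": "user", "content": "Hello!"}],
}

# Target the default OLLAMA_HOST address and port.
req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(req.full_url)  # → http://localhost:11434/v1/chat/completions
```

Because the path and payload match the OpenAI API, an existing client only needs its base URL pointed at the local server.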
Ollama leverages llama.cpp for hardware acceleration across multiple platforms, including NVIDIA GPUs via CUDA, AMD GPUs via ROCm, and Apple silicon via Metal, with optimized CPU fallback.5)
Performance scales with available VRAM: models that fit entirely in VRAM can achieve response latencies below 100 ms, while spillover into system RAM reduces throughput significantly.
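A rough way to reason about whether a model fits in VRAM is to estimate its quantized weight size as parameters × bits-per-weight ÷ 8, plus overhead for the KV cache and activations. The figures below are illustrative back-of-the-envelope numbers, not measured values:

```python
def approx_weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough size of the quantized weights alone, ignoring KV cache
    and activation overhead."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# A 7B model at 4-bit quantization needs roughly 3.5 GB for weights,
# so it fits in an 8 GB GPU with room for the KV cache; at 16 bits
# the same model needs about 14 GB and would spill into system RAM.
print(round(approx_weight_size_gb(7, 4), 1))   # → 3.5
print(round(approx_weight_size_gb(7, 16), 1))  # → 14.0
```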
Ollama provides full Docker integration via the ollama/ollama:latest image.6)
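A typical way to launch the container, mirroring the project's documented quick start; the volume name, container name, and model are conventional choices rather than requirements:

```
# Run the Ollama server in Docker, persisting models in a named
# volume mounted at /root/.ollama and exposing the default API port.
docker run -d \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

# Then execute a model inside the running container:
docker exec -it ollama ollama run llama3
```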
Model data is persisted inside the container at `/root/.ollama`, which should be mounted as a volume so downloaded models survive container restarts.

How Ollama compares with using llama.cpp directly:

| Aspect | Ollama | llama.cpp |
|---|---|---|
| Ease of Use | Beginner-friendly CLI/API; one-command install | Advanced; requires compilation |
| Performance | High via llama.cpp backend; abstracts tuning | Maximum control and efficiency |
| API | Native REST + OpenAI-compatible server | Built-in server mode with OpenAI-compatible chat endpoint |
| Customization | Model management and orchestration | Fine-grained inference parameters |
| Best For | Quick local development and workflows | Performance-critical and embedded applications |