Ollama is an open-source tool for running large language models locally on consumer hardware. Built as a Go-based HTTP server on top of the llama.cpp runtime, it simplifies model management, inference, and serving by bundling models with their configurations into single packages for easy deployment.1)
Ollama abstracts the complexities of local LLM inference into a simple CLI and API. Internally, it handles model downloads, loading into RAM or VRAM, quantization, and inference through the llama.cpp backend.2) The Go server manages concurrent requests and keeps models loaded in memory with configurable timeouts. The runtime intelligently offloads computations to GPU when available, falling back to CPU with spillover from VRAM to system RAM for larger models.
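As a sketch of how a client interacts with this behavior, the snippet below builds a `/api/generate` request and parses the newline-delimited JSON chunks the server streams back. The model name and the `keep_alive` value are illustrative, and no server is contacted here; simulated chunks stand in for a live response.

```python
import json

# Build a request body for POST /api/generate. The keep_alive field
# controls how long the server keeps the model loaded in memory after
# this request; "5m" is an illustrative value.
payload = {
    "model": "llama3",  # assumed locally pulled model
    "prompt": "Why is the sky blue?",
    "stream": True,
    "keep_alive": "5m",
}
body = json.dumps(payload)

# The server streams newline-delimited JSON objects; the final chunk
# carries "done": true. These chunks are simulated for illustration.
raw_stream = (
    '{"response": "The sky ", "done": false}\n'
    '{"response": "is blue.", "done": true}\n'
)

text = "".join(
    json.loads(line)["response"]
    for line in raw_stream.splitlines() if line
)
print(text)  # → The sky is blue.
```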
Key architectural features:

- Go-based HTTP server that handles concurrent requests and model lifecycle (load, keep-alive, unload)
- llama.cpp backend for quantized inference
- Automatic GPU offload with fallback to CPU and spillover from VRAM to system RAM
- Self-contained model packages (weights plus configuration) defined via Modelfiles
Ollama supports a wide range of open-source LLMs out of the box, including the Llama, Mistral, Gemma, Phi, and Qwen families.3) Models are available in multiple quantization levels, and custom models can be created via Modelfiles.
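A minimal Modelfile might look like the following; the base model name, parameter value, and system prompt are illustrative choices, not prescribed values:

```
FROM llama3
PARAMETER temperature 0.7
SYSTEM You are a concise technical assistant.
```

The custom model is then built with `ollama create my-assistant -f Modelfile` and started with `ollama run my-assistant`.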
Ollama exposes a native REST API on port 11434 (configurable via OLLAMA_HOST), with OpenAI-compatible endpoints for seamless integration:4)
- `/api/generate` – single-turn completions with streaming support
- `/api/chat` – multi-turn conversational interactions
- `/api/pull` – download models from a registry
- `/api/tags` – list locally available models
- `/api/show` – model details and metadata
- `/api/copy` – duplicate models
- `/api/delete` – remove models
- `/api/embeddings` – generate vector embeddings

The OpenAI-compatible endpoints (under `/v1`) allow drop-in replacement for applications built against the OpenAI API format.
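To illustrate the drop-in compatibility, the sketch below prepares a standard OpenAI-style chat request aimed at the local server's `/v1` endpoint. The request is only constructed, not sent, and the model name is an assumption:

```python
import json
import urllib.request

# OpenAI-style chat payload aimed at Ollama's /v1 compatibility layer.
payload = {
    "model": "llama3",  # assumed local model name
    "messages": [{"role": "user", "content": "Hello!"}],
}

# Target the default OLLAMA_HOST address and port.
req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(req.full_url)  # → http://localhost:11434/v1/chat/completions
```

Because the path and payload match the OpenAI API, an existing client only needs its base URL pointed at the local server.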
Ollama leverages llama.cpp for hardware acceleration across multiple platforms, including NVIDIA GPUs via CUDA, AMD GPUs via ROCm, and Apple silicon via Metal, with optimized CPU fallback.5)
Performance scales with available VRAM: models that fit entirely in VRAM can achieve response latencies below 100 ms, while spillover into system RAM reduces throughput significantly.
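A rough way to reason about whether a model fits in VRAM is to estimate its quantized weight size as parameters × bits-per-weight ÷ 8, plus overhead for the KV cache and activations. The figures below are illustrative back-of-the-envelope numbers, not measured values:

```python
def approx_weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough size of the quantized weights alone, ignoring KV cache
    and activation overhead."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# A 7B model at 4-bit quantization needs roughly 3.5 GB for weights,
# so it fits in an 8 GB GPU with room for the KV cache; at 16 bits
# the same model needs about 14 GB and would spill into system RAM.
print(round(approx_weight_size_gb(7, 4), 1))   # → 3.5
print(round(approx_weight_size_gb(7, 16), 1))  # → 14.0
```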
Ollama provides full Docker integration via the ollama/ollama:latest image.6)
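A typical way to launch the container, mirroring the project's documented quick start; the volume name, container name, and model are conventional choices rather than requirements:

```
# Run the Ollama server in Docker, persisting models in a named
# volume mounted at /root/.ollama and exposing the default API port.
docker run -d \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama

# Then execute a model inside the running container:
docker exec -it ollama ollama run llama3
```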
Model data is persisted inside the container at `/root/.ollama`, which should be mounted as a volume so downloaded models survive container restarts.

How Ollama compares with using llama.cpp directly:

| Aspect | Ollama | llama.cpp |
|---|---|---|
| Ease of Use | Beginner-friendly CLI/API; one-command install | Advanced; requires compilation |
| Performance | High via llama.cpp backend; abstracts tuning | Maximum control and efficiency |
| API | Native REST + OpenAI-compatible server | Built-in server mode with OpenAI-compatible chat endpoint |
| Customization | Model management and orchestration | Fine-grained inference parameters |
| Best For | Quick local development and workflows | Performance-critical and embedded applications |