Ollama is a lightweight tool for running large language models locally. It handles model downloading, quantization, GPU detection, and serving behind a simple CLI and REST API. This guide covers installation, model management, API usage, custom configurations, and integration with other tools.
Run the official install script:
curl -fsSL https://ollama.com/install.sh | sh
For manual installation:
curl -L https://ollama.com/download/ollama-linux-amd64 -o ollama
chmod +x ollama
sudo mv ollama /usr/local/bin/
Set up as a systemd service for automatic startup:
sudo tee /etc/systemd/system/ollama.service > /dev/null <<EOF
[Unit]
Description=Ollama Server
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target
EOF
sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama
sudo systemctl daemon-reload && sudo systemctl enable --now ollama
Download the installer from ollama.com, or install via Homebrew:
brew install ollama
Ollama runs as a background service automatically after installation. Apple Silicon GPUs are supported natively.
Download OllamaSetup.exe from ollama.com and run the installer; no admin rights are required. Requires Windows 10 22H2 or later. Supports NVIDIA (driver 452.39+) and AMD Radeon GPUs.
docker run -d --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama
Basic model management commands:
# List available local models
ollama list

# Pull a model without running it
ollama pull llama3:8b

# Run a model interactively
ollama run llama3:8b

# Show model details
ollama show llama3:8b

# Remove a model
ollama rm llama3:8b

# List running models
ollama ps

# Stop a running model
ollama stop llama3:8b
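The equivalent of ollama list is also available over the REST API at /api/tags. A minimal stdlib-only sketch (the helper names here are illustrative, not part of Ollama):

```python
import json
import urllib.request

def model_names(tags_payload):
    """Extract model names from a decoded /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]

def list_local_models(host="http://localhost:11434"):
    """Fetch the locally installed models, like `ollama list`."""
    with urllib.request.urlopen(f"{host}/api/tags") as resp:
        return model_names(json.load(resp))
```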
Browse the full library at ollama.com/library. Popular models include:
| Model | Sizes | Strengths |
|---|---|---|
| Llama 3 / 3.1 | 8B, 70B, 405B | Best general-purpose open model |
| Mistral / Mixtral | 7B, 8x7B, 8x22B | Fast inference, strong reasoning |
| Gemma 2 | 2B, 9B, 27B | Google's efficient model family |
| Qwen 2.5 / Qwen 3 | 0.5B-72B | Excellent multilingual, strong coding |
| DeepSeek V3 / R1 | 7B-671B | Strong reasoning and math |
| Phi-3 | 3.8B, 14B | Microsoft's compact models |
| CodeLlama | 7B, 13B, 34B | Specialized for code generation |
Pull a specific tag to select a quantization variant: ollama pull llama3:8b-q4_0 fetches the 4-bit quantized build.
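As a rough back-of-envelope check before pulling, weight memory scales with parameter count times bits per weight. This sketch ignores KV cache, activations, and runtime overhead, so treat the result as a lower bound:

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight memory in decimal GB: params * bits / 8 bytes.

    A rough lower bound only; KV cache and runtime overhead add more.
    """
    return num_params * bits_per_weight / 8 / 1e9

# An 8B model at 4-bit quantization needs roughly 4 GB for weights alone,
# versus about 16 GB at full fp16 precision.
print(weight_memory_gb(8e9, 4))   # 4.0
print(weight_memory_gb(8e9, 16))  # 16.0
```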
Ollama serves a REST API on http://localhost:11434 by default.
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Explain quantum computing in simple terms",
"stream": false
}'
curl http://localhost:11434/api/chat -d '{
"model": "llama3",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Docker?"}
]
}'
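When stream is left at its default of true, these endpoints return newline-delimited JSON, and the full reply is assembled by concatenating the fragments. A stdlib-only client sketch for /api/generate (the function names are mine, not Ollama's):

```python
import json
import urllib.request

def join_stream(lines):
    """Concatenate the 'response' fragments of a streamed /api/generate reply."""
    parts = []
    for line in lines:
        if not line.strip():
            continue
        obj = json.loads(line)
        parts.append(obj.get("response", ""))
        if obj.get("done"):  # final object in the stream
            break
    return "".join(parts)

def generate(prompt, model="llama3", host="http://localhost:11434"):
    """POST to /api/generate and return the assembled response text."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps({"model": model, "prompt": prompt}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return join_stream(resp)  # the body is one JSON object per line
```

For /api/chat, the streamed objects carry the text under message.content rather than response.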
Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1. This means any tool or library that supports the OpenAI API can connect to Ollama by changing the base URL:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
model="llama3",
messages=[{"role": "user", "content": "Hello"}]
)
Modelfiles let you create custom model configurations with system prompts, parameters, and LoRA adapters.
Create a file named Modelfile:
FROM llama3
SYSTEM "You are a senior Python developer. Respond with clean, well-documented code."
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
Build and run:
ollama create python-expert -f Modelfile
ollama run python-expert
To apply a LoRA adapter, reference it with ADAPTER:

FROM llama3
ADAPTER ./my-lora-adapter.gguf
SYSTEM "You are a domain-specific assistant."
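If you generate many variants, the Modelfile text can be assembled programmatically and written to disk before running ollama create. A small sketch (build_modelfile is a hypothetical helper, not an Ollama API):

```python
def build_modelfile(base, system, params=None):
    """Assemble Modelfile text from a base model, system prompt, and parameters."""
    lines = [f"FROM {base}", f'SYSTEM "{system}"']
    for key, value in (params or {}).items():
        lines.append(f"PARAMETER {key} {value}")
    return "\n".join(lines)

# Write the result to a file named Modelfile, then build with:
#   ollama create python-expert -f Modelfile
text = build_modelfile("llama3", "You are a senior Python developer.",
                       {"temperature": 0.3, "num_ctx": 8192})
```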
Pass --gpus all to enable GPU access inside the container.
Verify GPU detection by running a model and checking ollama ps – it will show GPU layers loaded.
A ChatGPT-like web interface for Ollama:
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
Access at http://localhost:3000.
from langchain_ollama import OllamaLLM
llm = OllamaLLM(model="llama3")
response = llm.invoke("What is machine learning?")
- Use quantized variants (q4_0, q4_K_M) for faster inference and lower memory usage
- Set OLLAMA_NUM_PARALLEL=4 for concurrent request handling
- Set OLLAMA_MAX_LOADED_MODELS=2 to keep multiple models warm in memory
- Use ollama ps to see loaded models and GPU layer counts
- Set OLLAMA_MODELS=/path/to/nvme/storage to keep models on fast storage

| Problem | Solution |
|---|---|
| Command not found | Add Ollama to PATH or reinstall |
| GPU not detected | Update drivers, restart ollama serve |
| Out of disk space | Set OLLAMA_MODELS to a larger drive |
| Unicode rendering issues (Windows) | Change terminal font to one with full Unicode support |
| Slow model downloads | Ensure stable network, try ollama pull again |
| Service fails to start | Check systemctl status ollama and logs |
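On Linux, the OLLAMA_* environment variables are most reliably set on the systemd service itself via a drop-in override rather than a shell profile. A sketch; adjust the values and path to your hardware:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_MODELS=/path/to/nvme/storage"
```

Apply with sudo systemctl daemon-reload && sudo systemctl restart ollama.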