Ollama is a lightweight tool for running large language models locally. It handles model downloading, quantization, GPU detection, and serving behind a simple CLI and REST API. This guide covers installation, model management, API usage, custom configurations, and integration with other tools.
Run the official install script:
curl -fsSL https://ollama.com/install.sh | sh
For manual installation:
curl -L https://ollama.com/download/ollama-linux-amd64 -o ollama
chmod +x ollama
sudo mv ollama /usr/local/bin/
Set up as a systemd service for automatic startup:
sudo tee /etc/systemd/system/ollama.service > /dev/null <<EOF
[Unit]
Description=Ollama Server
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target
EOF
sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama
sudo systemctl daemon-reload && sudo systemctl enable --now ollama
Download the installer from ollama.com, or install via Homebrew:
brew install ollama
Ollama runs as a background service automatically after installation. Apple Silicon GPUs are supported natively.
Download OllamaSetup.exe from ollama.com and run the installer; no admin rights are required. Requires Windows 10 22H2 or later. Supports NVIDIA (driver 452.39+) and AMD Radeon GPUs.
docker run -d --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama
Basic model management commands:
# List available local models
ollama list

# Pull a model without running it
ollama pull llama3:8b

# Run a model interactively
ollama run llama3:8b

# Show model details
ollama show llama3:8b

# Remove a model
ollama rm llama3:8b

# List running models
ollama ps

# Stop a running model
ollama stop llama3:8b
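The equivalent of ollama list is also available over the REST API at /api/tags. A minimal stdlib-only sketch (the helper names here are illustrative, not part of Ollama):

```python
import json
import urllib.request

def model_names(tags_payload):
    """Extract model names from a decoded /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]

def list_local_models(host="http://localhost:11434"):
    """Fetch the locally installed models, like `ollama list`."""
    with urllib.request.urlopen(f"{host}/api/tags") as resp:
        return model_names(json.load(resp))
```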
Browse the full library at ollama.com/library. Popular models include:
| Model | Sizes | Strengths |
|---|---|---|
| Llama 3 / 3.1 | 8B, 70B, 405B | Best general-purpose open model |
| Mistral / Mixtral | 7B, 8x7B, 8x22B | Fast inference, strong reasoning |
| Gemma 2 | 2B, 9B, 27B | Google's efficient model family |
| Qwen 2.5 / Qwen 3 | 0.5B-72B | Excellent multilingual, strong coding |
| DeepSeek V3 / R1 | 7B-671B | Strong reasoning and math |
| Phi-3 | 3.8B, 14B | Microsoft's compact models |
| CodeLlama | 7B, 13B, 34B | Specialized for code generation |
Pull a specific tag to select a quantization variant: ollama pull llama3:8b-q4_0 fetches the 4-bit quantized build.
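As a rough back-of-envelope check before pulling, weight memory scales with parameter count times bits per weight. This sketch ignores KV cache, activations, and runtime overhead, so treat the result as a lower bound:

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight memory in decimal GB: params * bits / 8 bytes.

    A rough lower bound only; KV cache and runtime overhead add more.
    """
    return num_params * bits_per_weight / 8 / 1e9

# An 8B model at 4-bit quantization needs roughly 4 GB for weights alone,
# versus about 16 GB at full fp16 precision.
print(weight_memory_gb(8e9, 4))   # 4.0
print(weight_memory_gb(8e9, 16))  # 16.0
```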
Ollama serves a REST API on http://localhost:11434 by default.
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Explain quantum computing in simple terms",
"stream": false
}'
curl http://localhost:11434/api/chat -d '{
"model": "llama3",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Docker?"}
]
}'
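When stream is left at its default of true, these endpoints return newline-delimited JSON, and the full reply is assembled by concatenating the fragments. A stdlib-only client sketch for /api/generate (the function names are mine, not Ollama's):

```python
import json
import urllib.request

def join_stream(lines):
    """Concatenate the 'response' fragments of a streamed /api/generate reply."""
    parts = []
    for line in lines:
        if not line.strip():
            continue
        obj = json.loads(line)
        parts.append(obj.get("response", ""))
        if obj.get("done"):  # final object in the stream
            break
    return "".join(parts)

def generate(prompt, model="llama3", host="http://localhost:11434"):
    """POST to /api/generate and return the assembled response text."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps({"model": model, "prompt": prompt}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return join_stream(resp)  # the body is one JSON object per line
```

For /api/chat, the streamed objects carry the text under message.content rather than response.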
Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1. This means any tool or library that supports the OpenAI API can connect to Ollama by changing the base URL:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
model="llama3",
messages=[{"role": "user", "content": "Hello"}]
)
Modelfiles let you create custom model configurations with system prompts, parameters, and LoRA adapters.
Create a file named Modelfile:
FROM llama3
SYSTEM "You are a senior Python developer. Respond with clean, well-documented code."
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
Build and run:
ollama create python-expert -f Modelfile
ollama run python-expert
To apply a LoRA adapter, reference it with ADAPTER:

FROM llama3
ADAPTER ./my-lora-adapter.gguf
SYSTEM "You are a domain-specific assistant."
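If you generate many variants, the Modelfile text can be assembled programmatically and written to disk before running ollama create. A small sketch (build_modelfile is a hypothetical helper, not an Ollama API):

```python
def build_modelfile(base, system, params=None):
    """Assemble Modelfile text from a base model, system prompt, and parameters."""
    lines = [f"FROM {base}", f'SYSTEM "{system}"']
    for key, value in (params or {}).items():
        lines.append(f"PARAMETER {key} {value}")
    return "\n".join(lines)

# Write the result to a file named Modelfile, then build with:
#   ollama create python-expert -f Modelfile
text = build_modelfile("llama3", "You are a senior Python developer.",
                       {"temperature": 0.3, "num_ctx": 8192})
```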
Pass --gpus all to enable GPU access inside the container.
Verify GPU detection by running a model and checking ollama ps – it will show GPU layers loaded.
A ChatGPT-like web interface for Ollama:
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
Access at http://localhost:3000.
from langchain_ollama import OllamaLLM
llm = OllamaLLM(model="llama3")
response = llm.invoke("What is machine learning?")
- Use quantized variants (q4_0, q4_K_M) for faster inference and lower memory usage
- Set OLLAMA_NUM_PARALLEL=4 for concurrent request handling
- Set OLLAMA_MAX_LOADED_MODELS=2 to keep multiple models warm in memory
- Use ollama ps to see loaded models and GPU layer counts
- Set OLLAMA_MODELS=/path/to/nvme/storage to keep models on fast storage

| Problem | Solution |
|---|---|
| Command not found | Add Ollama to PATH or reinstall |
| GPU not detected | Update drivers, restart ollama serve |
| Out of disk space | Set OLLAMA_MODELS to a larger drive |
| Unicode rendering issues (Windows) | Change terminal font to one with full Unicode support |
| Slow model downloads | Ensure stable network, try ollama pull again |
| Service fails to start | Check systemctl status ollama and logs |
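On Linux, the OLLAMA_* environment variables are most reliably set on the systemd service itself via a drop-in override rather than a shell profile. A sketch; adjust the values and path to your hardware:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_MODELS=/path/to/nvme/storage"
```

Apply with sudo systemctl daemon-reload && sudo systemctl restart ollama.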