====== How to Use Ollama ======

Ollama is a lightweight tool for running large language models locally. It handles model downloading, quantization, GPU detection, and serving behind a simple CLI and REST API. This guide covers installation, model management, API usage, custom configurations, and integration with other tools.

===== Installation =====

=== Linux ===

Run the official install script:

<code bash>
curl -fsSL https://ollama.com/install.sh | sh
</code>

For manual installation:

<code bash>
curl -L https://ollama.com/download/ollama-linux-amd64 -o ollama
chmod +x ollama
sudo mv ollama /usr/local/bin/
</code>

Set up a systemd service for automatic startup:

<code bash>
sudo tee /etc/systemd/system/ollama.service > /dev/null <
</code>
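As a preview of the REST API mentioned above: Ollama's server listens on ''localhost:11434'' by default, and ''POST /api/generate'' streams its reply as one JSON object per line, with the final object carrying ''"done": true''. The sketch below builds such a request body and shows how the streamed lines are reassembled; the model name is an assumption (substitute any model you have pulled), and the stream here is simulated so the snippet runs without a live server.

```python
import json

# Request body for POST http://localhost:11434/api/generate.
# "llama3.2" is an assumed model name -- use one you have actually pulled.
payload = {
    "model": "llama3.2",
    "prompt": "Why is the sky blue?",
    "stream": True,  # default: the server streams one JSON object per line
}
body = json.dumps(payload)

# Simulated streamed response lines, shaped like Ollama's output:
# intermediate lines carry a "response" fragment, the last has "done": true.
simulated_stream = [
    '{"response": "The sky ", "done": false}',
    '{"response": "is blue.", "done": false}',
    '{"done": true}',
]

# Concatenate the "response" fragments to recover the full reply.
text = "".join(json.loads(line).get("response", "") for line in simulated_stream)
print(text)  # The sky is blue.
```

Against a running server, the same body can be sent with ''curl http://localhost:11434/api/generate -d '{...}''' and the lines parsed exactly as above.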