Core Concepts
Reasoning
Memory & Retrieval
Agent Types
Design Patterns
Training & Alignment
Frameworks
Tools
Safety & Security
Evaluation
Meta
Modal is a serverless compute platform purpose-built for AI and ML workloads. It allows developers to run GPU-accelerated code, deploy web endpoints, and schedule batch jobs by simply decorating Python functions – no Docker, Kubernetes, or cloud configuration required. With sub-second cold starts, per-second billing, and access to GPUs from A10G to H100, Modal has become a popular platform for deploying AI agents, running inference, fine-tuning models, and executing compute-intensive agent tasks.
Modal's architecture is optimized for compute-intensive, bursty workloads.
Modal's core abstraction is the decorated Python function:
- `@app.function()` decorator specifying compute requirements (GPU type, memory, image)

No YAML, no Dockerfile, no cloud console – everything is defined in Python.
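For example, a minimal app might look like the following sketch (the app name, function, and resource values are illustrative; running it requires the `modal` package and an account):

```python
import modal

app = modal.App("example-app")  # illustrative app name

# The container image is declared in Python, too – no Dockerfile needed
image = modal.Image.debian_slim().pip_install("numpy")

# Compute requirements live in the decorator: GPU type, memory (MiB), image
@app.function(gpu="T4", memory=4096, image=image)
def square(x: float) -> float:
    return x * x

@app.local_entrypoint()
def main():
    # .remote() runs the function in Modal's cloud; .local() runs it in-process
    print(square.remote(4.0))
```

Running `modal run` on this file executes `main()` locally while dispatching `square` to a T4-backed container in the cloud.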
Modal provides on-demand access to a range of NVIDIA GPUs:
| GPU | VRAM | Approx. Cost/sec | Use Case |
|---|---|---|---|
| T4 | 16 GB | $0.000164 | Light inference, development |
| L4 | 24 GB | $0.000222 | Efficient inference |
| A10G | 24 GB | $0.000306 | General inference, fine-tuning |
| L40S | 48 GB | $0.000389 | Large model inference |
| A100 40GB | 40 GB | $0.000394 | Training, large inference |
| A100 80GB | 80 GB | $0.000463 | Large-scale training |
| H100 | 80 GB | $0.000833 | Maximum performance |
Containers support up to 64 CPUs, 336 GB RAM, and 8 GPUs per instance. Free tier includes $30 in credits.
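A quick way to compare these tiers is to convert the per-second rates into hourly and per-job figures. The sketch below uses the approximate prices from the table above (actual Modal pricing may differ):

```python
# Approximate per-second GPU prices, taken from the table above
GPU_COST_PER_SEC = {
    "T4": 0.000164,
    "L4": 0.000222,
    "A10G": 0.000306,
    "L40S": 0.000389,
    "A100-40GB": 0.000394,
    "A100-80GB": 0.000463,
    "H100": 0.000833,
}

def hourly_cost(gpu: str) -> float:
    """Cost of one GPU-hour at the per-second rate."""
    return GPU_COST_PER_SEC[gpu] * 3600

def job_cost(gpu: str, seconds: float) -> float:
    """Cost of a single job: per-second billing means paying only for runtime."""
    return GPU_COST_PER_SEC[gpu] * seconds

print(f"H100/hour:  ${hourly_cost('H100'):.2f}")        # ~$3.00
print(f"T4/hour:    ${hourly_cost('T4'):.2f}")          # ~$0.59
# A 90-second inference burst on an A100 80GB
print(f"A100 burst: ${job_cost('A100-80GB', 90):.4f}")  # ~$0.0417
```

Per-second billing is what makes bursty agent workloads cheap: a 90-second burst on an A100 80GB costs about four cents, versus paying for a full hour on a reserved instance.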
Expose any function as an HTTP or WebSocket endpoint with zero-downtime updates:
- `@modal.asgi_app()` for FastAPI/Starlette ASGI apps
- `@modal.web_endpoint()` for simple request handlers

Run functions on a schedule:
- `@app.function(schedule=modal.Cron("0 */6 * * *"))` for cron expressions
- `@app.function(schedule=modal.Period(hours=1))` for interval-based scheduling

Persistent, distributed file storage:
- `modal.Volume` creates shared filesystems accessible from any function
- `modal.Secret.from_name("my-secret")` for secure credential access

```python
import modal

app = modal.App("agent-inference")

# Define container image with dependencies
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch", "transformers", "accelerate"
)

# GPU-accelerated inference function
@app.function(gpu="A100", image=image, timeout=300)
def generate_text(prompt: str, max_tokens: int = 200) -> str:
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    model_name = "meta-llama/Llama-3.1-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=max_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Web endpoint for the inference API
@app.function(image=image)
@modal.asgi_app()
def web():
    from fastapi import FastAPI

    api = FastAPI()

    @api.post("/generate")
    async def generate(prompt: str, max_tokens: int = 200):
        result = generate_text.remote(prompt, max_tokens)
        return {"response": result}

    return api

# Scheduled job: refresh model cache daily
@app.function(schedule=modal.Cron("0 2 * * *"), image=image)
def daily_cache_refresh():
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-3.1-8B-Instruct"
    AutoTokenizer.from_pretrained(model_name)
    AutoModelForCausalLM.from_pretrained(model_name)
    print("Model cache refreshed.")
```
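Volumes and Secrets can be combined in a short sketch; the volume name, mount path, and environment variable below are illustrative, and running it requires a Modal account:

```python
import modal

app = modal.App("volume-demo")  # illustrative app name

# A named Volume persists across runs and is shared between functions
vol = modal.Volume.from_name("model-cache", create_if_missing=True)

@app.function(
    volumes={"/cache": vol},                       # mount point inside the container
    secrets=[modal.Secret.from_name("my-secret")], # exposed as environment variables
)
def save_checkpoint(data: bytes):
    import os

    with open("/cache/checkpoint.bin", "wb") as f:
        f.write(data)
    vol.commit()  # flush writes so other functions see them
    # Secret values arrive as environment variables,
    # e.g. os.environ.get("API_KEY") for a key stored in "my-secret"
```

Committing the volume after writing is what makes the checkpoint visible to other functions that mount the same volume.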
Modal has become popular for agent deployment for several reasons: fast cold starts, per-second billing, and Python-native configuration. The platform's high-level architecture:
+---------------+ +---------------------------+
| Your Code | | Modal Platform |
| | | |
| @app.func() |------>| +---------------------+ |
| @modal.asgi | | | Image Builder | |
| @Cron(...) | | | (snapshot + cache) | |
+---------------+ | +---------+-----------+ |
| | |
| +---------v-----------+ |
| | Rust Container | |
| | Runtime (gVisor) | |
| +---------+-----------+ |
| | |
| +---------v-----------+ |
| | GPU Pool | |
| | T4|A10G|A100|H100 | |
| | (multi-cloud) | |
| +---------------------+ |
| |
| +---------------------+ |
| | Ingress / LB | |
| | HTTP + WebSocket | |
| +---------------------+ |
+---------------------------+