====== Modal Compute ======
**Modal** is a serverless compute platform purpose-built for AI and ML workloads. It lets developers run GPU-accelerated code, deploy web endpoints, and schedule batch jobs simply by decorating Python functions -- no Docker, Kubernetes, or cloud configuration required. With sub-second cold starts, per-second billing, and access to GPUs from T4 to H100, Modal has become a popular platform for deploying AI agents, running inference, fine-tuning models, and executing compute-intensive agent tasks.
===== Architecture =====
Modal's architecture is optimized for compute-intensive, bursty workloads:
* **Custom Container Runtime** -- Written in Rust for performance, based on gVisor (not Kubernetes). Achieves sub-second cold starts through aggressive container snapshotting.
* **Multi-Cloud GPU Pool** -- Aggregates GPU capacity across multiple cloud providers (including Oracle Cloud Infrastructure), dynamically routing workloads to optimal locations based on availability and cost.
* **Ingress Layer** -- TCP network load balancer, Caddy reverse proxy, and modal-http routing to serverless function invocation. Supports both HTTP and full WebSocket connections.
* **Scale-to-Zero** -- Containers automatically spin down when idle and scale up on demand, ensuring you only pay for actual compute time.
* **Image Caching** -- Container images are cached and snapshotted for rapid iteration. Dependency installation happens at build time, not runtime.
===== How It Works =====
Modal's core abstraction is the **decorated Python function**:
- Write a regular Python function
- Add a ''@app.function()'' decorator specifying compute requirements (GPU type, memory, image)
- Modal builds a container image, snapshots it, and stores it
- On invocation, Modal provisions a container with the specified GPU in under a second
- The function executes, returns results, and the container tears down
- You pay per-second for the exact compute time used
No YAML, no Dockerfile, no cloud console -- everything is defined in Python.
===== GPU Types and Pricing =====
Modal provides on-demand access to a range of NVIDIA GPUs:
^ GPU ^ VRAM ^ Approx. Cost/sec ^ Use Case ^
| T4 | 16 GB | $0.000164 | Light inference, development |
| L4 | 24 GB | $0.000222 | Efficient inference |
| A10G | 24 GB | $0.000306 | General inference, fine-tuning |
| L40S | 48 GB | $0.000389 | Large model inference |
| A100 40GB | 40 GB | $0.000394 | Training, large inference |
| A100 80GB | 80 GB | $0.000463 | Large-scale training |
| H100 | 80 GB | $0.000833 | Maximum performance |
Containers support up to **64 CPUs**, **336 GB RAM**, and **8 GPUs** per instance. Free tier includes $30 in credits.
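Per-second billing makes cost estimation simple multiplication. Using the approximate rates from the table above, a hypothetical 10-minute run on an A100 40GB works out as follows:

```python
# Approximate per-second USD rates, copied from the table above
RATES = {
    "T4": 0.000164,
    "A10G": 0.000306,
    "A100-40GB": 0.000394,
    "H100": 0.000833,
}

def estimate_cost(gpu: str, seconds: float) -> float:
    """Estimate the cost of a run of the given duration on the given GPU."""
    return RATES[gpu] * seconds

# A 10-minute (600 s) fine-tuning run on an A100 40GB
cost = estimate_cost("A100-40GB", 600)
print(f"${cost:.4f}")  # → $0.2364
```

With scale-to-zero, idle time between runs costs nothing, so the per-run figure is the whole bill.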
===== Key Features =====
== Web Endpoints ==
Expose any function as an HTTP or WebSocket endpoint with zero-downtime updates:
* ''@modal.asgi_app()'' for FastAPI/Starlette ASGI apps
* ''@modal.web_endpoint()'' for simple request handlers
* Full WebSocket support for real-time bidirectional messaging
* Automatic TLS termination and domain routing
== Scheduled Jobs (Cron) ==
Run functions on a schedule:
* ''@app.function(schedule=modal.Cron("0 */6 * * *"))'' for cron expressions
* ''@app.function(schedule=modal.Period(hours=1))'' for interval-based scheduling
* Ideal for data pipelines, model retraining, and periodic agent tasks
== Volumes ==
Persistent, distributed file storage:
* ''modal.Volume'' creates shared filesystems accessible from any function
* Optimized for large model weights and datasets
* Read/write access from multiple concurrent workers
== Secrets Management ==
* ''modal.Secret.from_name("my-secret")'' for secure credential access
* Environment variable injection into function containers
* Integrates with external secret stores
===== Code Example =====
<code python>
import modal

app = modal.App("agent-inference")

# Define container image with dependencies (fastapi is needed for the ASGI endpoint)
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch", "transformers", "accelerate", "fastapi[standard]"
)

# GPU-accelerated inference function
@app.function(gpu="A100", image=image, timeout=300)
def generate_text(prompt: str, max_tokens: int = 200) -> str:
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-3.1-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=max_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Web endpoint for the inference API
@app.function(image=image)
@modal.asgi_app()
def web():
    from fastapi import FastAPI

    api = FastAPI()

    @api.post("/generate")
    async def generate(prompt: str, max_tokens: int = 200):
        result = generate_text.remote(prompt, max_tokens)
        return {"response": result}

    return api

# Scheduled job: refresh the model cache daily at 02:00 UTC
@app.function(schedule=modal.Cron("0 2 * * *"), image=image)
def daily_cache_refresh():
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-3.1-8B-Instruct"
    AutoTokenizer.from_pretrained(model_name)
    AutoModelForCausalLM.from_pretrained(model_name)
    print("Model cache refreshed.")
</code>
===== Why Modal for AI Agents =====
Modal has become popular for agent deployment for several reasons:
* **No DevOps Required** -- Python-first, decorator-based approach eliminates infrastructure complexity
* **Sub-Second Cold Starts** -- Critical for responsive agent interactions; Rust runtime minimizes startup latency
* **Cost Efficiency** -- Per-second billing + scale-to-zero means no paying for idle GPUs between agent tasks
* **Large Resource Limits** -- 64 CPUs, 336 GB RAM, 8 H100s per container handles the most demanding models
* **WebSocket Support** -- Real-time bidirectional messaging for conversational agents
* **Rapid Iteration** -- Image caching and instant deploys enable fast development cycles
===== Architecture Diagram =====
<code>
graph TD
    A["Your Code (@app.function / @modal.asgi_app / modal.Cron)"] --> B["Modal Platform"]
    B --> C["Image Builder (snapshot + cache)"]
    C --> D["Rust Container Runtime (gVisor)"]
    D --> E["GPU Pool (T4 / A10G / A100 / H100)"]
    B --> F["Ingress / Load Balancer (HTTP + WebSocket)"]
    F --> D
    D --> G["Volumes (persistent storage)"]
    D --> H["Secrets Manager"]
</code>
===== References =====
* [[https://modal.com/|Modal Website]]
* [[https://modal.com/docs/guide|Modal Documentation]]
* [[https://modal.com/blog/serverless-http|Modal: Serverless HTTP Architecture]]
* [[https://modal.com/blog/serverless-gpu-article|Modal: Serverless GPU Guide]]
===== See Also =====
* [[e2b|E2B]] -- Sandboxed code execution for AI agents
* [[autogen_studio|AutoGen Studio]] -- Visual multi-agent workflow builder
* [[composio|Composio]] -- Tool integration platform for agents
* [[browser_use|Browser-Use]] -- AI browser automation