This is an old revision of the document!

Modal Compute

Modal is a serverless compute platform purpose-built for AI and ML workloads. It allows developers to run GPU-accelerated code, deploy web endpoints, and schedule batch jobs by simply decorating Python functions – no Docker, Kubernetes, or cloud configuration required. With sub-second cold starts, per-second billing, and access to GPUs from A10G to H100, Modal has become the platform of choice for deploying AI agents, running inference, fine-tuning models, and executing compute-intensive agent tasks.

Architecture

Modal's architecture is optimized for compute-intensive, bursty workloads:

Custom Container Runtime – Written in Rust for performance, based on gVisor (not Kubernetes). Achieves sub-second cold starts through aggressive container snapshotting.
Multi-Cloud GPU Pool – Aggregates GPU capacity across multiple cloud providers (including Oracle Cloud Infrastructure), dynamically routing workloads to optimal locations based on availability and cost.
Ingress Layer – TCP network load balancer, Caddy reverse proxy, and modal-http routing to serverless function invocation. Supports both HTTP and full WebSocket connections.
Scale-to-Zero – Containers automatically spin down when idle and scale up on demand, ensuring you only pay for actual compute time.
Image Caching – Container images are cached and snapshotted for rapid iteration. Dependency installation happens at build time, not runtime.

How It Works

Modal's core abstraction is the decorated Python function:

Write a regular Python function
Add a @app.function() decorator specifying compute requirements (GPU type, memory, image)
Modal builds a container image, snapshots it, and stores it
On invocation, Modal provisions a container with the specified GPU in under a second
The function executes, returns results, and the container tears down
You pay per-second for the exact compute time used

No YAML, no Dockerfile, no cloud console – everything is defined in Python.

GPU Types and Pricing

Modal provides on-demand access to a range of NVIDIA GPUs:

GPU	VRAM	Approx. Cost/sec	Use Case
T4	16 GB	$0.000164	Light inference, development
L4	24 GB	$0.000222	Efficient inference
A10G	24 GB	$0.000306	General inference, fine-tuning
L40S	48 GB	$0.000389	Large model inference
A100 40GB	40 GB	$0.000394	Training, large inference
A100 80GB	80 GB	$0.000463	Large-scale training
H100	80 GB	$0.000833	Maximum performance

Containers support up to 64 CPUs, 336 GB RAM, and 8 GPUs per instance. Free tier includes $30 in credits.

Key Features

Web Endpoints

Expose any function as an HTTP or WebSocket endpoint with zero-downtime updates:

@modal.asgi_app() for FastAPI/Starlette ASGI apps
@modal.web_endpoint() for simple request handlers
Full WebSocket support for real-time bidirectional messaging
Automatic TLS termination and domain routing

Scheduled Jobs (Cron)

Run functions on a schedule:

@app.function(schedule=modal.Cron(“0 */6 * * *”)) for cron expressions
@app.function(schedule=modal.Period(hours=1)) for interval-based scheduling
Ideal for data pipelines, model retraining, and periodic agent tasks

Volumes

Persistent, distributed file storage:

modal.Volume creates shared filesystems accessible from any function
Optimized for large model weights and datasets
Read/write access from multiple concurrent workers

Secrets Management

modal.Secret.from_name(“my-secret”) for secure credential access
Environment variable injection into function containers
Integrates with external secret stores

Code Example

import modal
 
app = modal.App("agent-inference")
 
# Define container image with dependencies
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch", "transformers", "accelerate"
)
 
# GPU-accelerated inference function
@app.function(gpu="A100", image=image, timeout=300)
def generate_text(prompt: str, max_tokens: int = 200) -> str:
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch
 
    model_name = "meta-llama/Llama-3.1-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )
 
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=max_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
 
 
# Web endpoint for the inference API
@app.function(image=image)
@modal.asgi_app()
def web():
    from fastapi import FastAPI
 
    api = FastAPI()
 
    @api.post("/generate")
    async def generate(prompt: str, max_tokens: int = 200):
        result = generate_text.remote(prompt, max_tokens)
        return {"response": result}
 
    return api
 
 
# Scheduled job: refresh model cache daily
@app.function(schedule=modal.Cron("0 2 * * *"), image=image)
def daily_cache_refresh():
    from transformers import AutoModelForCausalLM, AutoTokenizer
 
    model_name = "meta-llama/Llama-3.1-8B-Instruct"
    AutoTokenizer.from_pretrained(model_name)
    AutoModelForCausalLM.from_pretrained(model_name)
    print("Model cache refreshed.")

Why Modal for AI Agents

Modal has become popular for agent deployment for several reasons:

No DevOps Required – Python-first, decorator-based approach eliminates infrastructure complexity
Sub-Second Cold Starts – Critical for responsive agent interactions; Rust runtime minimizes startup latency
Cost Efficiency – Per-second billing + scale-to-zero means no paying for idle GPUs between agent tasks
Large Resource Limits – 64 CPUs, 336 GB RAM, 8 H100s per container handles the most demanding models
WebSocket Support – Real-time bidirectional messaging for conversational agents
Rapid Iteration – Image caching and instant deploys enable fast development cycles

Architecture Diagram

  +---------------+       +---------------------------+
  |  Your Code    |       |    Modal Platform          |
  |               |       |                            |
  | @app.func()   |------>|  +---------------------+  |
  | @modal.asgi   |       |  | Image Builder        |  |
  | @Cron(...)    |       |  | (snapshot + cache)   |  |
  +---------------+       |  +---------+-----------+  |
                          |            |               |
                          |  +---------v-----------+   |
                          |  | Rust Container       |   |
                          |  | Runtime (gVisor)     |   |
                          |  +---------+-----------+   |
                          |            |                |
                          |  +---------v-----------+   |
                          |  |  GPU Pool            |   |
                          |  |  T4|A10G|A100|H100   |   |
                          |  |  (multi-cloud)       |   |
                          |  +---------------------+   |
                          |                            |
                          |  +---------------------+   |
                          |  |  Ingress / LB        |   |
                          |  |  HTTP + WebSocket    |   |
                          |  +---------------------+   |
                          +----------------------------+

AI Agent Knowledge Base

Sidebar

Table of Contents

Modal Compute

Architecture

How It Works

GPU Types and Pricing

Key Features

Web Endpoints

Scheduled Jobs (Cron)

Volumes

Secrets Management

Code Example

Why Modal for AI Agents

Architecture Diagram

References

See Also

AI Agent Knowledge Base

User Tools

Site Tools

Sidebar

Table of Contents

Modal Compute

Architecture

How It Works

GPU Types and Pricing

Key Features

Web Endpoints

Scheduled Jobs (Cron)

Volumes

Secrets Management

Code Example

Why Modal for AI Agents

Architecture Diagram

References

See Also

Page Tools