Modal Compute

Modal is a serverless compute platform purpose-built for AI and ML workloads. It allows developers to run GPU-accelerated code, deploy web endpoints, and schedule batch jobs by simply decorating Python functions – no Docker, Kubernetes, or cloud configuration required. With sub-second cold starts, per-second billing, and access to GPUs from A10G to H100, Modal has become the platform of choice for deploying AI agents, running inference, fine-tuning models, and executing compute-intensive agent tasks.

Architecture

Modal's architecture is optimized for compute-intensive, bursty workloads. Its main components are an image builder that snapshots and caches container images, a custom Rust-based container runtime sandboxed with gVisor, a shared pool of NVIDIA GPUs, an ingress layer that load-balances HTTP and WebSocket traffic, persistent Volumes, and a secrets manager.

How It Works

Modal's core abstraction is the decorated Python function:

  1. Write a regular Python function
  2. Add a @app.function() decorator specifying compute requirements (GPU type, memory, image)
  3. Modal builds a container image, snapshots it, and stores it
  4. On invocation, Modal provisions a container with the specified GPU in under a second
  5. The function executes, returns results, and the container tears down
  6. You pay per-second for the exact compute time used

No YAML, no Dockerfile, no cloud console – everything is defined in Python.
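The six steps above can be sketched as a complete, minimal app. This is an illustrative example (the app name and function body are made up); running it requires the modal package and a Modal account:

```python
import modal

app = modal.App("example-app")  # app name is illustrative

# Image and compute requirements are declared in Python;
# Modal builds and caches the image on first run.
image = modal.Image.debian_slim().pip_install("numpy")

@app.function(gpu="T4", image=image)
def square(x: int) -> int:
    return x * x

@app.local_entrypoint()
def main():
    # .remote() runs the function in a Modal container;
    # you are billed per second only while it executes.
    print(square.remote(4))
```

Invoked with modal run app.py, Modal provisions the container, runs square, returns the result, and tears the container down.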

GPU Types and Pricing

Modal provides on-demand access to a range of NVIDIA GPUs:

GPU         VRAM    Approx. Cost/sec    Use Case
T4          16 GB   $0.000164           Light inference, development
L4          24 GB   $0.000222           Efficient inference
A10G        24 GB   $0.000306           General inference, fine-tuning
L40S        48 GB   $0.000389           Large model inference
A100 40GB   40 GB   $0.000394           Training, large inference
A100 80GB   80 GB   $0.000463           Large-scale training
H100        80 GB   $0.000833           Maximum performance

Containers support up to 64 CPUs, 336 GB RAM, and 8 GPUs per instance. Free tier includes $30 in credits.
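Per-second billing makes cost estimates straightforward. A quick sketch using the table's per-second rates (rates are approximate and may change):

```python
# Approximate per-second GPU rates from the table above (USD)
RATES = {
    "T4": 0.000164,
    "A10G": 0.000306,
    "A100-40GB": 0.000394,
    "H100": 0.000833,
}

def job_cost(gpu: str, seconds: float) -> float:
    """Cost of running one container with the given GPU for `seconds`."""
    return RATES[gpu] * seconds

# A 10-minute fine-tuning run on an A100 40GB:
cost = job_cost("A100-40GB", 10 * 60)
print(f"${cost:.4f}")  # → $0.2364
```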

Key Features

Web Endpoints

Expose any function as an HTTP or WebSocket endpoint with zero-downtime updates:
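A minimal sketch of a single-function endpoint (the app name and handler are illustrative; the decorator shown here has been renamed to fastapi_endpoint in newer Modal releases, so check the current docs):

```python
import modal

app = modal.App("endpoint-demo")  # name is illustrative

@app.function()
@modal.web_endpoint(method="POST")
def echo(data: dict) -> dict:
    # Modal routes HTTPS traffic to on-demand containers;
    # redeploys swap traffic over with zero downtime.
    return {"received": data}
```

modal deploy gives the endpoint a stable URL; modal serve exposes a temporary one for development.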

Scheduled Jobs (Cron)

Run functions on a schedule:
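A sketch of both scheduling styles (function bodies are illustrative):

```python
import modal

app = modal.App("scheduler-demo")  # name is illustrative

# Cron syntax: minute hour day-of-month month day-of-week (UTC)
@app.function(schedule=modal.Cron("0 9 * * 1"))
def weekly_report():
    print("Runs every Monday at 09:00 UTC.")

# modal.Period expresses fixed intervals instead of cron expressions
@app.function(schedule=modal.Period(hours=1))
def hourly_check():
    print("Runs once an hour.")
```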

Volumes

Persistent, distributed file storage:
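A sketch of attaching a named Volume to a function (volume name, mount path, and payload are illustrative):

```python
import modal

app = modal.App("volume-demo")  # name is illustrative

# Create (or look up) a named, persistent volume
vol = modal.Volume.from_name("model-cache", create_if_missing=True)

@app.function(volumes={"/cache": vol})
def save_artifact():
    with open("/cache/weights.bin", "wb") as f:
        f.write(b"\x00" * 1024)  # placeholder payload
    vol.commit()  # persist writes so other containers can see them
```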

Secrets Management
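Secrets inject API keys and credentials into functions as environment variables. A minimal sketch, assuming a secret named "huggingface" with an HF_TOKEN key was created in the Modal dashboard:

```python
import modal

app = modal.App("secret-demo")  # name is illustrative

@app.function(secrets=[modal.Secret.from_name("huggingface")])
def whoami():
    import os
    # The secret's keys appear as plain environment variables
    token = os.environ["HF_TOKEN"]
    print("token loaded:", bool(token))
```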

Code Example

import modal
 
app = modal.App("agent-inference")
 
# Define container image with dependencies
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch", "transformers", "accelerate"
)
 
# GPU-accelerated inference function
@app.function(gpu="A100", image=image, timeout=300)
def generate_text(prompt: str, max_tokens: int = 200) -> str:
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch
 
    # Note: this Hugging Face model is gated (access must be requested) and
    # is reloaded on every call, which is slow; production code would cache it.
    model_name = "meta-llama/Llama-3.1-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )
 
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=max_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
 
 
# Web endpoint for the inference API
@app.function(image=image)
@modal.asgi_app()
def web():
    from fastapi import FastAPI
 
    api = FastAPI()
 
    @api.post("/generate")
    async def generate(prompt: str, max_tokens: int = 200):
        # .remote.aio() awaits the Modal call without blocking the event loop
        result = await generate_text.remote.aio(prompt, max_tokens)
        return {"response": result}
 
    return api
 
 
# Scheduled job: refresh model cache daily
@app.function(schedule=modal.Cron("0 2 * * *"), image=image)
def daily_cache_refresh():
    # Without an attached modal.Volume, downloads land in the container's
    # ephemeral Hugging Face cache and do not persist across runs.
    from transformers import AutoModelForCausalLM, AutoTokenizer
 
    model_name = "meta-llama/Llama-3.1-8B-Instruct"
    AutoTokenizer.from_pretrained(model_name)
    AutoModelForCausalLM.from_pretrained(model_name)
    print("Model cache refreshed.")
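The app above is driven entirely from the Modal CLI (a sketch; app.py is the assumed filename):

```shell
# One-off execution of a single function from the local machine
modal run app.py::generate_text --prompt "Hello"

# Deploy the app: the web endpoint gets a stable URL and
# the cron schedule starts firing
modal deploy app.py
```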

Why Modal for AI Agents

Modal has become popular for agent deployment for several reasons: sub-second cold starts match the bursty, on-demand nature of agent workloads; per-second billing means idle agents cost nothing; compute requirements live in Python decorators rather than Docker or Kubernetes configuration; and on-demand GPUs from T4 to H100 cover everything from light inference to large-scale fine-tuning.

Architecture Diagram

graph TD
    A["Your Code (@app.func / @modal.asgi / @Cron)"] --> B["Modal Platform"]
    B --> C["Image Builder (snapshot + cache)"]
    C --> D["Rust Container Runtime (gVisor)"]
    D --> E["GPU Pool (T4 / A10G / A100 / H100)"]
    B --> F["Ingress / Load Balancer (HTTP + WebSocket)"]
    F --> D
    D --> G["Volumes (persistent storage)"]
    D --> H["Secrets Manager"]
