Modal Compute

Modal is a serverless compute platform purpose-built for AI and ML workloads. It allows developers to run GPU-accelerated code, deploy web endpoints, and schedule batch jobs by simply decorating Python functions – no Docker, Kubernetes, or cloud configuration required. With sub-second cold starts, per-second billing, and access to GPUs from A10G to H100, Modal has become the platform of choice for deploying AI agents, running inference, fine-tuning models, and executing compute-intensive agent tasks.

Architecture

Modal's architecture is optimized for compute-intensive, bursty workloads. Its main components are an image builder that snapshots and caches container images, a custom Rust-based container runtime sandboxed with gVisor, a shared pool of NVIDIA GPUs, an ingress layer that load-balances HTTP and WebSocket traffic, persistent Volumes, and a secrets manager.

How It Works

Modal's core abstraction is the decorated Python function:

  1. Write a regular Python function
  2. Add a @app.function() decorator specifying compute requirements (GPU type, memory, image)
  3. Modal builds a container image, snapshots it, and stores it
  4. On invocation, Modal provisions a container with the specified GPU in under a second
  5. The function executes, returns results, and the container tears down
  6. You pay per-second for the exact compute time used

No YAML, no Dockerfile, no cloud console – everything is defined in Python.
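The six steps above can be sketched as a complete, minimal app. This is an illustrative example (the app name and function body are made up); running it requires the modal package and a Modal account:

```python
import modal

app = modal.App("example-app")  # app name is illustrative

# Image and compute requirements are declared in Python;
# Modal builds and caches the image on first run.
image = modal.Image.debian_slim().pip_install("numpy")

@app.function(gpu="T4", image=image)
def square(x: int) -> int:
    return x * x

@app.local_entrypoint()
def main():
    # .remote() runs the function in a Modal container;
    # you are billed per second only while it executes.
    print(square.remote(4))
```

Invoked with modal run app.py, Modal provisions the container, runs square, returns the result, and tears the container down.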

GPU Types and Pricing

Modal provides on-demand access to a range of NVIDIA GPUs:

GPU         VRAM    Approx. Cost/sec    Use Case
T4          16 GB   $0.000164           Light inference, development
L4          24 GB   $0.000222           Efficient inference
A10G        24 GB   $0.000306           General inference, fine-tuning
L40S        48 GB   $0.000389           Large model inference
A100 40GB   40 GB   $0.000394           Training, large inference
A100 80GB   80 GB   $0.000463           Large-scale training
H100        80 GB   $0.000833           Maximum performance

Containers support up to 64 CPUs, 336 GB RAM, and 8 GPUs per instance. Free tier includes $30 in credits.
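Per-second billing makes cost estimates straightforward. A quick sketch using the table's per-second rates (rates are approximate and may change):

```python
# Approximate per-second GPU rates from the table above (USD)
RATES = {
    "T4": 0.000164,
    "A10G": 0.000306,
    "A100-40GB": 0.000394,
    "H100": 0.000833,
}

def job_cost(gpu: str, seconds: float) -> float:
    """Cost of running one container with the given GPU for `seconds`."""
    return RATES[gpu] * seconds

# A 10-minute fine-tuning run on an A100 40GB:
cost = job_cost("A100-40GB", 10 * 60)
print(f"${cost:.4f}")  # → $0.2364
```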

Key Features

Web Endpoints

Expose any function as an HTTP or WebSocket endpoint with zero-downtime updates:
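A minimal sketch of a single-function endpoint (the app name and handler are illustrative; the decorator shown here has been renamed to fastapi_endpoint in newer Modal releases, so check the current docs):

```python
import modal

app = modal.App("endpoint-demo")  # name is illustrative

@app.function()
@modal.web_endpoint(method="POST")
def echo(data: dict) -> dict:
    # Modal routes HTTPS traffic to on-demand containers;
    # redeploys swap traffic over with zero downtime.
    return {"received": data}
```

modal deploy gives the endpoint a stable URL; modal serve exposes a temporary one for development.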

Scheduled Jobs (Cron)

Run functions on a schedule:
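A sketch of both scheduling styles (function bodies are illustrative):

```python
import modal

app = modal.App("scheduler-demo")  # name is illustrative

# Cron syntax: minute hour day-of-month month day-of-week (UTC)
@app.function(schedule=modal.Cron("0 9 * * 1"))
def weekly_report():
    print("Runs every Monday at 09:00 UTC.")

# modal.Period expresses fixed intervals instead of cron expressions
@app.function(schedule=modal.Period(hours=1))
def hourly_check():
    print("Runs once an hour.")
```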

Volumes

Persistent, distributed file storage:
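A sketch of attaching a named Volume to a function (volume name, mount path, and payload are illustrative):

```python
import modal

app = modal.App("volume-demo")  # name is illustrative

# Create (or look up) a named, persistent volume
vol = modal.Volume.from_name("model-cache", create_if_missing=True)

@app.function(volumes={"/cache": vol})
def save_artifact():
    with open("/cache/weights.bin", "wb") as f:
        f.write(b"\x00" * 1024)  # placeholder payload
    vol.commit()  # persist writes so other containers can see them
```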

Secrets Management
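Secrets inject API keys and credentials into functions as environment variables. A minimal sketch, assuming a secret named "huggingface" with an HF_TOKEN key was created in the Modal dashboard:

```python
import modal

app = modal.App("secret-demo")  # name is illustrative

@app.function(secrets=[modal.Secret.from_name("huggingface")])
def whoami():
    import os
    # The secret's keys appear as plain environment variables
    token = os.environ["HF_TOKEN"]
    print("token loaded:", bool(token))
```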

Code Example

import modal
 
app = modal.App("agent-inference")
 
# Define container image with dependencies
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch", "transformers", "accelerate"
)
 
# GPU-accelerated inference function
@app.function(gpu="A100", image=image, timeout=300)
def generate_text(prompt: str, max_tokens: int = 200) -> str:
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch
 
    # Note: this Hugging Face model is gated (access must be requested) and
    # is reloaded on every call, which is slow; production code would cache it.
    model_name = "meta-llama/Llama-3.1-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )
 
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=max_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
 
 
# Web endpoint for the inference API
@app.function(image=image)
@modal.asgi_app()
def web():
    from fastapi import FastAPI
 
    api = FastAPI()
 
    @api.post("/generate")
    async def generate(prompt: str, max_tokens: int = 200):
        # .remote.aio() awaits the Modal call without blocking the event loop
        result = await generate_text.remote.aio(prompt, max_tokens)
        return {"response": result}
 
    return api
 
 
# Scheduled job: refresh model cache daily
@app.function(schedule=modal.Cron("0 2 * * *"), image=image)
def daily_cache_refresh():
    # Without an attached modal.Volume, downloads land in the container's
    # ephemeral Hugging Face cache and do not persist across runs.
    from transformers import AutoModelForCausalLM, AutoTokenizer
 
    model_name = "meta-llama/Llama-3.1-8B-Instruct"
    AutoTokenizer.from_pretrained(model_name)
    AutoModelForCausalLM.from_pretrained(model_name)
    print("Model cache refreshed.")
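The app above is driven entirely from the Modal CLI (a sketch; app.py is the assumed filename):

```shell
# One-off execution of a single function from the local machine
modal run app.py::generate_text --prompt "Hello"

# Deploy the app: the web endpoint gets a stable URL and
# the cron schedule starts firing
modal deploy app.py
```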

Why Modal for AI Agents

Modal has become popular for agent deployment for several reasons: sub-second cold starts match the bursty, on-demand nature of agent workloads; per-second billing means idle agents cost nothing; compute requirements live in Python decorators rather than Docker or Kubernetes configuration; and on-demand GPUs from T4 to H100 cover everything from light inference to large-scale fine-tuning.

Architecture Diagram

graph TD
    A["Your Code (@app.func / @modal.asgi / @Cron)"] --> B["Modal Platform"]
    B --> C["Image Builder (snapshot + cache)"]
    C --> D["Rust Container Runtime (gVisor)"]
    D --> E["GPU Pool (T4 / A10G / A100 / H100)"]
    B --> F["Ingress / Load Balancer (HTTP + WebSocket)"]
    F --> D
    D --> G["Volumes (persistent storage)"]
    D --> H["Secrets Manager"]
