====== Modal Compute ======
**Modal** is a serverless compute platform purpose-built for AI and ML workloads. It lets developers run GPU-accelerated code, deploy web endpoints, and schedule batch jobs simply by decorating Python functions -- no Docker, Kubernetes, or cloud configuration required. With sub-second cold starts, per-second billing, and access to GPUs from T4 to H100, Modal has become a popular platform for deploying AI agents, running inference, fine-tuning models, and executing compute-intensive agent tasks.
===== Architecture =====
Modal's architecture is optimized for compute-intensive, bursty workloads:
* **Custom Container Runtime** -- Written in Rust for performance, based on gVisor (not Kubernetes). Achieves sub-second cold starts through aggressive container snapshotting.
* **Multi-Cloud GPU Pool** -- Aggregates GPU capacity across multiple cloud providers (including Oracle Cloud Infrastructure), dynamically routing workloads to optimal locations based on availability and cost.
* **Ingress Layer** -- TCP network load balancer, Caddy reverse proxy, and modal-http routing to serverless function invocation. Supports both HTTP and full WebSocket connections.
* **Scale-to-Zero** -- Containers automatically spin down when idle and scale up on demand, ensuring you only pay for actual compute time.
* **Image Caching** -- Container images are cached and snapshotted for rapid iteration. Dependency installation happens at build time, not runtime.
===== How It Works =====
Modal's core abstraction is the **decorated Python function**:
- Write a regular Python function
- Add a ''@app.function()'' decorator specifying compute requirements (GPU type, memory, image)
- Modal builds a container image, snapshots it, and stores it
- On invocation, Modal provisions a container with the specified GPU in under a second
- The function executes, returns results, and the container tears down
- You pay per-second for the exact compute time used
No YAML, no Dockerfile, no cloud console -- everything is defined in Python.
===== GPU Types and Pricing =====
Modal provides on-demand access to a range of NVIDIA GPUs:
^ GPU ^ VRAM ^ Approx. Cost/sec ^ Use Case ^
| T4 | 16 GB | $0.000164 | Light inference, development |
| L4 | 24 GB | $0.000222 | Efficient inference |
| A10G | 24 GB | $0.000306 | General inference, fine-tuning |
| L40S | 48 GB | $0.000389 | Large model inference |
| A100 40GB | 40 GB | $0.000394 | Training, large inference |
| A100 80GB | 80 GB | $0.000463 | Large-scale training |
| H100 | 80 GB | $0.000833 | Maximum performance |
Containers support up to **64 CPUs**, **336 GB RAM**, and **8 GPUs** per instance. Free tier includes $30 in credits.
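Per-second billing makes cost estimation simple multiplication. Using the approximate rates from the table above, a hypothetical 10-minute run on an A100 40GB works out as follows:

```python
# Approximate per-second USD rates, copied from the table above
RATES = {
    "T4": 0.000164,
    "A10G": 0.000306,
    "A100-40GB": 0.000394,
    "H100": 0.000833,
}

def estimate_cost(gpu: str, seconds: float) -> float:
    """Estimate the cost of a run of the given duration on the given GPU."""
    return RATES[gpu] * seconds

# A 10-minute (600 s) fine-tuning run on an A100 40GB
cost = estimate_cost("A100-40GB", 600)
print(f"${cost:.4f}")  # → $0.2364
```

With scale-to-zero, idle time between runs costs nothing, so the per-run figure is the whole bill.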
===== Key Features =====
== Web Endpoints ==
Expose any function as an HTTP or WebSocket endpoint with zero-downtime updates:
* ''@modal.asgi_app()'' for FastAPI/Starlette ASGI apps
* ''@modal.web_endpoint()'' for simple request handlers
* Full WebSocket support for real-time bidirectional messaging
* Automatic TLS termination and domain routing
== Scheduled Jobs (Cron) ==
Run functions on a schedule:
* ''@app.function(schedule=modal.Cron("0 */6 * * *"))'' for cron expressions
* ''@app.function(schedule=modal.Period(hours=1))'' for interval-based scheduling
* Ideal for data pipelines, model retraining, and periodic agent tasks
== Volumes ==
Persistent, distributed file storage:
* ''modal.Volume'' creates shared filesystems accessible from any function
* Optimized for large model weights and datasets
* Read/write access from multiple concurrent workers
== Secrets Management ==
* ''modal.Secret.from_name("my-secret")'' for secure credential access
* Environment variable injection into function containers
* Integrates with external secret stores
===== Code Example =====
<code python>
import modal

app = modal.App("agent-inference")

# Define container image with dependencies (fastapi is needed for the ASGI endpoint)
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch", "transformers", "accelerate", "fastapi[standard]"
)

# GPU-accelerated inference function
@app.function(gpu="A100", image=image, timeout=300)
def generate_text(prompt: str, max_tokens: int = 200) -> str:
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-3.1-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=max_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Web endpoint for the inference API
@app.function(image=image)
@modal.asgi_app()
def web():
    from fastapi import FastAPI

    api = FastAPI()

    @api.post("/generate")
    async def generate(prompt: str, max_tokens: int = 200):
        result = generate_text.remote(prompt, max_tokens)
        return {"response": result}

    return api

# Scheduled job: refresh the model cache daily at 02:00 UTC
@app.function(schedule=modal.Cron("0 2 * * *"), image=image)
def daily_cache_refresh():
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-3.1-8B-Instruct"
    AutoTokenizer.from_pretrained(model_name)
    AutoModelForCausalLM.from_pretrained(model_name)
    print("Model cache refreshed.")
</code>
===== Why Modal for AI Agents =====
Modal has become popular for agent deployment for several reasons:
* **No DevOps Required** -- Python-first, decorator-based approach eliminates infrastructure complexity
* **Sub-Second Cold Starts** -- Critical for responsive agent interactions; Rust runtime minimizes startup latency
* **Cost Efficiency** -- Per-second billing + scale-to-zero means no paying for idle GPUs between agent tasks
* **Large Resource Limits** -- 64 CPUs, 336 GB RAM, 8 H100s per container handles the most demanding models
* **WebSocket Support** -- Real-time bidirectional messaging for conversational agents
* **Rapid Iteration** -- Image caching and instant deploys enable fast development cycles
===== Architecture Diagram =====
<code>
graph TD
    A["Your Code (@app.function / @modal.asgi_app / modal.Cron)"] --> B["Modal Platform"]
    B --> C["Image Builder (snapshot + cache)"]
    C --> D["Rust Container Runtime (gVisor)"]
    D --> E["GPU Pool (T4 / A10G / A100 / H100)"]
    B --> F["Ingress / Load Balancer (HTTP + WebSocket)"]
    F --> D
    D --> G["Volumes (persistent storage)"]
    D --> H["Secrets Manager"]
</code>
===== References =====
* [[https://modal.com/|Modal Website]]
* [[https://modal.com/docs/guide|Modal Documentation]]
* [[https://modal.com/blog/serverless-http|Modal: Serverless HTTP Architecture]]
* [[https://modal.com/blog/serverless-gpu-article|Modal: Serverless GPU Guide]]
===== See Also =====
* [[e2b|E2B]] -- Sandboxed code execution for AI agents
* [[autogen_studio|AutoGen Studio]] -- Visual multi-agent workflow builder
* [[composio|Composio]] -- Tool integration platform for agents
* [[browser_use|Browser-Use]] -- AI browser automation