====== Modal Compute ======

**Modal** is a serverless compute platform purpose-built for AI and ML workloads. It lets developers run GPU-accelerated code, deploy web endpoints, and schedule batch jobs simply by decorating Python functions -- no Docker, Kubernetes, or cloud configuration required. With sub-second cold starts, per-second billing, and access to GPUs ranging from the T4 to the H100, Modal has become a popular platform for deploying AI agents, running inference, fine-tuning models, and executing compute-intensive agent tasks.

===== Architecture =====

Modal's architecture is optimized for compute-intensive, bursty workloads:

  * **Custom Container Runtime** -- Written in Rust for performance and based on gVisor (not Kubernetes). Achieves sub-second cold starts through aggressive container snapshotting.
  * **Multi-Cloud GPU Pool** -- Aggregates GPU capacity across multiple cloud providers (including Oracle Cloud Infrastructure), dynamically routing workloads to optimal locations based on availability and cost.
  * **Ingress Layer** -- A TCP network load balancer, a Caddy reverse proxy, and modal-http routing deliver requests to serverless function invocations. Supports both HTTP and full WebSocket connections.
  * **Scale-to-Zero** -- Containers automatically spin down when idle and scale up on demand, so you pay only for actual compute time.
  * **Image Caching** -- Container images are cached and snapshotted for rapid iteration. Dependency installation happens at build time, not runtime.
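The economics of scale-to-zero plus per-second billing can be made concrete with a rough cost sketch. The A100 per-second rate below is Modal's listed price (see the pricing table further down); the $2.50/hour always-on instance is a hypothetical comparison figure, not a quoted price.

```python
# Rough cost sketch: scale-to-zero per-second billing vs. an always-on GPU.
# The per-second A100 rate is Modal's listed price; the reserved-instance
# rate is a hypothetical figure for comparison only.

MODAL_A100_PER_SEC = 0.000394    # $/sec, A100 40 GB (per-second billing)
RESERVED_PER_HOUR = 2.50         # $/hour, hypothetical always-on instance

def serverless_daily_cost(busy_seconds: float) -> float:
    """Scale-to-zero: you pay only for seconds a container is actually up."""
    return busy_seconds * MODAL_A100_PER_SEC

def reserved_daily_cost() -> float:
    """An always-on instance bills for all 24 hours regardless of load."""
    return 24 * RESERVED_PER_HOUR

# A bursty agent workload: 2,000 one-second inference calls per day
# uses ~2,000 billable seconds instead of 86,400.
print(f"serverless: ${serverless_daily_cost(2000):.2f}/day")  # $0.79/day
print(f"always-on:  ${reserved_daily_cost():.2f}/day")        # $60.00/day
```

For bursty agent traffic the gap is dramatic; the break-even point is only reached when the GPU is busy most of the day.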
===== How It Works =====

Modal's core abstraction is the **decorated Python function**:

  - Write a regular Python function
  - Add an ''@app.function()'' decorator specifying compute requirements (GPU type, memory, image)
  - Modal builds a container image, snapshots it, and stores it
  - On invocation, Modal provisions a container with the specified GPU in under a second
  - The function executes, returns results, and the container tears down
  - You pay per-second for the exact compute time used

No YAML, no Dockerfile, no cloud console -- everything is defined in Python.

===== GPU Types and Pricing =====

Modal provides on-demand access to a range of NVIDIA GPUs:

^ GPU ^ VRAM ^ Approx. cost/sec ^ Use case ^
| T4 | 16 GB | $0.000164 | Light inference, development |
| L4 | 24 GB | $0.000222 | Efficient inference |
| A10G | 24 GB | $0.000306 | General inference, fine-tuning |
| L40S | 48 GB | $0.000389 | Large model inference |
| A100 40 GB | 40 GB | $0.000394 | Training, large inference |
| A100 80 GB | 80 GB | $0.000463 | Large-scale training |
| H100 | 80 GB | $0.000833 | Maximum performance |

Containers support up to **64 CPUs**, **336 GB of RAM**, and **8 GPUs** per instance. The free tier includes $30 in credits.
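The table above can double as a quick GPU-selection heuristic: pick the cheapest GPU whose VRAM fits the model. A minimal sketch of that logic follows; the catalog simply restates the table, and ''cheapest_gpu'' / ''job_cost'' are illustrative helper names, not Modal APIs.

```python
# Modal GPU catalog restated from the table above: (name, VRAM GB, $/sec).
GPUS = [
    ("T4", 16, 0.000164),
    ("L4", 24, 0.000222),
    ("A10G", 24, 0.000306),
    ("L40S", 48, 0.000389),
    ("A100-40GB", 40, 0.000394),
    ("A100-80GB", 80, 0.000463),
    ("H100", 80, 0.000833),
]

def cheapest_gpu(vram_needed_gb: float):
    """Return (name, vram, rate) of the cheapest GPU with enough VRAM."""
    candidates = [g for g in GPUS if g[1] >= vram_needed_gb]
    if not candidates:
        return None  # nothing in the catalog fits
    return min(candidates, key=lambda g: g[2])

def job_cost(gpu_name: str, seconds: float) -> float:
    """Per-second billing: cost of holding one GPU for `seconds`."""
    rate = next(r for name, _, r in GPUS if name == gpu_name)
    return rate * seconds

print(cheapest_gpu(20))          # ('L4', 24, 0.000222)
print(job_cost("H100", 3600))    # 2.9988 -- about $3/hour
```

For example, an 8B-parameter model in fp16 needs roughly 16 GB for weights alone, so with headroom ''cheapest_gpu(20)'' lands on the L4.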
===== Key Features =====

== Web Endpoints ==

Expose any function as an HTTP or WebSocket endpoint with zero-downtime updates:

  * ''@modal.asgi_app()'' for FastAPI/Starlette ASGI apps
  * ''@modal.web_endpoint()'' for simple request handlers
  * Full WebSocket support for real-time bidirectional messaging
  * Automatic TLS termination and domain routing

== Scheduled Jobs (Cron) ==

Run functions on a schedule:

  * ''@app.function(schedule=modal.Cron("0 */6 * * *"))'' for cron expressions (here: every six hours)
  * ''@app.function(schedule=modal.Period(hours=1))'' for interval-based scheduling
  * Ideal for data pipelines, model retraining, and periodic agent tasks

== Volumes ==

Persistent, distributed file storage:

  * ''modal.Volume'' creates shared filesystems accessible from any function
  * Optimized for large model weights and datasets
  * Read/write access from multiple concurrent workers

== Secrets Management ==

  * ''modal.Secret.from_name("my-secret")'' for secure credential access
  * Environment variable injection into function containers
  * Integrates with external secret stores

===== Code Example =====

<code python>
import modal

app = modal.App("agent-inference")

# Define the container image with dependencies (installed at build time)
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch", "transformers", "accelerate"
)

# GPU-accelerated inference function
@app.function(gpu="A100", image=image, timeout=300)
def generate_text(prompt: str, max_tokens: int = 200) -> str:
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    model_name = "meta-llama/Llama-3.1-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.float16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=max_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Web endpoint for the inference API
@app.function(image=image)
@modal.asgi_app()
def web():
    from fastapi import FastAPI

    api = FastAPI()

    @api.post("/generate")
    async def generate(prompt: str, max_tokens: int = 200):
        result = generate_text.remote(prompt, max_tokens)
        return {"response": result}

    return api

# Scheduled job: refresh the model cache daily at 02:00
@app.function(schedule=modal.Cron("0 2 * * *"), image=image)
def daily_cache_refresh():
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-3.1-8B-Instruct"
    AutoTokenizer.from_pretrained(model_name)
    AutoModelForCausalLM.from_pretrained(model_name)
    print("Model cache refreshed.")
</code>

===== Why Modal for AI Agents =====

Modal has become popular for agent deployment for several reasons:

  * **No DevOps Required** -- The Python-first, decorator-based approach eliminates infrastructure complexity
  * **Sub-Second Cold Starts** -- Critical for responsive agent interactions; the Rust runtime minimizes startup latency
  * **Cost Efficiency** -- Per-second billing plus scale-to-zero means you never pay for idle GPUs between agent tasks
  * **Large Resource Limits** -- 64 CPUs, 336 GB of RAM, and 8 H100s per container handle the most demanding models
  * **WebSocket Support** -- Real-time bidirectional messaging for conversational agents
  * **Rapid Iteration** -- Image caching and instant deploys enable fast development cycles

===== Architecture Diagram =====

<code>
graph TD
    A["Your Code (@app.func / @modal.asgi / @Cron)"] --> B["Modal Platform"]
    B --> C["Image Builder (snapshot + cache)"]
    C --> D["Rust Container Runtime (gVisor)"]
    D --> E["GPU Pool (T4 / A10G / A100 / H100)"]
    B --> F["Ingress / Load Balancer (HTTP + WebSocket)"]
    F --> D
    D --> G["Volumes (persistent storage)"]
    D --> H["Secrets Manager"]
</code>

===== References =====

  * [[https://modal.com/|Modal Website]]
  * [[https://modal.com/docs/guide|Modal Documentation]]
  * [[https://modal.com/blog/serverless-http|Modal: Serverless HTTP Architecture]]
  * [[https://modal.com/blog/serverless-gpu-article|Modal: Serverless GPU Guide]]

===== See Also =====

  * [[e2b|E2B]] -- Sandboxed code execution for AI agents
  * [[autogen_studio|AutoGen Studio]] -- Visual multi-agent workflow builder
  * [[composio|Composio]] -- Tool integration platform for agents
  * [[browser_use|Browser-Use]] -- AI browser automation