Scripts · Apr 8, 2026 · 2 min read

Modal — Serverless GPU Cloud for AI Workloads

Run GPU workloads in the cloud with Python decorators. Modal provides serverless A100/H100 GPUs for model inference, fine-tuning, and batch jobs with zero infrastructure.

AI · Open Source · Community
Quick Use

Use it first, then decide how deep to go

This block shows what to copy, install, and run first, for both the user and the agent.

Install and authenticate:

pip install modal
modal setup  # One-time auth

Save as my_app.py:

import modal

app = modal.App("my-ai-app")

@app.function(gpu="A100")
def run_inference(prompt: str) -> str:
    from transformers import pipeline
    pipe = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct", device="cuda")
    return pipe(prompt, max_new_tokens=256)[0]["generated_text"]

@app.local_entrypoint()
def main():
    result = run_inference.remote("Explain quantum computing")
    print(result)

Run on a cloud GPU:

modal run my_app.py

What is Modal?

Modal is a serverless GPU cloud platform where you define cloud functions with Python decorators. Add @app.function(gpu="A100") to any function and it runs on cloud GPUs — no Docker, no Kubernetes, no SSH. Modal handles container building, GPU provisioning, scaling, and shutdown automatically. Pay per second of compute.

Answer-Ready: Modal is a serverless GPU cloud for AI. Python decorators turn local functions into cloud GPU jobs. A100/H100 GPUs, auto-scaling, per-second billing. No Docker or K8s needed. Used for inference, fine-tuning, and batch processing. The simplest path from laptop to cloud GPU.

Best for: ML engineers needing cloud GPUs without infrastructure hassle. Works with: Any Python ML library, PyTorch, HuggingFace, vLLM. Setup time: Under 3 minutes.

Core Features

1. GPU Selection

@app.function(gpu="T4")       # Budget inference
@app.function(gpu="A10G")     # Mid-range
@app.function(gpu="A100")     # Standard training/inference
@app.function(gpu="H100")     # Maximum performance
@app.function(gpu="A100:4")   # Multi-GPU
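
The tiers above differ mainly in VRAM and price. As a rough rule of thumb you can size a GPU string from the model's parameter count; in this sketch the VRAM figures are NVIDIA's published specs and the 1.2x overhead factor is an assumption, not anything Modal provides:

```python
# Approximate VRAM per tier (NVIDIA specs, not from Modal's docs).
GPU_VRAM_GB = {"T4": 16, "A10G": 24, "A100": 40, "A100-80GB": 80, "H100": 80}

def pick_gpu(model_params_b: float, bytes_per_param: int = 2) -> str:
    """Pick the cheapest tier whose VRAM fits the model weights.

    Applies a ~1.2x overhead factor for activations/KV cache, which is
    a rough heuristic, not a Modal feature.
    """
    needed_gb = model_params_b * bytes_per_param * 1.2
    for gpu in ("T4", "A10G", "A100", "A100-80GB", "H100"):
        if GPU_VRAM_GB[gpu] >= needed_gb:
            return gpu
    raise ValueError("Too large for one GPU; consider gpu='A100:4'-style multi-GPU")

print(pick_gpu(8))  # Llama-3.1-8B in fp16 -> "A10G"
```

An 8B model in fp16 needs roughly 19 GB, so it skips the T4 and lands on an A10G.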

2. Container Definition (No Dockerfile)

image = (
    modal.Image.debian_slim()
    .pip_install("torch", "transformers", "accelerate")
    .apt_install("ffmpeg")
)

@app.function(image=image, gpu="A100")
def train():
    ...

3. Web Endpoints

@app.function(gpu="A100")
@modal.web_endpoint()
def generate(prompt: str):
    return {"text": run_model(prompt)}

# Deployed at: https://your-app--generate.modal.run

4. Scheduled Jobs

@app.function(schedule=modal.Cron("0 */6 * * *"))
def batch_process():
    # Runs every 6 hours
    ...
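
The Cron expression "0 */6 * * *" fires at minute 0 of hours 0, 6, 12, and 18 UTC. A standalone sketch of that schedule's arithmetic (plain Python, not part of Modal's API):

```python
from datetime import datetime, timedelta

def next_six_hourly_run(now: datetime) -> datetime:
    """Next fire time for the cron '0 */6 * * *' (minute 0 of hours 0, 6, 12, 18)."""
    # Snap down to the most recent 6-hour boundary, then step forward past `now`.
    candidate = now.replace(hour=(now.hour // 6) * 6, minute=0, second=0, microsecond=0)
    while candidate <= now:
        candidate += timedelta(hours=6)
    return candidate

print(next_six_hourly_run(datetime(2026, 4, 8, 7, 30)))  # 2026-04-08 12:00:00
```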

5. Volumes (Persistent Storage)

vol = modal.Volume.from_name("model-cache", create_if_missing=True)

@app.function(volumes={"/models": vol}, gpu="A100")
def inference():
    # Weights under /models persist across runs; load_model stands in for your own loader
    model = load_model("/models/llama-3.1")

Pricing

GPU          Price/hour   Best for
T4           $0.59        Light inference
A10G         $1.10        Medium workloads
A100 40GB    $3.72        Training/inference
A100 80GB    $4.58        Large models
H100         $6.98        Maximum speed

Per-second billing. No minimum.
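
With per-second billing, a job's cost is simply the hourly rate divided by 3600, times the runtime. A minimal estimator using the rates in the table above (HOURLY_RATE and job_cost are illustrative names, not Modal APIs):

```python
# Hourly rates from the pricing table above (USD).
HOURLY_RATE = {"T4": 0.59, "A10G": 1.10, "A100-40GB": 3.72, "A100-80GB": 4.58, "H100": 6.98}

def job_cost(gpu: str, seconds: float) -> float:
    """Cost of a job billed per second at the listed hourly rate."""
    return round(HOURLY_RATE[gpu] / 3600 * seconds, 4)

print(job_cost("A100-40GB", 90))  # 90 s on an A100 40GB is about $0.093
```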

Modal vs Alternatives

Feature        Modal              Replicate    RunPod            Lambda
Interface      Python decorators  API calls    SSH/Docker        SSH/Docker
Setup          3 minutes          2 minutes    10 minutes        15 minutes
Custom code    Full control       Cog format   Full control      Full control
Auto-scaling   Yes                Yes          Manual            Manual
Web endpoints  Built-in           No           Manual            Manual
Cold start     ~30s               ~15s         None (always-on)  None

FAQ

Q: How fast is cold start? A: ~30 seconds for first run. Warm containers respond in <1 second. Use keep_warm=1 for always-on.

Q: Can I fine-tune models? A: Yes, full GPU access. Run any PyTorch/HuggingFace training loop on A100/H100.

Q: How does billing work? A: Per-second billing for GPU time. Container build time is free. No charges when idle.


Source & Thanks

Created by Modal.

modal.com — Serverless GPU cloud
