Scripts · Apr 8, 2026 · 2 min read

Modal — Serverless GPU Cloud for AI Workloads

Run GPU workloads in the cloud with Python decorators. Modal provides serverless A100/H100 GPUs for model inference, fine-tuning, and batch jobs with zero infrastructure.

TL;DR
Run GPU workloads in the cloud with Python decorators. Serverless A100/H100 for inference and fine-tuning.
§01

What it is

Modal is a serverless cloud platform for running GPU workloads. You write Python functions, decorate them with @app.function(gpu='A100'), and Modal handles provisioning GPU instances, installing dependencies, and scaling. There is no infrastructure to manage: no Dockerfiles, no Kubernetes, no cloud console. Modal supports T4, A10G, L4, A100, and H100 GPUs for model inference, fine-tuning, batch processing, and web endpoints.

Modal suits ML engineers, AI researchers, and developers who need GPU compute without managing infrastructure. It is particularly useful for workloads that are too bursty or infrequent to justify dedicated GPU instances.

§02

How it saves time or tokens

Modal eliminates the hours spent setting up GPU infrastructure. No CUDA driver installation, no Docker image building, no autoscaling configuration. Cold start times are measured in seconds. You pay only for the GPU time you use (per-second billing), making it cost-effective for workloads that run for minutes or hours rather than continuously.
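
For example, at the roughly $3-4/hour A100 rate cited in the FAQ below, a ten-minute batch job costs around fifty to seventy cents, versus paying around the clock for a dedicated instance.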

§03

How to use

  1. Install the Modal SDK and run modal setup for one-time authentication
  2. Write a Python function decorated with @app.function(gpu='A100')
  3. Run modal run script.py to execute on a cloud GPU
§04

Example

import modal

# The default image lacks ML dependencies, so build one with what the
# remote function needs.
image = modal.Image.debian_slim().pip_install('transformers', 'torch', 'accelerate')

app = modal.App('my-ai-app')

@app.function(gpu='A100', image=image)
def run_inference(prompt: str) -> str:
    from transformers import pipeline

    # Loads on the remote GPU; this gated model also requires a Hugging Face token.
    pipe = pipeline('text-generation', model='meta-llama/Meta-Llama-3-8B-Instruct', device_map='auto')
    result = pipe(prompt, max_new_tokens=256)
    return result[0]['generated_text']

@app.local_entrypoint()
def main():
    output = run_inference.remote('Explain quantum computing.')
    print(output)
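
To install the SDK and run the script:
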
pip install modal
modal setup  # one-time auth
modal run inference.py
§05

Common pitfalls

  • Cold starts add a few seconds on first invocation; use keep_warm=1 for latency-sensitive endpoints (see the sketch after this list)
  • Large model downloads happen on every cold start unless you use Modal's volume mounts to cache weights
  • GPU availability varies by type; H100s may have wait times during peak demand periods
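
A minimal sketch of the keep_warm mitigation from the first bullet; the app name and function body are illustrative:

import modal

app = modal.App('warm-demo')  # illustrative app name

# keep_warm=1 keeps one container resident so requests skip the cold start,
# at the cost of paying for the idle container.
@app.function(gpu='A100', keep_warm=1)
def generate(prompt: str) -> str:
    return prompt.upper()  # placeholder for real inference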

Frequently Asked Questions

How much does Modal cost?

Modal charges per-second for GPU usage. An A100 costs approximately $3-4/hour. There is a free tier with $30/month of compute credits. No upfront commitment or reserved instances are required.

Which GPU types does Modal support?

Modal offers T4, A10G, L4, A100 (40GB and 80GB), and H100 GPUs. You specify the GPU type in your function decorator, and Modal provisions the right hardware automatically.

Can I deploy web endpoints on Modal?

Yes. Apply @modal.web_endpoint() beneath @app.function() to deploy a function as an HTTPS endpoint. Modal handles SSL, routing, and autoscaling. Endpoints can serve model inference via REST API.
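
A minimal sketch of this pattern; the endpoint-demo app name and hello function are illustrative:

import modal

app = modal.App('endpoint-demo')

@app.function()
@modal.web_endpoint(method='GET')
def hello(name: str = 'world') -> dict:
    # Modal serves this over HTTPS; query parameters map to function arguments.
    return {'hello': name}

Deploy it with modal deploy to get a persistent HTTPS URL.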

How do I cache model weights?

Use Modal Volumes to persist model weights across invocations. Download the model once to a volume, then mount it in subsequent runs. This eliminates repeated downloads and reduces cold start time.
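
One way to apply this, sketched with an illustrative volume name and a one-time Hugging Face download; adapt the paths and model to your setup:

import modal

image = modal.Image.debian_slim().pip_install('huggingface_hub')
vol = modal.Volume.from_name('model-weights', create_if_missing=True)  # illustrative name
app = modal.App('cache-weights')

@app.function(image=image, volumes={'/weights': vol})
def download_weights():
    from huggingface_hub import snapshot_download

    # Fetch the model once into the mounted volume, then persist the writes.
    snapshot_download('meta-llama/Meta-Llama-3-8B-Instruct', local_dir='/weights/llama-3-8b')
    vol.commit()

Subsequent functions mount the same volume and load from /weights/llama-3-8b instead of re-downloading.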

Can I fine-tune models on Modal?

Yes. Modal provides the GPU compute and storage needed for fine-tuning. You write your training script in Python, specify the GPU type, and Modal handles the infrastructure. Multi-GPU training with PyTorch DDP is supported.
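
A minimal sketch of the launch pattern; the image contents, timeout, and training body are illustrative assumptions:

import modal

image = modal.Image.debian_slim().pip_install('torch', 'transformers', 'datasets')
app = modal.App('finetune-demo')

@app.function(gpu='H100', image=image, timeout=4 * 60 * 60)
def train():
    import torch

    # Your ordinary training loop runs here on the remote GPU.
    assert torch.cuda.is_available()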

Source & Thanks

Created by Modal.

modal.com — Serverless GPU cloud
