Modal — Serverless GPU Cloud for AI Workloads
Run GPU workloads in the cloud with Python decorators. Modal provides serverless A100/H100 GPUs for model inference, fine-tuning, and batch jobs with zero infrastructure.
Ready-to-run agent install
This asset can be installed after the agent chooses its runtime, checks the plan, and runs the matching command.
npx -y tokrepo@latest install a3ae2bd0-8b48-4cdd-9bb0-9c84f5272408 --target codex
Run after dry-run confirms the install plan.
What it is
Modal is a serverless cloud platform for running GPU workloads. You write Python functions, decorate them with @app.function(gpu='A100'), and Modal handles provisioning GPU instances, installing dependencies, and scaling. There is no infrastructure to manage: no Dockerfiles, no Kubernetes, no cloud console. Modal supports GPUs such as the T4, A100, and H100 for model inference, fine-tuning, batch processing, and web endpoints.
ML engineers, AI researchers, and developers who need GPU compute without infrastructure management benefit from Modal. It is particularly useful for workloads that are too bursty or infrequent to justify dedicated GPU instances.
How it saves time or tokens
Modal eliminates the hours spent setting up GPU infrastructure. No CUDA driver installation, no Docker image building, no autoscaling configuration. Cold start times are measured in seconds. You pay only for the GPU time you use (per-second billing), making it cost-effective for workloads that run for minutes or hours rather than continuously.
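As a rough illustration of why per-second billing matters for short jobs, the sketch below compares it with hour-granular billing. The $3.50/hour rate is an assumed midpoint of the approximate A100 price quoted in the FAQ, and the function names are hypothetical; this is plain arithmetic, not Modal's actual billing logic.

```python
import math

A100_RATE_PER_HOUR = 3.50  # assumed USD rate, not an official price


def per_second_cost(seconds: float, rate_per_hour: float = A100_RATE_PER_HOUR) -> float:
    """Cost when billed only for the seconds actually used."""
    return seconds * rate_per_hour / 3600


def hourly_rounded_cost(seconds: float, rate_per_hour: float = A100_RATE_PER_HOUR) -> float:
    """Cost when every started hour is billed in full."""
    return math.ceil(seconds / 3600) * rate_per_hour


job_seconds = 90  # a short, bursty inference job
print(f"per-second billing:   ${per_second_cost(job_seconds):.4f}")
print(f"hour-rounded billing: ${hourly_rounded_cost(job_seconds):.2f}")
```

Under these assumptions the 90-second job costs under nine cents with per-second billing, while hour-granular billing would charge the full hourly rate.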
How to use
- Install the Modal SDK and run modal setup for one-time authentication
- Write a Python function decorated with @app.function(gpu='A100')
- Run modal run script.py to execute on a cloud GPU
Example
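import modal

# Image with the libraries the remote function imports
image = modal.Image.debian_slim().pip_install('transformers', 'torch')

app = modal.App('my-ai-app')

@app.function(gpu='A100', image=image)
def run_inference(prompt: str) -> str:
    # Imported here so it only needs to be installed in the remote image
    from transformers import pipeline
    # Gated model: needs a Hugging Face token with access to this repo
    pipe = pipeline('text-generation', model='meta-llama/Meta-Llama-3-8B-Instruct', device=0)
    result = pipe(prompt, max_new_tokens=256)
    return result[0]['generated_text']

@app.local_entrypoint()
def main():
    output = run_inference.remote('Explain quantum computing.')
    print(output)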
pip install modal
modal setup # one-time auth
modal run inference.py
Related on TokRepo
- AI tools for coding — Browse AI development tools and platforms
- Featured workflows — Discover top-rated workflows
Common pitfalls
- Cold starts add a few seconds on first invocation; use keep_warm=1 (min_containers=1 in newer SDK releases) for latency-sensitive endpoints
- Large model downloads happen on every cold start unless you use Modal's volume mounts to cache weights
- GPU availability varies by type; H100s may have wait times during peak demand periods
Frequently Asked Questions
How much does Modal cost?
Modal charges per second for GPU usage. An A100 costs approximately $3-4/hour. There is a free tier with $30/month of compute credits. No upfront commitment or reserved instances are required.
Which GPUs does Modal offer?
Modal offers T4, A10G, L4, A100 (40GB and 80GB), and H100 GPUs. You specify the GPU type in your function decorator, and Modal provisions the right hardware automatically.
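One way to choose among these GPU types is by memory. The sketch below is a rough sizing heuristic, not a Modal feature: the pick_gpu helper, the 1.2 overhead factor, and the 2 bytes per parameter (16-bit weights) are all assumptions. The returned strings are intended for the gpu argument of the function decorator; check the Modal documentation for the exact identifiers it accepts.

```python
# (GPU identifier, memory in GB), ordered from smallest to largest
GPU_MEMORY_GB = [
    ("T4", 16),
    ("L4", 24),
    ("A10G", 24),
    ("A100-40GB", 40),
    ("A100-80GB", 80),
    ("H100", 80),
]


def pick_gpu(params_billion: float, bytes_per_param: int = 2, overhead: float = 1.2) -> str:
    """Return the first GPU whose memory fits the estimated model footprint."""
    needed_gb = params_billion * bytes_per_param * overhead
    for name, memory_gb in GPU_MEMORY_GB:
        if memory_gb >= needed_gb:
            return name
    raise ValueError(f"~{needed_gb:.0f} GB does not fit on a single listed GPU")


print(pick_gpu(8))  # 8B parameters in 16-bit weights need roughly 19 GB
```

The estimate covers weights plus a flat margin only; long contexts, large batches, and training state need considerably more memory.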
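Can I deploy web endpoints on Modal?
Yes. Use the @modal.fastapi_endpoint() decorator (named @modal.web_endpoint() in older SDK releases) together with @app.function() to deploy a function as an HTTPS endpoint. Modal handles SSL, routing, and autoscaling. Endpoints can serve model inference via REST API.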
How do I avoid re-downloading model weights on every run?
Use Modal Volumes to persist model weights across invocations. Download the model once to a volume, then mount it in subsequent runs. This eliminates repeated downloads and reduces cold start time.
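The download-once pattern can be sketched in plain Python. The ensure_weights helper, the .complete marker file, and the stand-in download function are hypothetical; the snippet only shows the caching logic and runs locally against a temporary directory instead of a real Modal Volume.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_weights(cache_dir: Path, download: Callable[[Path], None]) -> Path:
    """Run `download` only if cache_dir has no completed download yet."""
    marker = cache_dir / ".complete"
    if not marker.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        download(cache_dir)  # e.g. fetch weights from a model hub
        marker.touch()       # written last, so partial downloads are retried
    return cache_dir


# Local demonstration with a stand-in download function.
calls = []


def fake_download(target: Path) -> None:
    calls.append(target)
    (target / "weights.bin").write_bytes(b"\x00")


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "my-model"
    ensure_weights(cache, fake_download)  # first run: downloads
    ensure_weights(cache, fake_download)  # second run: reuses the cache

print(len(calls))  # 1
```

On Modal, cache_dir would point inside a mounted Volume, for example by passing volumes={'/cache': modal.Volume.from_name('model-cache', create_if_missing=True)} to the function decorator and committing the volume after the first download. Treat the volume name and mount path as placeholders, and verify the calls against the current Modal documentation.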
Can I fine-tune models on Modal?
Yes. Modal provides the GPU compute and storage needed for fine-tuning. You write your training script in Python, specify the GPU type, and Modal handles the infrastructure. Multi-GPU training with PyTorch DDP is supported.
Related Assets
modal-examples — Serverless LLM Jobs on Modal
Learn production patterns for serverless jobs (LLM inference, data pipelines) using Modal’s official examples. Run one and adapt it to your workload.
Serverless Framework — Build and Deploy Serverless Apps to Any Cloud
The most widely adopted toolkit for building serverless applications on AWS Lambda, Azure Functions, Google Cloud Functions, and more. Define infrastructure and functions in a single YAML file and deploy with one command.
Apache OpenWhisk — Open Source Serverless Cloud Platform
Apache OpenWhisk is a serverless functions platform that lets you deploy event-driven code in any language without managing servers, with support for composable action sequences and rich trigger integrations.
Modal Sandboxes — Secure Cloud Code Execution for AI Agents
Modal Sandboxes spin up secure Linux environments for agent-generated code in seconds. Custom images, GPUs, persistent volumes from any Modal Function.