Replicate — Run AI Models via Simple API Calls
Cloud platform to run open-source AI models with a simple API. Replicate hosts Llama, Stable Diffusion, Whisper, and thousands of models — no GPU setup or Docker required.
Ready-to-run agent install
This asset can be installed after the agent chooses its runtime, checks the plan, and runs the matching command.
npx -y tokrepo@latest install e80aca76-b9b8-4330-8611-ee1ead26c99e --target codex
Run after dry-run confirms the install plan.
What it is
Replicate is a cloud platform that runs open-source AI models via a simple API. No GPU provisioning, no Docker, no model serving code. Call replicate.run() with a model name and input. It hosts thousands of models including Llama, Stable Diffusion, Whisper, and community fine-tunes.
Replicate targets developers who want to use open-source models without managing GPU infrastructure. It provides Python and Node.js SDKs, an HTTP API, and pay-per-second billing.
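For readers who are not using an SDK, the HTTP API mentioned above can be sketched with only the Python standard library. The example builds, but does not send, a request that creates a prediction. The endpoint path and the Bearer authorization header are assumptions based on Replicate's public HTTP API and should be checked against the API reference; build_prediction_request is a hypothetical helper name, and the path used here is the one for official models addressed by owner and name.

```python
import json
import urllib.request

# Assumed base URL of Replicate's HTTP API.
API_BASE = "https://api.replicate.com/v1"


def build_prediction_request(model: str, inputs: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that creates a prediction.

    `model` is an "owner/name" string; `token` is a Replicate API token.
    """
    owner, name = model.split("/", 1)
    body = json.dumps({"input": inputs}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/models/{owner}/{name}/predictions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder token for illustration only; nothing is sent over the network here.
req = build_prediction_request(
    "meta/meta-llama-3.1-405b-instruct",
    {"prompt": "Explain quantum computing in simple terms"},
    token="YOUR_API_TOKEN",
)
print(req.get_method(), req.full_url)
```

Sending the request is one call to urllib.request.urlopen(req) with a real token; the JSON response describes the prediction, including its status.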
How it saves time or tokens
Replicate eliminates the infrastructure overhead of running AI models. Setting up a GPU server, installing CUDA drivers, downloading model weights, and configuring a serving endpoint takes hours. With Replicate, you install the SDK, set your API token, and call the model in 3 lines of code. The pay-per-second billing model means you only pay for actual compute time, not idle GPU instances. Custom models can be deployed using the Cog packaging tool.
How to use
- Install the Python SDK:
pip install replicate
- Run a text generation model:
import replicate
output = replicate.run(
    'meta/meta-llama-3.1-405b-instruct',
    input={'prompt': 'Explain quantum computing in simple terms'}
)
print(''.join(output))
- Generate an image:
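output = replicate.run(
    # Replace <version-id> with a version hash from the model page;
    # ':latest' is not a version identifier Replicate accepts.
    'stability-ai/sdxl:<version-id>',
    input={'prompt': 'A sunset over mountains, oil painting style'}
)
print(output[0])  # Image URL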
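The snippets above assume the SDK can find an API token. By default the Python client reads it from the REPLICATE_API_TOKEN environment variable. A minimal sketch that fails early with a clear message when the token is missing; make_client is a hypothetical helper, not part of the SDK.

```python
import os


def make_client(env=None):
    """Return a replicate.Client, or raise if no API token is configured.

    `env` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    token = env.get("REPLICATE_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "REPLICATE_API_TOKEN is not set; create a token in your Replicate "
            "account settings and export it before running models."
        )
    # Imported lazily so the token check runs even before `pip install replicate`.
    import replicate

    return replicate.Client(api_token=token)
```

The returned client has a run() method that takes the same model name and input arguments as replicate.run().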
Example
Deploying a custom model with Cog:
# predict.py for Cog packaging
from cog import BasePredictor, Input
import torch

class Predictor(BasePredictor):
    def setup(self):
        # Load the model once, when the container starts
        self.model = torch.load('model.pth')

    def predict(
        self,
        text: str = Input(description='Input text'),
        temperature: float = Input(
            description='Sampling temperature',
            default=0.7,
            ge=0.0,
            le=2.0
        ),
    ) -> str:
        # Assumes the loaded model object exposes a generate() method
        return self.model.generate(text, temperature)

# Build and push to Replicate (shell command)
cog push r8.im/your-username/your-model
Related on TokRepo
- AI tools for coding — More AI development tools on TokRepo.
- Local LLM tools — Compare cloud vs local inference options.
Common pitfalls
- Cold starts on infrequently used models can add 10-30 seconds of latency. For production workloads, use a deployment configured to keep at least one instance running.
- Not streaming responses for text generation causes unnecessary waiting. Use the streaming API for real-time token output.
- Pay-per-second billing can surprise you with large batch jobs. Estimate costs before running thousands of predictions.
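For the streaming pitfall above, the Python SDK exposes replicate.stream(), which yields server-sent events whose string form is the next chunk of text. A sketch, with the consumer split out so it can be exercised without network access; print_stream and stream_completion are hypothetical helper names, the model name is the one used earlier on this page, and not every model supports streaming.

```python
from typing import Iterable


def print_stream(events: Iterable[object]) -> str:
    """Print each chunk as soon as it arrives and return the full text."""
    chunks = []
    for event in events:
        text = str(event)  # each streamed event stringifies to its text chunk
        print(text, end="", flush=True)
        chunks.append(text)
    print()
    return "".join(chunks)


def stream_completion(prompt: str) -> str:
    """Stream a completion from Replicate (needs REPLICATE_API_TOKEN and network)."""
    import replicate  # imported lazily so print_stream works without the SDK

    events = replicate.stream(
        "meta/meta-llama-3.1-405b-instruct",
        input={"prompt": prompt},
    )
    return print_stream(events)
```

Calling stream_completion('Explain quantum computing in simple terms') prints tokens as they arrive instead of waiting for the whole response.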
Frequently Asked Questions
How much does Replicate cost?
Replicate uses pay-per-second billing, and costs vary by model and GPU type. Some language models are priced per token instead: Llama 3.1 70B inference costs approximately $0.65 per million input tokens. Image generation with SDXL costs a few cents per image. Check replicate.com/pricing for current rates.
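A rough way to estimate cost before a large batch, using the approximate $0.65 per million input tokens quoted above. The request count and token count in the usage lines are made-up inputs, and any output-token charge is ignored, so treat the result as a lower bound rather than a quote.

```python
def estimate_input_token_cost(
    n_requests: int,
    avg_input_tokens: int,
    usd_per_million_tokens: float = 0.65,  # approximate figure quoted above
) -> float:
    """Estimated USD cost of input tokens only across a batch of requests."""
    if n_requests < 0 or avg_input_tokens < 0:
        raise ValueError("counts must be non-negative")
    total_tokens = n_requests * avg_input_tokens
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical batch: 10,000 requests averaging 500 input tokens each.
cost = estimate_input_token_cost(10_000, 500)
print(f"~${cost:.2f}")
```

With these inputs the batch contains 5 million input tokens, so the estimate comes to about $3.25.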
Can I deploy my own models?
Yes. Use Cog (Replicate's open-source packaging tool) to containerize your model with a predict.py file. Push it to Replicate with cog push and the model gets an API endpoint automatically.
Which models are available?
Replicate hosts thousands of models, including Meta Llama, Stable Diffusion, OpenAI Whisper, Mistral, and community fine-tunes, covering text, image, video, and audio generation. Browse models at replicate.com/explore.
Does Replicate support streaming?
Yes. For text generation models, Replicate supports token-level streaming via the Python SDK and HTTP SSE. This reduces time-to-first-token for chat applications.
How does Replicate compare to running models locally?
Replicate trades infrastructure management for per-request costs. Running locally requires GPU hardware and setup but has no per-request costs. Replicate is best suited to prototyping, low-volume production, and models too large for your hardware.
Source & Thanks
Created by Replicate.
replicate.com — Run AI models in the cloud
Related Assets
Replicate Cog — Containerize ML Models with One YAML File
Cog is Replicate's open-source tool to wrap an ML model in a Docker container. One cog.yaml + predict.py gives you a portable, GPU-aware HTTP model.
mistral-inference — Run Mistral Models
Run Mistral models with minimal inference code. Install via pip, load a model, and build a local workflow before moving to larger deployments.
Jan — Run AI Models Locally on Your Desktop
Open-source desktop app to run LLMs offline. Jan supports Llama, Mistral, and Gemma models with one-click download, OpenAI-compatible API, and full privacy.
LocalAI — Run Any AI Model Locally, No GPU
LocalAI is an open-source AI engine running LLMs, vision, voice, and image models locally. 44.6K+ GitHub stars. OpenAI/Anthropic-compatible API, 35+ backends, MCP, agents. MIT licensed.