Skills · Apr 8, 2026 · 3 min read

Replicate — Run AI Models via Simple API Calls

Cloud platform to run open-source AI models with a simple API. Replicate hosts Llama, Stable Diffusion, Whisper, and thousands of models — no GPU setup or Docker required.

Agent ready

Ready-to-run agent install

This asset can be installed after the agent chooses its runtime, checks the plan, and runs the matching command.

Native · 98/100 · Policy: allow
Agent surface
Any MCP/CLI agent
Kind
Skill
Install
Single
Trust
Community
Entrypoint
Replicate — Run AI Models via Simple API Calls
Direct install command
npx -y tokrepo@latest install e80aca76-b9b8-4330-8611-ee1ead26c99e --target codex

Run after dry-run confirms the install plan.

TL;DR
Replicate runs open-source AI models via API with no GPU setup and pay-per-second billing.
§01

What it is

Replicate is a cloud platform that runs open-source AI models via a simple API. No GPU provisioning, no Docker, no model serving code. Call replicate.run() with a model name and input. It hosts thousands of models including Llama, Stable Diffusion, Whisper, and community fine-tunes.

Replicate targets developers who want to use open-source models without managing GPU infrastructure. It provides Python and Node.js SDKs, an HTTP API, and pay-per-second billing.
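The HTTP API can be used without either SDK. The sketch below builds the request with the Python standard library only. It assumes the model-scoped predictions endpoint and the Bearer-token header described in Replicate's public API documentation; the helper names build_prediction_request and create_prediction are illustrative, so check the endpoint against the current API reference before relying on it.

```python
import json
import os
import urllib.request

API_BASE = "https://api.replicate.com/v1"


def build_prediction_request(model, prompt, token):
    """Build (but do not send) a POST request that creates a prediction.

    `model` is an "owner/name" string. The URL shape and headers are
    assumptions based on Replicate's public HTTP API documentation.
    """
    body = json.dumps({"input": {"prompt": prompt}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/models/{model}/predictions",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def create_prediction(model, prompt):
    """Send the request and return the decoded JSON response."""
    token = os.environ["REPLICATE_API_TOKEN"]
    request = build_prediction_request(model, prompt, token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example call (needs network access and a valid token):
# result = create_prediction(
#     "meta/meta-llama-3.1-405b-instruct",
#     "Explain quantum computing in simple terms",
# )
# print(result["status"])
```

Keeping request construction separate from sending makes it possible to inspect the URL, headers, and body without spending any compute.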

§02

How it saves time or tokens

Replicate eliminates the infrastructure overhead of running AI models. Setting up a GPU server, installing CUDA drivers, downloading model weights, and configuring a serving endpoint takes hours. With Replicate, you install the SDK, set your API token, and call the model in 3 lines of code. The pay-per-second billing model means you only pay for actual compute time, not idle GPU instances. Custom models can be deployed using the Cog packaging tool.

§03

How to use

  1. Install the Python SDK:
pip install replicate
  2. Run a text generation model:
import replicate

output = replicate.run(
    'meta/meta-llama-3.1-405b-instruct',
    input={'prompt': 'Explain quantum computing in simple terms'}
)
print(''.join(output))
  3. Generate an image:
# ':latest' is not a valid version tag; copy the version ID from the model page.
output = replicate.run(
    'stability-ai/sdxl:<version-id>',
    input={'prompt': 'A sunset over mountains, oil painting style'}
)
print(output[0])  # Image URL
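The steps above assume the API token is already available to the SDK. Here is a minimal sketch of a wrapper that fails early when it is missing. It assumes the SDK reads the token from the REPLICATE_API_TOKEN environment variable; require_token and generate_text are hypothetical helper names, not part of the SDK.

```python
import os


def require_token(env=None):
    """Return the API token, or raise a clear error if it is not set."""
    env = os.environ if env is None else env
    token = env.get("REPLICATE_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError("Set REPLICATE_API_TOKEN before calling Replicate.")
    return token


def generate_text(prompt, model="meta/meta-llama-3.1-405b-instruct"):
    """Run a text model and return its output as one string."""
    require_token()
    import replicate  # imported lazily so the token check runs first

    output = replicate.run(model, input={"prompt": prompt})
    return "".join(str(chunk) for chunk in output)


# Example call (needs `pip install replicate` and a valid token):
# print(generate_text("Explain quantum computing in simple terms"))
```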
§04

Example

Deploying a custom model with Cog:

# predict.py for Cog packaging
from cog import BasePredictor, Input
import torch

class Predictor(BasePredictor):
    def setup(self):
        self.model = torch.load('model.pth')

    def predict(
        self,
        text: str = Input(description='Input text'),
        temperature: float = Input(
            description='Sampling temperature',
            default=0.7,
            ge=0.0,
            le=2.0
        ),
    ) -> str:
        return self.model.generate(text, temperature)

# Shell command, run from the directory that holds predict.py and cog.yaml:
# build the image and push it to Replicate
cog push r8.im/your-username/your-model
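Once pushed, the model can be called like any other. This is a sketch under assumptions: the input names text and temperature mirror the predict() signature above, the range check repeats its ge/le bounds so bad values fail before a request is made, and build_input and run_custom_model are hypothetical helpers. Custom models are typically addressed with an explicit version ID, which is left as a placeholder here.

```python
def build_input(text, temperature=0.7):
    """Validate inputs locally against the bounds declared in predict()."""
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")
    return {"text": text, "temperature": temperature}


def run_custom_model(model_ref, text, temperature=0.7):
    """Call the pushed model; model_ref looks like 'owner/name:<version-id>'."""
    import replicate  # requires `pip install replicate` and an API token

    return replicate.run(model_ref, input=build_input(text, temperature))


# Example call (placeholder reference, not a real model):
# run_custom_model("your-username/your-model:<version-id>", "Hello")
```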
§05

Common pitfalls

  • Cold starts on infrequently used models can add 10-30 seconds of latency. For production workloads, keep the model warm, for example with a Replicate deployment configured with a minimum number of running instances.
  • Not streaming responses for text generation causes unnecessary waiting. Use the streaming API for real-time token output.
  • Pay-per-second billing can surprise you with large batch jobs. Estimate costs before running thousands of predictions.
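A rough batch estimate for the last pitfall is simple arithmetic: predictions × seconds per prediction × price per second. The sketch below uses made-up figures; take the real per-second rate for your model's hardware from replicate.com/pricing and measure the run time on a small sample first.

```python
def estimate_batch_cost(num_predictions, seconds_per_prediction, price_per_second):
    """Estimate the total cost in dollars of a batch billed per second."""
    if min(num_predictions, seconds_per_prediction, price_per_second) < 0:
        raise ValueError("all inputs must be non-negative")
    return num_predictions * seconds_per_prediction * price_per_second


# Hypothetical figures: 10,000 predictions, 4 s each, $0.001 per second.
cost = estimate_batch_cost(10_000, 4.0, 0.001)
print(f"Estimated cost: ${cost:.2f}")  # prints "Estimated cost: $40.00"
```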

Frequently Asked Questions

How much does Replicate cost?

Replicate bills most models by the second of compute time, and the rate depends on the GPU type. Some hosted language models are billed per token instead: Llama 3.1 70B inference, for example, costs approximately $0.65 per million input tokens. Image generation with SDXL costs a few cents per image. Check replicate.com/pricing for current rates.

Can I deploy custom models on Replicate?

Yes. Use Cog (Replicate's open-source packaging tool) to containerize your model with a predict.py file. Push to Replicate with cog push and your model gets an API endpoint automatically.

What models does Replicate host?

Replicate hosts thousands of models including Meta Llama, Stable Diffusion, OpenAI Whisper, Mistral, community fine-tunes, image generation, video generation, and audio models. Browse models at replicate.com/explore.

Does Replicate support streaming?

Yes. For text generation models, Replicate supports token-level streaming via the Python SDK and HTTP SSE. This reduces time-to-first-token for chat applications.
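A minimal streaming sketch with the Python SDK, assuming its replicate.stream helper yields events whose string form is the next chunk of text. print_stream and stream_text are hypothetical helper names; the printing logic is kept separate so it can be exercised with any iterable.

```python
import sys


def print_stream(events, out=sys.stdout):
    """Write each streamed chunk as it arrives and return the full text."""
    parts = []
    for event in events:
        chunk = str(event)
        out.write(chunk)
        out.flush()  # show tokens immediately instead of buffering
        parts.append(chunk)
    return "".join(parts)


def stream_text(prompt, model="meta/meta-llama-3.1-405b-instruct"):
    """Stream tokens from a text model using the SDK's stream helper."""
    import replicate  # requires `pip install replicate` and an API token

    return print_stream(replicate.stream(model, input={"prompt": prompt}))


# Example call (needs network access and a valid token):
# stream_text("Explain quantum computing in simple terms")
```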

How does Replicate compare to running models locally?

Replicate trades infrastructure management for per-request costs. Running locally requires GPU hardware and setup but has no per-request costs. Replicate is ideal for prototyping, low-volume production, and models too large for your hardware.
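The trade-off can be put into rough numbers with a break-even count: the number of requests at which the up-front hardware cost equals the accumulated hosted cost. The figures below are hypothetical, and the sketch ignores electricity, maintenance, and setup time, all of which push the real break-even point higher.

```python
import math


def break_even_requests(hardware_cost, cost_per_request):
    """Requests needed before local hardware costs less than hosted calls."""
    if hardware_cost < 0 or cost_per_request <= 0:
        raise ValueError("hardware_cost must be >= 0 and cost_per_request > 0")
    return math.ceil(hardware_cost / cost_per_request)


# Hypothetical figures: a $2,000 GPU versus $0.02 per hosted request.
print(break_even_requests(2000, 0.02))
```

Below that volume the hosted option is cheaper on these assumptions, which matches the prototyping and low-volume cases named above.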


Source & Thanks

Created by Replicate.

replicate.com — Run AI models in the cloud
