Skills · Apr 8, 2026 · 3 min read

Replicate — Run AI Models via Simple API Calls

Cloud platform to run open-source AI models with a simple API. Replicate hosts Llama, Stable Diffusion, Whisper, and thousands of models — no GPU setup or Docker required.

Replicate · Community

Agent ready

Ready-to-run agent install

This asset can be installed after the agent chooses its runtime, checks the plan, and runs the matching command.

Native · 98/100 · Policy: allow

Agent surface: Any MCP/CLI agent
Kind: Skill
Install: Single
Trust: Community
Entrypoint: Replicate — Run AI Models via Simple API Calls

Direct install command

npx -y tokrepo@latest install e80aca76-b9b8-4330-8611-ee1ead26c99e --target codex

Run after dry-run confirms the install plan.

TL;DR

Replicate runs open-source AI models via API with no GPU setup and pay-per-second billing.

§01

What it is

Replicate is a cloud platform that runs open-source AI models via a simple API. No GPU provisioning, no Docker, no model serving code. Call replicate.run() with a model name and input. It hosts thousands of models including Llama, Stable Diffusion, Whisper, and community fine-tunes.

Replicate targets developers who want to use open-source models without managing GPU infrastructure. It provides Python and Node.js SDKs, an HTTP API, and pay-per-second billing.

§02

How it saves time or tokens

Replicate eliminates the infrastructure overhead of running AI models. Setting up a GPU server, installing CUDA drivers, downloading model weights, and configuring a serving endpoint takes hours. With Replicate, you install the SDK, set your API token, and call the model in 3 lines of code. The pay-per-second billing model means you only pay for actual compute time, not idle GPU instances. Custom models can be deployed using the Cog packaging tool.

§03

How to use

Install the Python SDK:

pip install replicate
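
The SDK reads the API token from the REPLICATE_API_TOKEN environment variable. A minimal sketch of a pre-flight check, so a missing token fails early with a clear message. The helper name `require_token` is hypothetical and not part of the SDK:

```python
import os


def require_token(env=None):
    """Return the Replicate API token or raise with a helpful message.

    `require_token` is a hypothetical helper, not part of the replicate SDK.
    """
    env = os.environ if env is None else env
    token = env.get("REPLICATE_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "REPLICATE_API_TOKEN is not set; export it before calling replicate.run()"
        )
    return token
```

Calling `require_token()` once at startup is usually enough; the SDK picks up the same variable on its own afterwards.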

Run a text generation model:

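import replicate  # reads the API token from REPLICATE_API_TOKEN

# For language models, replicate.run() yields the reply as string chunks
output = replicate.run(
    'meta/meta-llama-3.1-405b-instruct',
    input={'prompt': 'Explain quantum computing in simple terms'}
)
print(''.join(output))  # join the chunks into one string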

Generate an image:

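import replicate

# Community models are pinned by version ID (a hash), not a ':latest' tag.
# Replace <version-id> with the ID shown on the model's Replicate page.
output = replicate.run(
    'stability-ai/sdxl:<version-id>',
    input={'prompt': 'A sunset over mountains, oil painting style'}
)
print(output[0])  # URL of the first generated image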
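
To keep a generated image instead of only printing its URL, the output item can be written to disk. This is a sketch under one assumption: depending on the SDK version, an output item is either a file-like object with a `.read()` method or a plain URL string. The helper `save_output` is hypothetical and handles both cases:

```python
import urllib.request
from pathlib import Path


def save_output(item, path):
    """Write one replicate.run() output item to disk and return the path.

    Assumption: newer SDK versions return file-like objects with .read(),
    older ones return URL strings. `save_output` is a hypothetical helper.
    """
    path = Path(path)
    if hasattr(item, "read"):
        data = item.read()  # file-like output: read the bytes directly
    else:
        # URL string: download the file over HTTP
        with urllib.request.urlopen(str(item)) as response:
            data = response.read()
    path.write_bytes(data)
    return path
```

With the image example above, usage would look like `save_output(output[0], 'sunset.png')`.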

§04

Example

Deploying a custom model with Cog:

# predict.py for Cog packaging
from cog import BasePredictor, Input
import torch

class Predictor(BasePredictor):
    def setup(self):
        # Runs once when the container starts; load weights here, not in predict()
        self.model = torch.load('model.pth')

    def predict(
        self,
        text: str = Input(description='Input text'),
        temperature: float = Input(
            description='Sampling temperature',
            default=0.7,
            ge=0.0,
            le=2.0
        ),
    ) -> str:
        return self.model.generate(text, temperature)

Build the image and push it to Replicate, from the directory that holds cog.yaml and predict.py:

cog push r8.im/your-username/your-model
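
Once the model is pushed, it can be called like any hosted model. The sketch below mirrors the predictor's `text` and `temperature` inputs and checks the temperature range on the client before a request is sent. Both helper names are hypothetical, and the model reference is a placeholder to be copied from the model's page:

```python
def build_input(text, temperature=0.7):
    """Build the input dict for the Predictor, mirroring its ge/le bounds."""
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.0 and 2.0")
    return {"text": text, "temperature": temperature}


def run_custom_model(model_ref, text, temperature=0.7):
    """Call a pushed model.

    `model_ref` is a placeholder such as 'your-username/your-model:<version-id>';
    copy the real value from the model page on Replicate.
    """
    import replicate  # imported lazily so build_input works without the SDK

    return replicate.run(model_ref, input=build_input(text, temperature))
```

Validating on the client is optional, since Cog enforces the same bounds on the server, but it avoids paying for a request that is certain to be rejected.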

§05

Related on TokRepo

AI tools for coding — More AI development tools on TokRepo.
Local LLM tools — Compare cloud vs local inference options.

§06

Common pitfalls

Cold starts on infrequently used models can add 10-30 seconds of latency. For production workloads, keep the model warm, for example with a Replicate deployment whose minimum instance count is above zero.
Waiting for the full response from a text generation model adds unnecessary latency. Use the streaming API to show tokens as they are generated.
Pay-per-second billing can surprise you with large batch jobs. Estimate costs before running thousands of predictions.
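
For the cold-start pitfall, one option is to create the prediction asynchronously and poll with a deadline, so a slow start surfaces as a timeout instead of a hung request. This is a sketch, assuming a prediction object that exposes `.status` and `.reload()` as the objects returned by `replicate.predictions.create()` do. The helper `wait_for_prediction` and its defaults are hypothetical:

```python
import time

# Statuses after which a prediction will not change any more
TERMINAL = {"succeeded", "failed", "canceled"}


def wait_for_prediction(prediction, timeout_s=120.0, poll_s=2.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll until the prediction finishes or the deadline passes.

    Returns the final status, or raises TimeoutError if the deadline is hit.
    `sleep` and `clock` are injectable so the loop can be tested offline.
    """
    deadline = clock() + timeout_s
    while prediction.status not in TERMINAL:
        if clock() >= deadline:
            raise TimeoutError(
                f"prediction still '{prediction.status}' after {timeout_s}s"
            )
        sleep(poll_s)
        prediction.reload()  # refresh the status from the API
    return prediction.status
```

A timeout budget well above the 10-30 second cold-start range leaves room for the model to boot before the caller gives up.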

Frequently Asked Questions

How much does Replicate cost?

Most models on Replicate are billed per second of compute, with the rate set by the GPU type; some language models are billed per token instead. For example, Llama 3.1 70B inference costs approximately $0.65 per million input tokens, and image generation with SDXL costs a few cents per image. Check replicate.com/pricing for current rates.
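
A rough estimate before a large batch job takes only a few lines of arithmetic. The sketch below covers both billing styles. The per-second rate in the usage note is an illustrative assumption, not a published price; the $0.65 per million input tokens figure is the one quoted above:

```python
def estimate_compute_cost(num_predictions, seconds_per_prediction, usd_per_second):
    """Estimate the cost of a batch on a model billed per second of compute."""
    return num_predictions * seconds_per_prediction * usd_per_second


def estimate_token_cost(input_tokens, usd_per_million_input_tokens):
    """Estimate the input-token cost on a model billed per token."""
    return input_tokens / 1_000_000 * usd_per_million_input_tokens
```

For example, 10,000 predictions at 4 seconds each and a hypothetical $0.001 per second come to about $40, and 2 million input tokens at $0.65 per million come to about $1.30. Output tokens and cold-start time are not included, so treat the result as a lower bound.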

Can I deploy custom models on Replicate?

Yes. Use Cog (Replicate's open-source packaging tool) to containerize your model with a predict.py file. Push to Replicate with cog push and your model gets an API endpoint automatically.

What models does Replicate host?

Replicate hosts thousands of models, including Meta Llama, Stable Diffusion, OpenAI Whisper, and Mistral, plus community fine-tunes and image, video, and audio generation models. Browse models at replicate.com/explore.

Does Replicate support streaming?

Yes. For text generation models, Replicate supports token-level streaming through the Python SDK and through server-sent events (SSE) over HTTP. This reduces time-to-first-token for chat applications.
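
A minimal sketch of streaming with the Python SDK, assuming `replicate.stream()` yields events that convert to text with `str()`. The helpers `print_stream` and `stream_reply` are hypothetical names; the consuming loop is kept separate so it can be tried without network access:

```python
def print_stream(events, write=print):
    """Print each streamed chunk as it arrives and return the full text.

    `events` is any iterable whose items convert to text with str().
    """
    chunks = []
    for event in events:
        chunk = str(event)
        write(chunk, end="", flush=True)  # show the token immediately
        chunks.append(chunk)
    return "".join(chunks)


def stream_reply(prompt, model="meta/meta-llama-3.1-405b-instruct"):
    """Stream a completion; needs the replicate package and an API token."""
    import replicate  # lazy import keeps print_stream usable offline

    return print_stream(replicate.stream(model, input={"prompt": prompt}))
```

Calling `stream_reply('Explain quantum computing in simple terms')` prints tokens as they arrive and returns the complete reply once the stream ends.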

How does Replicate compare to running models locally?

Replicate trades infrastructure management for per-request costs. Running locally requires GPU hardware and setup but has no per-request costs. Replicate is ideal for prototyping, low-volume production, and models too large for your hardware.

Citations (3)

Replicate — Replicate runs open-source AI models via API
Cog GitHub — Cog packaging tool for custom models
Replicate Python GitHub — Replicate Python SDK

Source & Thanks

Created by Replicate.

replicate.com — Run AI models in the cloud

Related Assets

Replicate Cog — Containerize ML Models with One YAML File

Cog is Replicate's open-source tool to wrap an ML model in a Docker container. One cog.yaml + predict.py gives you a portable, GPU-aware HTTP model.

Skills

Replicate

mistral-inference — Run Mistral Models

Run Mistral models with minimal inference code. Install via pip, load a model, and build a local workflow before moving to larger deployments.

Skills

AI Open Source

Jan — Run AI Models Locally on Your Desktop

Open-source desktop app to run LLMs offline. Jan supports Llama, Mistral, and Gemma models with one-click download, OpenAI-compatible API, and full privacy.

Skills

Skill Factory

LocalAI — Run Any AI Model Locally, No GPU

LocalAI is an open-source AI engine running LLMs, vision, voice, and image models locally. 44.6K+ GitHub stars. OpenAI/Anthropic-compatible API, 35+ backends, MCP, agents. MIT licensed.

Skills

AI Open Source