Skills2026年4月8日·1 分钟阅读

Replicate — Run AI Models via Simple API Calls

Cloud platform to run open-source AI models with a simple API. Replicate hosts Llama, Stable Diffusion, Whisper, and thousands of models — no GPU setup or Docker required.

Replicate · Community

Agent 就绪

Agent 可直接安装

这个资产可安装；Agent 先选择当前运行时、检查安装计划，再运行匹配命令。

Native · 98/100策略：允许

Agent 入口

任意 MCP/CLI Agent

类型

Skill

安装

Single

信任

信任等级：Community

入口

Replicate — Run AI Models via Simple API Calls

直接安装命令

npx -y tokrepo@latest install e80aca76-b9b8-4330-8611-ee1ead26c99e --target codex

先 dry-run 确认安装计划，再运行此命令。

TL;DR

Replicate runs open-source AI models via API with no GPU setup and pay-per-second billing.

§01

What it is

Replicate is a cloud platform that runs open-source AI models via a simple API. No GPU provisioning, no Docker, no model serving code. Call replicate.run() with a model name and input. It hosts thousands of models including Llama, Stable Diffusion, Whisper, and community fine-tunes.

Replicate targets developers who want to use open-source models without managing GPU infrastructure. It provides Python and Node.js SDKs, an HTTP API, and pay-per-second billing.

§02

How it saves time or tokens

Replicate eliminates the infrastructure overhead of running AI models. Setting up a GPU server, installing CUDA drivers, downloading model weights, and configuring a serving endpoint takes hours. With Replicate, you install the SDK, set your API token, and call the model in 3 lines of code. The pay-per-second billing model means you only pay for actual compute time, not idle GPU instances. Custom models can be deployed using the Cog packaging tool.

§03

How to use

Install the Python SDK:

pip install replicate

Run a text generation model:

import replicate

output = replicate.run(
    'meta/meta-llama-3.1-405b-instruct',
    input={'prompt': 'Explain quantum computing in simple terms'}
)
print(''.join(output))

Generate an image:

output = replicate.run(
    'stability-ai/sdxl:latest',
    input={'prompt': 'A sunset over mountains, oil painting style'}
)
print(output[0])  # Image URL

§04

Example

Deploying a custom model with Cog:

# predict.py for Cog packaging
from cog import BasePredictor, Input
import torch

class Predictor(BasePredictor):
    def setup(self):
        self.model = torch.load('model.pth')

    def predict(
        self,
        text: str = Input(description='Input text'),
        temperature: float = Input(
            description='Sampling temperature',
            default=0.7,
            ge=0.0,
            le=2.0
        ),
    ) -> str:
        return self.model.generate(text, temperature)

# Build and push to Replicate
cog push r8.im/your-username/your-model

§05

Related on TokRepo

AI tools for coding — More AI development tools on TokRepo.
Local LLM tools — Compare cloud vs local inference options.

§06

Common pitfalls

Cold starts on infrequently used models can add 10-30 seconds of latency. Use warm model endpoints for production workloads.
Not streaming responses for text generation causes unnecessary waiting. Use the streaming API for real-time token output.
Pay-per-second billing can surprise you with large batch jobs. Estimate costs before running thousands of predictions.

常见问题

How much does Replicate cost?+

Replicate uses pay-per-second billing. Costs vary by model and GPU type. A Llama 3.1 70B inference costs approximately $0.65 per million input tokens. Image generation with SDXL costs a few cents per image. Check replicate.com/pricing for current rates.

Can I deploy custom models on Replicate?+

Yes. Use Cog (Replicate's open-source packaging tool) to containerize your model with a predict.py file. Push to Replicate with cog push and your model gets an API endpoint automatically.

What models does Replicate host?+

Replicate hosts thousands of models including Meta Llama, Stable Diffusion, OpenAI Whisper, Mistral, community fine-tunes, image generation, video generation, and audio models. Browse models at replicate.com/explore.

Does Replicate support streaming?+

Yes. For text generation models, Replicate supports token-level streaming via the Python SDK and HTTP SSE. This reduces time-to-first-token for chat applications.

How does Replicate compare to running models locally?+

Replicate trades infrastructure management for per-request costs. Running locally requires GPU hardware and setup but has no per-request costs. Replicate is ideal for prototyping, low-volume production, and models too large for your hardware.

引用来源 (3)

Replicate— Replicate runs open-source AI models via API
Cog GitHub— Cog packaging tool for custom models
Replicate Python GitHub— Replicate Python SDK

🙏

来源与感谢

replicate.com — Cloud AI model platform

讨论

登录后参与讨论。

还没有评论，来写第一条吧。

Replicate — Run AI Models via Simple API Calls

Agent 可直接安装

What it is

How it saves time or tokens

How to use

Example

Related on TokRepo

Common pitfalls

常见问题

引用来源 (3)

TokRepo 相关

来源与感谢

讨论

相关资产

Replicate Cog — Containerize ML Models with One YAML File

mistral-inference — Run Mistral Models

Jan — Run AI Models Locally on Your Desktop

LocalAI — Run Any AI Model Locally, No GPU