Skills2026年4月8日·1 分钟阅读

Replicate — Run AI Models via Simple API Calls

Cloud platform to run open-source AI models with a simple API. Replicate hosts Llama, Stable Diffusion, Whisper, and thousands of models — no GPU setup or Docker required.

Agent 就绪

Agent 可直接安装

这个资产可安装;Agent 先选择当前运行时、检查安装计划,再运行匹配命令。

Native · 98/100策略:允许
Agent 入口
任意 MCP/CLI Agent
类型
Skill
安装
Single
信任
信任等级:Community
入口
Replicate — Run AI Models via Simple API Calls
直接安装命令
npx -y tokrepo@latest install e80aca76-b9b8-4330-8611-ee1ead26c99e --target codex

先 dry-run 确认安装计划,再运行此命令。

TL;DR
Replicate runs open-source AI models via API with no GPU setup and pay-per-second billing.
§01

What it is

Replicate is a cloud platform that runs open-source AI models via a simple API. No GPU provisioning, no Docker, no model serving code. Call replicate.run() with a model name and input. It hosts thousands of models including Llama, Stable Diffusion, Whisper, and community fine-tunes.

Replicate targets developers who want to use open-source models without managing GPU infrastructure. It provides Python and Node.js SDKs, an HTTP API, and pay-per-second billing.

§02

How it saves time or tokens

Replicate eliminates the infrastructure overhead of running AI models. Setting up a GPU server, installing CUDA drivers, downloading model weights, and configuring a serving endpoint takes hours. With Replicate, you install the SDK, set your API token, and call the model in 3 lines of code. The pay-per-second billing model means you only pay for actual compute time, not idle GPU instances. Custom models can be deployed using the Cog packaging tool.

§03

How to use

  1. Install the Python SDK:
pip install replicate
  1. Run a text generation model:
import replicate

output = replicate.run(
    'meta/meta-llama-3.1-405b-instruct',
    input={'prompt': 'Explain quantum computing in simple terms'}
)
print(''.join(output))
  1. Generate an image:
output = replicate.run(
    'stability-ai/sdxl:latest',
    input={'prompt': 'A sunset over mountains, oil painting style'}
)
print(output[0])  # Image URL
§04

Example

Deploying a custom model with Cog:

# predict.py for Cog packaging
from cog import BasePredictor, Input
import torch

class Predictor(BasePredictor):
    def setup(self):
        self.model = torch.load('model.pth')

    def predict(
        self,
        text: str = Input(description='Input text'),
        temperature: float = Input(
            description='Sampling temperature',
            default=0.7,
            ge=0.0,
            le=2.0
        ),
    ) -> str:
        return self.model.generate(text, temperature)
# Build and push to Replicate
cog push r8.im/your-username/your-model
§05

Related on TokRepo

§06

Common pitfalls

  • Cold starts on infrequently used models can add 10-30 seconds of latency. Use warm model endpoints for production workloads.
  • Not streaming responses for text generation causes unnecessary waiting. Use the streaming API for real-time token output.
  • Pay-per-second billing can surprise you with large batch jobs. Estimate costs before running thousands of predictions.

常见问题

How much does Replicate cost?+

Replicate uses pay-per-second billing. Costs vary by model and GPU type. A Llama 3.1 70B inference costs approximately $0.65 per million input tokens. Image generation with SDXL costs a few cents per image. Check replicate.com/pricing for current rates.

Can I deploy custom models on Replicate?+

Yes. Use Cog (Replicate's open-source packaging tool) to containerize your model with a predict.py file. Push to Replicate with cog push and your model gets an API endpoint automatically.

What models does Replicate host?+

Replicate hosts thousands of models including Meta Llama, Stable Diffusion, OpenAI Whisper, Mistral, community fine-tunes, image generation, video generation, and audio models. Browse models at replicate.com/explore.

Does Replicate support streaming?+

Yes. For text generation models, Replicate supports token-level streaming via the Python SDK and HTTP SSE. This reduces time-to-first-token for chat applications.

How does Replicate compare to running models locally?+

Replicate trades infrastructure management for per-request costs. Running locally requires GPU hardware and setup but has no per-request costs. Replicate is ideal for prototyping, low-volume production, and models too large for your hardware.

引用来源 (3)
🙏

来源与感谢

replicate.com — Cloud AI model platform

讨论

登录后参与讨论。
还没有评论,来写第一条吧。

相关资产