Skills · Apr 8, 2026 · 1 min read

Together AI Dedicated Endpoints Skill for Agents

A skill that teaches Claude Code how to work with Together AI's dedicated endpoints API: deploy single-tenant GPU inference with autoscaling, no rate limits, and custom model configurations.

TL;DR
This skill teaches coding agents how to provision and manage Together AI dedicated endpoints with autoscaling on single-tenant GPUs.
§01

What it is

Together AI Dedicated Endpoints Skill is a configuration package that teaches AI coding agents how to provision and manage dedicated GPU inference endpoints on Together AI. It covers creating endpoints with specific hardware (H100, H200, A100), configuring autoscaling, and managing the endpoint lifecycle through the Together API.

The skill targets teams running LLMs in production at scale who need single-tenant GPU resources, no rate limits, and custom model configurations. It works with Claude Code, Cursor, and Codex CLI.

§02

How it saves time or tokens

The skill encodes Together AI's API patterns, hardware options, and best practices directly into the agent's context. Instead of reading documentation and writing boilerplate API calls, the agent generates correct endpoint provisioning code on the first attempt. Configuring autoscaling also keeps costs tied to actual inference load rather than peak capacity.

§03

How to use

  1. Install the skill: npx skills add togethercomputer/skills.
  2. Ask your AI coding agent to create a dedicated endpoint for a specific model.
  3. The agent generates Python code using the Together SDK with correct hardware and scaling parameters.
§04

Example

from together import Together

client = Together()

# Create a dedicated endpoint
endpoint = client.endpoints.create(
    model='meta-llama/Llama-3.3-70B-Instruct-Turbo',
    hardware='gpu-h100-80gb',
    min_replicas=1,
    max_replicas=4,
    autoscale=True,
)
print(f'Endpoint URL: {endpoint.url}')

# Scale the endpoint
client.endpoints.update(endpoint.id, min_replicas=2)

# Check status
status = client.endpoints.get(endpoint.id)
print(f'Status: {status.state}')
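
Once the endpoint is in a running state, you can call it through the same OpenAI-compatible chat completions interface as Together's serverless models. A minimal sketch, assuming requests to the dedicated endpoint are routed by the model name used at creation time (your account may route by endpoint name or URL instead):

from together import Together

client = Together()

# Send a chat completion request. The model name below is assumed to route
# to the dedicated endpoint created above; adjust it to whatever identifier
# your endpoint exposes.
response = client.chat.completions.create(
    model='meta-llama/Llama-3.3-70B-Instruct-Turbo',
    messages=[{'role': 'user', 'content': 'Summarize autoscaling in one sentence.'}],
    max_tokens=100,
)
print(response.choices[0].message.content)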
§05


Common pitfalls

  • Dedicated endpoints incur per-hour costs regardless of request volume; scale down min_replicas during off-peak hours (see the sketch after this list).
  • Model availability varies by hardware type; check Together AI's model compatibility matrix before selecting GPU hardware.
  • Autoscaling has a cold start delay when scaling from zero replicas; keep min_replicas at 1 for latency-sensitive applications.
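
A minimal sketch of the off-peak scale-down from the first pitfall, reusing the endpoints.update call from the example above; whether min_replicas can be set to 0 (scale-to-zero) depends on your plan, so treat the exact parameters as an assumption:

from datetime import datetime, timezone
from together import Together

client = Together()
ENDPOINT_ID = 'endpoint-xxxx'  # placeholder; use the id returned by endpoints.create

# Keep one warm replica during business hours (09:00-18:00 UTC) and
# drop the floor to zero overnight to avoid idle per-hour charges.
hour = datetime.now(timezone.utc).hour
if 9 <= hour < 18:
    client.endpoints.update(ENDPOINT_ID, min_replicas=1)
else:
    client.endpoints.update(ENDPOINT_ID, min_replicas=0)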

Frequently Asked Questions

What hardware options are available?

Together AI offers H100 80GB, H200 141GB, and A100 80GB GPUs. H100 is recommended for most large models, H200 for the largest models that exceed 80GB VRAM, and A100 for cost-effective deployments.

How does autoscaling work?

Autoscaling adjusts the number of replicas based on request load between your configured min and max replicas. It scales up when queue depth increases and scales down when load decreases.

Can I deploy custom fine-tuned models?

Yes. Together AI supports deploying custom models that you have fine-tuned on their platform or uploaded. Specify the model path when creating the endpoint.
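
A minimal sketch, mirroring the create call from the example above; 'my-org/Llama-3.3-70B-Instruct-ft-support' is a hypothetical placeholder for the output name of your fine-tuning job:

from together import Together

client = Together()

# The model identifier below is a hypothetical fine-tuned model name;
# replace it with the output name reported by your fine-tuning job.
endpoint = client.endpoints.create(
    model='my-org/Llama-3.3-70B-Instruct-ft-support',
    hardware='gpu-h100-80gb',
    min_replicas=1,
    max_replicas=2,
    autoscale=True,
)
print(endpoint.url)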

What is the difference between dedicated and serverless endpoints?

Dedicated endpoints run on single-tenant GPUs reserved for your workload with no rate limits. Serverless endpoints share infrastructure with other users and have rate limits but cost less for intermittent usage.

How do I monitor endpoint performance?

Use the Together SDK to query endpoint metrics including request count, latency, and queue depth. The Together AI dashboard also provides visual monitoring and alerting.


Source & Thanks

Part of togethercomputer/skills — MIT licensed.
