SkillsMay 8, 2026·4 min read

OpenRouter Auto Routing — Pick the Best Model per Query

OpenRouter Auto routes each query to the optimal model balancing cost, latency, capability. Set model=openrouter/auto, the router decides per-prompt.

Agent ready

Safe staging for this asset

This asset is staged first. The copied prompt tells the agent to inspect the staged files and ask before activating scripts, MCP config, or global config.

Stage only · 29/100Policy: stage
Agent surface
Any MCP/CLI agent
Kind
Skill
Install
Stage only
Trust
Trust: Community
Entrypoint
Asset
Safe staging command
npx -y tokrepo@latest install f9652b65-04fc-40e1-9fec-fa3ef2ec5921 --target codex

Stages files first; activation requires review of the staged README and plan.

Intro

OpenRouter Auto Routing picks the best model for each prompt automatically — analyzing the task, then routing to a balance of cost, latency, and capability. Cheap chitchat goes to Llama 3.3 on Groq; complex code goes to Claude Sonnet; long-context retrieval goes to Gemini Pro. Best for: apps with diverse query types where one fixed model is either too expensive or too weak. Works with: any OpenAI SDK pointing at OpenRouter. Setup time: 1 minute.


Use auto routing

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Each call is routed independently
quick = client.chat.completions.create(
    model="openrouter/auto",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
# → routed to a cheap, fast model (e.g. Llama 3.3 on Groq, ~$0.0001)

complex = client.chat.completions.create(
    model="openrouter/auto",
    messages=[{"role": "user", "content": "Refactor this 500-line Python file..."}],
)
# → routed to a coding model (e.g. Claude Sonnet, ~$0.05)

print(quick.model)    # "meta-llama/llama-3.3-70b-instruct"
print(complex.model)  # "anthropic/claude-3.5-sonnet"

The actual model used is in response.model. Log it with PostHog or Helicone for cost analysis.

Constrain the auto-pool

extra_body = {
    "models": [
        "anthropic/claude-3.5-sonnet",
        "anthropic/claude-3.5-haiku",
        "openai/gpt-4o-mini",
    ],
    # Auto picks the best from THIS list
}

response = client.chat.completions.create(
    model="openrouter/auto",
    messages=[...],
    extra_body=extra_body,
)

Useful when you have data-residency or compliance constraints — only certain providers allowed.

Provider preferences

extra_body = {
    "models": ["openrouter/auto"],
    "provider": {
        "sort": "throughput",        # "price" | "latency" | "throughput"
        "data_collection": "deny",
        "allow_fallbacks": True,
    },
}

sort: price → cheapest provider that meets the prompt's needs. sort: latency → fastest first-byte time. sort: throughput → highest tokens/sec for streaming.

When NOT to use auto

  • You have benchmarked your prompts on one specific model — pinning is safer
  • Compliance requires a specific deployment region (provider-pin instead)
  • You need exact cost predictability (auto = variable cost)

FAQ

Q: How accurate is auto routing? A: Good for low-stakes tasks, mediocre for nuanced ones. The router uses heuristics + a fast classifier on the prompt. For prompts at the boundary (medium complexity) it can pick a model that's slightly under-spec'd. Constrain the pool when stakes matter.

Q: Does auto routing increase latency? A: Negligibly — the routing decision adds ~10-50ms before the actual call. The fastest tier (Groq Llama for chitchat) often more than makes up for it.

Q: Can I see what auto picked? A: Yes — response.model returns the actual model used. Log this for analysis. PostHog LLM Observability shows it as a property on each call.


Quick Use

  1. Have an OpenRouter API key
  2. In any OpenAI SDK call, use model="openrouter/auto"
  3. Optional: pass extra_body={"models": [...], "provider": {"sort": "price"}} to constrain

Intro

OpenRouter Auto Routing picks the best model for each prompt automatically — analyzing the task, then routing to a balance of cost, latency, and capability. Cheap chitchat goes to Llama 3.3 on Groq; complex code goes to Claude Sonnet; long-context retrieval goes to Gemini Pro. Best for: apps with diverse query types where one fixed model is either too expensive or too weak. Works with: any OpenAI SDK pointing at OpenRouter. Setup time: 1 minute.


Use auto routing

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

# Each call is routed independently
quick = client.chat.completions.create(
    model="openrouter/auto",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
# → routed to a cheap, fast model (e.g. Llama 3.3 on Groq, ~$0.0001)

complex = client.chat.completions.create(
    model="openrouter/auto",
    messages=[{"role": "user", "content": "Refactor this 500-line Python file..."}],
)
# → routed to a coding model (e.g. Claude Sonnet, ~$0.05)

print(quick.model)    # "meta-llama/llama-3.3-70b-instruct"
print(complex.model)  # "anthropic/claude-3.5-sonnet"

The actual model used is in response.model. Log it with PostHog or Helicone for cost analysis.

Constrain the auto-pool

extra_body = {
    "models": [
        "anthropic/claude-3.5-sonnet",
        "anthropic/claude-3.5-haiku",
        "openai/gpt-4o-mini",
    ],
    # Auto picks the best from THIS list
}

response = client.chat.completions.create(
    model="openrouter/auto",
    messages=[...],
    extra_body=extra_body,
)

Useful when you have data-residency or compliance constraints — only certain providers allowed.

Provider preferences

extra_body = {
    "models": ["openrouter/auto"],
    "provider": {
        "sort": "throughput",        # "price" | "latency" | "throughput"
        "data_collection": "deny",
        "allow_fallbacks": True,
    },
}

sort: price → cheapest provider that meets the prompt's needs. sort: latency → fastest first-byte time. sort: throughput → highest tokens/sec for streaming.

When NOT to use auto

  • You have benchmarked your prompts on one specific model — pinning is safer
  • Compliance requires a specific deployment region (provider-pin instead)
  • You need exact cost predictability (auto = variable cost)

FAQ

Q: How accurate is auto routing? A: Good for low-stakes tasks, mediocre for nuanced ones. The router uses heuristics + a fast classifier on the prompt. For prompts at the boundary (medium complexity) it can pick a model that's slightly under-spec'd. Constrain the pool when stakes matter.

Q: Does auto routing increase latency? A: Negligibly — the routing decision adds ~10-50ms before the actual call. The fastest tier (Groq Llama for chitchat) often more than makes up for it.

Q: Can I see what auto picked? A: Yes — response.model returns the actual model used. Log this for analysis. PostHog LLM Observability shows it as a property on each call.


Source & Thanks

Built by OpenRouter. Commercial product.

openrouter.ai/docs — Auto Routing docs

🙏

Source & Thanks

Built by OpenRouter. Commercial product.

openrouter.ai/docs — Auto Routing docs

Discussion

Sign in to join the discussion.
No comments yet. Be the first to share your thoughts.

Related Assets