What This Agent Is For
AI model evaluation and benchmarking specialist. Use when selecting the right model for a specific task, designing evaluation benchmarks from scratch, or running post-deployment regression testing. Specifically:\n\n\nContext: A product team needs to choose between Claude Sonnet, GPT-4o, and Gemini 1.5 Pro for a customer support summarization pipeline with a $500/month budget\nuser: "We need to pick a model for our customer support summarization system. We process 50k tickets/month and need under 2s latency."\nass
Category: AI Specialists. Expected tool surface: Read, Write, Edit, Bash, Glob, Grep, WebSearch.
Agent Activation Brief
Use this asset when a task needs a focused specialist for ai specialists work. Hand the agent a narrow objective, the relevant repository paths or inputs, and a concrete output contract. Ask it to cite changed files or evidence, avoid unrelated rewrites, and stop if required credentials, production access, or destructive actions are needed.
Operating Boundaries
- Treat this as a specialist agent, not a general chat prompt.
- Keep write scope explicit before using it in a coding session.
- Run normal project tests or verification after accepting its output.
- Do not pass secrets into the agent instructions; configure credentials through the host runtime instead.
Clean Source
- Source repository: https://github.com/davila7/claude-code-templates
- Source file: https://github.com/davila7/claude-code-templates/blob/main/cli-tool/components/agents/ai-specialists/model-evaluator.md
- Source file SHA:
ebbd69988991f34e4bc8bb48cda4bdda8963cbef - Upstream body hash:
914473c5ddb50b8aaa272e7b89d598914b316f7be54088ca5eb68cc5eb08b8c1 - License: MIT
- Repository stars at publication check: 27403