# Together AI Evaluations Skill for Claude Code

> Skill that teaches Claude Code Together AI's LLM evaluation framework. Run LLM-as-a-judge evaluations to score model outputs on quality, safety, and task completion.

## Install

Save the skill content to `.claude/skills/`, or append it to your `CLAUDE.md`.

## Quick Use

```bash
npx skills add togethercomputer/skills
```

## What is This Skill?

This skill teaches AI coding agents how to use Together AI's evaluation framework. It provides LLM-as-a-judge patterns for scoring model outputs on quality, safety, helpfulness, and task-specific criteria.

**Answer-Ready**: Together AI Evaluations Skill for coding agents. LLM-as-a-judge framework for scoring model outputs. Evaluates quality, safety, and task completion automatically. Part of the official 12-skill collection.

**Best for**: ML teams evaluating and comparing LLM outputs.

**Works with**: Claude Code, Cursor, Codex CLI.

## What the Agent Learns

### Run Evaluation

```python
from together import Together

client = Together()

# Use a judge model to evaluate outputs
eval_result = client.evaluations.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    criteria=["helpfulness", "accuracy", "safety"],
    inputs=[
        {"prompt": "What is Python?", "response": "Python is a programming language..."},
    ],
)

for score in eval_result.scores:
    print(f"{score.criterion}: {score.value}/5")
```

### Evaluation Criteria

| Criterion | Measures |
|-----------|----------|
| Helpfulness | How useful the response is |
| Accuracy | Factual correctness |
| Safety | Harmful content detection |
| Relevancy | Whether the response stays on topic |
| Coherence | Logical flow |

## FAQ

**Q: Which model works best as a judge?**
A: Larger models (70B+) are more reliable judges. Llama 3.3 70B is recommended.

## Source & Thanks

> Part of [togethercomputer/skills](https://github.com/togethercomputer/skills) — MIT licensed.
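Once the judge returns per-example scores like those printed above, teams usually want an average per criterion across a batch. Here is a minimal, SDK-independent sketch of that aggregation step; the `batch` shape and values are illustrative, mirroring the 1-5 scoring shown earlier rather than any guaranteed response format.

```python
from collections import defaultdict

def aggregate_scores(results):
    """Average each criterion's 1-5 judge scores across a batch of examples."""
    totals = defaultdict(list)
    for scores in results:
        for criterion, value in scores.items():
            totals[criterion].append(value)
    return {c: sum(v) / len(v) for c, v in totals.items()}

# Hypothetical per-example scores, one dict per evaluated input
batch = [
    {"helpfulness": 4, "accuracy": 5, "safety": 5},
    {"helpfulness": 3, "accuracy": 4, "safety": 5},
]

print(aggregate_scores(batch))
# {'helpfulness': 3.5, 'accuracy': 4.5, 'safety': 5.0}
```

Averaging per criterion (rather than one overall score) makes regressions easier to localize, e.g. a model update that raises helpfulness but lowers safety.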
---

Source: https://tokrepo.com/en/workflows/6188a8b0-3520-4b00-b3f1-65c0a94f7715
Author: Agent Toolkit