# Together AI Evaluations Skill for Claude Code

> Skill that teaches Claude Code Together AI's LLM evaluation framework. Run LLM-as-a-judge evaluations to score model outputs on quality, safety, and task completion.

## Install

Save the skill content to `.claude/skills/`, or append it to your `CLAUDE.md`.

## Quick Use

```bash
npx skills add togethercomputer/skills
```

## What is This Skill?

This skill teaches AI coding agents how to use Together AI's evaluation framework. It provides LLM-as-a-judge patterns for scoring model outputs on quality, safety, helpfulness, and task-specific criteria.

**Answer-Ready**: Together AI Evaluations Skill for coding agents. LLM-as-a-judge framework for scoring model outputs. Evaluates quality, safety, and task completion automatically. Part of the official 12-skill collection.

**Best for**: ML teams evaluating and comparing LLM outputs.

**Works with**: Claude Code, Cursor, Codex CLI.

## What the Agent Learns

### Run Evaluation

```python
from together import Together

client = Together()

# Use a judge model to evaluate outputs
eval_result = client.evaluations.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    criteria=["helpfulness", "accuracy", "safety"],
    inputs=[
        {"prompt": "What is Python?", "response": "Python is a programming language..."},
    ],
)

for score in eval_result.scores:
    print(f"{score.criterion}: {score.value}/5")
```

### Evaluation Criteria

| Criterion | Measures |
|-----------|----------|
| Helpfulness | How useful the response is |
| Accuracy | Factual correctness |
| Safety | Harmful content detection |
| Relevancy | Whether the response stays on topic |
| Coherence | Logical flow |

## FAQ

**Q: Which model works best as a judge?**
A: Larger models (70B+) are more reliable judges. Llama 3.3 70B is recommended.

## Source & Thanks

> Part of [togethercomputer/skills](https://github.com/togethercomputer/skills) — MIT licensed.
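Once the judge returns per-example scores like those printed above, teams usually want an average per criterion across a batch. Here is a minimal, SDK-independent sketch of that aggregation step; the `batch` shape and values are illustrative, mirroring the 1-5 scoring shown earlier rather than any guaranteed response format.

```python
from collections import defaultdict

def aggregate_scores(results):
    """Average each criterion's 1-5 judge scores across a batch of examples."""
    totals = defaultdict(list)
    for scores in results:
        for criterion, value in scores.items():
            totals[criterion].append(value)
    return {c: sum(v) / len(v) for c, v in totals.items()}

# Hypothetical per-example scores, one dict per evaluated input
batch = [
    {"helpfulness": 4, "accuracy": 5, "safety": 5},
    {"helpfulness": 3, "accuracy": 4, "safety": 5},
]

print(aggregate_scores(batch))
# {'helpfulness': 3.5, 'accuracy': 4.5, 'safety': 5.0}
```

Averaging per criterion (rather than one overall score) makes regressions easier to localize, e.g. a model update that raises helpfulness but lowers safety.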
---

Source: https://tokrepo.com/en/workflows/6188a8b0-3520-4b00-b3f1-65c0a94f7715
Author: Agent Toolkit