What is This Skill?
This skill teaches AI coding agents how to use Together AI's evaluation framework. It provides LLM-as-a-judge patterns for scoring model outputs on quality, safety, helpfulness, and task-specific criteria.
Best for: ML teams evaluating and comparing LLM outputs. Works with: Claude Code, Cursor, Codex CLI.
What the Agent Learns
Run Evaluation
```python
from together import Together

client = Together()

# Use a judge model to evaluate outputs
eval_result = client.evaluations.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    criteria=["helpfulness", "accuracy", "safety"],
    inputs=[
        {"prompt": "What is Python?", "response": "Python is a programming language..."},
    ],
)

for score in eval_result.scores:
    print(f"{score.criterion}: {score.value}/5")
```
Evaluation Criteria
| Criterion | Measures |
|---|---|
| Helpfulness | How useful the response is to the user |
| Accuracy | Factual correctness |
| Safety | Absence of harmful or unsafe content |
| Relevancy | Whether the response stays on topic |
| Coherence | Logical flow and readability |
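If the managed evaluations endpoint isn't available in your SDK version, the same LLM-as-a-judge pattern can be sketched by hand: build a rubric prompt per criterion, send it to a judge model, and parse the 1-5 score from the reply. The helpers below (`build_judge_prompt`, `parse_score`, and the `CRITERIA` rubrics) are illustrative, not part of the Together SDK.

```python
import re

# Hypothetical rubric text per criterion, mirroring the table above
CRITERIA = {
    "helpfulness": "How useful the response is to the user",
    "accuracy": "Factual correctness",
    "safety": "Absence of harmful or unsafe content",
}

def build_judge_prompt(criterion: str, prompt: str, response: str) -> str:
    """Assemble a rubric prompt asking the judge model for a 1-5 score."""
    rubric = CRITERIA[criterion]
    return (
        f"You are an impartial judge. Rate the response on {criterion} "
        f"({rubric}) from 1 (worst) to 5 (best).\n\n"
        f"Prompt: {prompt}\n"
        f"Response: {response}\n\n"
        "Reply with exactly one line in the form: Score: <1-5>"
    )

def parse_score(judge_output: str) -> int:
    """Extract the integer score from the judge model's reply."""
    match = re.search(r"Score:\s*([1-5])", judge_output)
    if match is None:
        raise ValueError(f"no score found in: {judge_output!r}")
    return int(match.group(1))
```

To run it, pass the prompt as a user message to `client.chat.completions.create` with a large judge model (e.g. Llama 3.3 70B), then call `parse_score` on the reply text.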
FAQ
Q: Which model works best as a judge? A: Larger models (70B+) are more reliable judges; Llama 3.3 70B is the recommended default.
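Once per-example scores are collected, a common next step is averaging them per criterion to compare runs or models. A minimal sketch, assuming each example's scores have been gathered into a `{criterion: score}` dict (the `aggregate_scores` helper is illustrative, not part of the SDK):

```python
from collections import defaultdict

def aggregate_scores(results):
    """Average per-criterion judge scores across evaluated examples.

    `results` is a list of {criterion: score} dicts, one per input --
    the shape you'd build by collecting eval_result.scores per example.
    """
    by_criterion = defaultdict(list)
    for scores in results:
        for criterion, value in scores.items():
            by_criterion[criterion].append(value)
    # Mean per criterion, rounded for readable reports
    return {c: round(sum(v) / len(v), 2) for c, v in sorted(by_criterion.items())}

summary = aggregate_scores([
    {"helpfulness": 4, "accuracy": 5},
    {"helpfulness": 5, "accuracy": 4},
])
print(summary)
```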