Agenta — Open-Source LLMOps Platform
Prompt playground, evaluation, and observability in one platform. Compare prompts, run evals, trace production calls. 4K+ stars.
What it is
Agenta is an open-source LLMOps platform that combines prompt engineering, evaluation, and production observability in a single tool. It provides a visual playground for testing prompts across models, a framework for running automated evaluations, and tracing for monitoring production LLM calls. The platform works with OpenAI, Anthropic, and any OpenAI-compatible API.
Agenta targets AI engineering teams who need to iterate on prompts systematically, compare model performance with quantitative metrics, and monitor production LLM applications without stitching together separate tools for each concern.
How it saves time or tokens
Agenta's side-by-side prompt comparison lets you test variations against the same inputs simultaneously. Instead of running prompts sequentially and manually comparing outputs, you see results side by side with latency and cost metrics. The evaluation framework automates quality checks, reducing the manual review burden.
Production tracing captures every LLM call with inputs, outputs, latency, and cost. When issues arise, you trace them to the specific prompt version and input that caused the problem, rather than guessing.
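A minimal tracing sketch, assuming the SDK's ag.init() and @ag.instrument() observability helpers (names and arguments may differ between SDK versions, so treat this as the shape of the integration rather than a drop-in snippet):

import agenta as ag
from openai import AsyncOpenAI

# Assumption: ag.init() reads AGENTA_HOST / AGENTA_API_KEY from the environment
# and points trace export at your Agenta instance.
ag.init()

client = AsyncOpenAI()

# Assumption: @ag.instrument() records this call's inputs, outputs, latency,
# and cost as a span that appears in Agenta's trace view.
@ag.instrument()
async def answer(question: str) -> str:
    response = await client.chat.completions.create(
        model='gpt-4o',
        messages=[{'role': 'user', 'content': question}]
    )
    return response.choices[0].message.content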
How to use
- Deploy Agenta via Docker Compose: run docker compose up from the repository, then access the web UI at localhost.
- Create an application in the playground. Write your prompt template, select a model, and test with sample inputs.
- Set up evaluations with test datasets. Define evaluation criteria (exact match, LLM-as-judge, custom metrics) and run batch evaluations to compare prompt versions quantitatively; a custom evaluator sketch follows this list.
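Custom evaluators are plain Python functions that score a single output against a reference answer. The signature below is illustrative (check Agenta's custom-evaluator docs for the exact arguments it passes); the shape to note is: take the app output plus the expected answer, return a numeric score.

# Illustrative custom evaluator; the exact signature Agenta expects may differ,
# so verify against the custom-evaluator docs before using it.
def evaluate(app_params: dict, inputs: dict, output: str, correct_answer: str) -> float:
    # Proportion of reference keywords that appear in the output.
    keywords = [w.lower() for w in correct_answer.split()]
    if not keywords:
        return 0.0
    hits = sum(1 for w in keywords if w in output.lower())
    return hits / len(keywords)

Scores in the 0 to 1 range aggregate cleanly across a dataset and make prompt versions easy to compare.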
Example
import agenta as ag

# Define a prompt variant; @ag.entrypoint registers the function with the Agenta playground.
@ag.entrypoint
async def summarize(text: str) -> str:
    response = await ag.llm.chat.completions.create(
        model='gpt-4o',
        messages=[
            {'role': 'system', 'content': 'Summarize the following text concisely.'},
            {'role': 'user', 'content': text}
        ],
        # FloatParam(default, min, max) exposes temperature as a UI-adjustable parameter.
        temperature=ag.FloatParam(0.3, 0, 1)
    )
    return response.choices[0].message.content
The ag.FloatParam makes temperature adjustable from the UI without code changes. Each variant is tracked with version history.
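If the SDK's other parameter types follow the same pattern (ag.TextParam and ag.MultipleChoiceParam are assumed here; check the current SDK docs before relying on them), the system prompt and model choice can be surfaced in the playground the same way:

import agenta as ag

@ag.entrypoint
async def summarize(text: str) -> str:
    # Assumed parameter types: TextParam and MultipleChoiceParam would surface as
    # editable fields in the playground, like FloatParam does for temperature.
    system_prompt = ag.TextParam('Summarize the following text concisely.')
    model = ag.MultipleChoiceParam('gpt-4o', ['gpt-4o', 'gpt-4o-mini'])
    response = await ag.llm.chat.completions.create(
        model=model,
        messages=[
            {'role': 'system', 'content': system_prompt},
            {'role': 'user', 'content': text}
        ],
        temperature=ag.FloatParam(0.3, 0, 1)
    )
    return response.choices[0].message.content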
Related on TokRepo
- AI gateway providers — LLM proxy and observability solutions
- AI tools for testing — Testing frameworks for AI applications
Common pitfalls
- Self-hosted Agenta requires Docker and reasonable resources (4GB+ RAM). The managed cloud version avoids infrastructure management but has usage limits on the free tier.
- Evaluation datasets need to be representative of production traffic. Running evals on toy examples gives misleading results about prompt quality.
- Tracing adds minimal overhead but stores all inputs and outputs. For applications processing sensitive data, configure data retention policies and redaction rules; a client-side scrubbing sketch follows this list.
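Agenta's own retention and redaction settings live in the platform configuration. Independently of those, you can scrub obvious PII before it ever reaches a traced function. A generic pre-processing sketch, not an Agenta API:

import re

# Generic scrubber: masks emails and phone numbers before the payload is
# passed into an instrumented (traced) function.
EMAIL = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
PHONE = re.compile(r'\+?\d[\d\s().-]{7,}\d')

def redact(text: str) -> str:
    text = EMAIL.sub('[EMAIL]', text)
    text = PHONE.sub('[PHONE]', text)
    return text

# Usage: pass redact(user_input) to the traced entrypoint instead of the raw text.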
Frequently Asked Questions
How does Agenta compare to LangSmith?
Both provide prompt management and observability. Agenta is fully open source and self-hostable, while LangSmith is a commercial product from LangChain. Agenta is framework-agnostic (works without LangChain), while LangSmith has deeper integration with the LangChain ecosystem.
Can I use Agenta with local models?
Yes. Agenta works with any OpenAI-compatible API. You can connect it to Ollama, vLLM, or any local model server that exposes an OpenAI-compatible endpoint for prompt testing and evaluation.
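For instance, Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1, so anything that speaks the OpenAI chat API, including Agenta's model configuration, can point at it. A quick sanity check with the openai client (assumes the llama3 model has already been pulled locally):

from openai import OpenAI

# Ollama's OpenAI-compatible endpoint; the api_key is required by the client
# but ignored by Ollama.
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

response = client.chat.completions.create(
    model='llama3',
    messages=[{'role': 'user', 'content': 'Say hello in one sentence.'}]
)
print(response.choices[0].message.content)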
What evaluation methods does Agenta support?
Agenta supports exact match, regex match, LLM-as-judge (using another LLM to score outputs), custom Python evaluators, and human evaluation workflows. You can combine multiple evaluators in a single evaluation run.
Does Agenta track prompt versions?
Yes. Every prompt change is tracked as a version. You can compare any two versions side by side, see evaluation scores for each version, and roll back to a previous version in one click.
Can teams collaborate in Agenta?
Yes. Agenta supports team workspaces where multiple users can create and test prompt variants, run evaluations, and review production traces. Role-based access control is available in the managed cloud version.
Citations (3)
- Agenta GitHub — Open-source LLMOps platform with prompt playground and evaluation
- Agenta Documentation — Supports OpenAI, Anthropic, and OpenAI-compatible APIs
- Anthropic Evaluation Guide — LLM evaluation best practices