# TruLens — Evaluate and Track LLM Apps

> Instrument LLM apps and run systematic evals (feedback functions for RAG quality, plus regression tests) to find failure modes fast. Combine tracing and scorecards in one workflow.

## Quick Use

1. Install:

   ```bash
   pip install trulens
   ```

2. Run:

   ```bash
   python -c "import trulens; print('trulens ok')"
   ```

3. Verify:
   - Run one quickstart evaluation and confirm you get non-empty scores and a trace view for at least one run.

---

## Intro

Instrument LLM apps and run systematic evals for RAG quality and regressions to find failure modes fast. Combine tracing and scorecards in one workflow.

- **Best for:** RAG/agent builders who want measurable quality (before/after) instead of vibe-checking prompts
- **Works with:** Python, LLM app frameworks (LangChain/RAG pipelines), notebooks + CI-friendly eval runs
- **Setup time:** 15 minutes

### Quantitative Notes

- Setup time ~15 minutes (install + one quickstart notebook or script)
- GitHub stars + forks (verified): see Source & Thanks
- Start with 10–50 eval cases to catch regressions early (then scale up)

---

## Practical Notes

Treat evals like unit tests: freeze a small, representative dataset, define 2–4 core metrics, and make them run on every change that touches prompts, retrieval, or tooling. When a score drops, inspect traces to find which step (retrieval, reasoning, formatting) caused the regression.

**Safety note:** Avoid optimizing for a single metric; use a small metric set (quality + safety) and review traces for overfitting.

### FAQ

**Q: Is it only for RAG?**
A: No. It's useful for any LLM app: chatbots, agents, tool callers, and prompt workflows.

**Q: How do I use it in CI?**
A: Export eval cases as data, run scoring on each PR, and fail the build on threshold drops.
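The CI flow above can be sketched as a small gate script. This is a minimal, framework-agnostic sketch: `score_case`, the metric names, and the threshold values are hypothetical stand-ins, not TruLens APIs — in practice you would replace the stub with your app call plus TruLens feedback-function scores.

```python
# Minimal CI-gate sketch (hypothetical names throughout).
# score_case() is a stand-in for running your app on one eval case
# and scoring the answer; metric floors below are illustrative.
from statistics import mean

THRESHOLDS = {"groundedness": 0.7, "relevance": 0.7}  # example floors

def score_case(case: dict) -> dict:
    """Hypothetical: run the app on case["query"] and score the answer."""
    return {"groundedness": 0.9, "relevance": 0.85}  # stub scores

def run_gate(cases: list[dict]) -> bool:
    """Return False when any mean metric falls below its floor."""
    scores = [score_case(c) for c in cases]
    passed = True
    for metric, floor in THRESHOLDS.items():
        avg = mean(s[metric] for s in scores)
        print(f"{metric}: {avg:.2f} (floor {floor})")
        if avg < floor:
            passed = False
    return passed

# In CI, exit nonzero on failure so the build fails, e.g.:
#   import sys; sys.exit(0 if run_gate(frozen_cases) else 1)
```

Keeping the eval set frozen (checked into the repo) is what makes score drops attributable to the change in the PR rather than to shifting test data.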
**Q: What should I measure first?**
A: Start with retrieval relevance + groundedness for RAG, then add task success and safety checks.

---

## Source & Thanks

> GitHub: https://github.com/truera/trulens
> Owner avatar: https://avatars.githubusercontent.com/u/51224128?v=4
> License (SPDX): MIT
> GitHub stars (verified via `api.github.com/repos/truera/trulens`): 3,305
> GitHub forks (verified via `api.github.com/repos/truera/trulens`): 274

---

Source: https://tokrepo.com/en/workflows/trulens-evaluate-and-track-llm-apps
Author: Agent Toolkit