# TruLens — Evaluate and Track LLM Apps

> Instrument LLM apps and run systematic evals (feedback functions for RAG quality, plus regression tests) to find failure modes fast. Combine tracing and scorecards in one workflow.

## Quick Use

1. Install:

   ```bash
   pip install trulens
   ```

2. Run:

   ```bash
   python -c "import trulens; print('trulens ok')"
   ```

3. Verify:
   - Run one quickstart evaluation and confirm you get non-empty scores and a trace view for at least one run.

---

## Intro

Instrument LLM apps and run systematic evals for RAG quality and regressions to find failure modes fast. Combine tracing and scorecards in one workflow.

- **Best for:** RAG/agent builders who want measurable quality (before/after) instead of vibe-checking prompts
- **Works with:** Python, LLM app frameworks (LangChain/RAG pipelines), notebooks + CI-friendly eval runs
- **Setup time:** 15 minutes

### Quantitative Notes

- Setup time ~15 minutes (install + one quickstart notebook or script)
- GitHub stars + forks (verified): see Source & Thanks
- Start with 10–50 eval cases to catch regressions early (then scale up)

---

## Practical Notes

Treat evals like unit tests: freeze a small, representative dataset, define 2–4 core metrics, and make them run on every change that touches prompts, retrieval, or tooling. When a score drops, inspect traces to find which step (retrieval, reasoning, formatting) caused the regression.

**Safety note:** Avoid optimizing for a single metric; use a small metric set (quality + safety) and review traces for overfitting.

### FAQ

**Q: Is it only for RAG?**
A: No. It's useful for any LLM app: chatbots, agents, tool callers, and prompt workflows.

**Q: How do I use it in CI?**
A: Export eval cases as data, run scoring on each PR, and fail the build on threshold drops.
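The CI flow above can be sketched as a small gate script. This is a minimal, framework-agnostic sketch: `score_case`, the metric names, and the threshold values are hypothetical stand-ins, not TruLens APIs — in practice you would replace the stub with your app call plus TruLens feedback-function scores.

```python
# Minimal CI-gate sketch (hypothetical names throughout).
# score_case() is a stand-in for running your app on one eval case
# and scoring the answer; metric floors below are illustrative.
from statistics import mean

THRESHOLDS = {"groundedness": 0.7, "relevance": 0.7}  # example floors

def score_case(case: dict) -> dict:
    """Hypothetical: run the app on case["query"] and score the answer."""
    return {"groundedness": 0.9, "relevance": 0.85}  # stub scores

def run_gate(cases: list[dict]) -> bool:
    """Return False when any mean metric falls below its floor."""
    scores = [score_case(c) for c in cases]
    passed = True
    for metric, floor in THRESHOLDS.items():
        avg = mean(s[metric] for s in scores)
        print(f"{metric}: {avg:.2f} (floor {floor})")
        if avg < floor:
            passed = False
    return passed

# In CI, exit nonzero on failure so the build fails, e.g.:
#   import sys; sys.exit(0 if run_gate(frozen_cases) else 1)
```

Keeping the eval set frozen (checked into the repo) is what makes score drops attributable to the change in the PR rather than to shifting test data.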
**Q: What should I measure first?**
A: Start with retrieval relevance + groundedness for RAG, then add task success and safety checks.

---

## Source & Thanks

> GitHub: https://github.com/truera/trulens
> Owner avatar: https://avatars.githubusercontent.com/u/51224128?v=4
> License (SPDX): MIT
> GitHub stars (verified via `api.github.com/repos/truera/trulens`): 3,305
> GitHub forks (verified via `api.github.com/repos/truera/trulens`): 274

---

Source: https://tokrepo.com/en/workflows/trulens-evaluate-and-track-llm-apps
Author: Agent Toolkit