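TOKREPO · Theme Pack

Stable

Deploy + Monitor + Observability: The One-Stop Kit

A 10-piece kit for developers who actually ship code: deploy targets (Vercel / Kamal / Coolify) + error tracking + OpenTelemetry + metrics + logs + dashboards + uptime + alerting, wired together in order so that the next outage actually pages you.

10 assets

About this pack

What this pack contains

This is the kit a working backend engineer installs the week before a product gets real users, not the one you scramble for during an all-nighter after an outage. Every tool is open-source-first, the self-hosted pieces fit on a $20/month VPS, and each one chains into the next. Order matters: each layer feeds the one after it.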

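#	Tool	Layer	What it does
1	Vercel CLI	Deploy (PaaS)	A preview URL for every `git push`; zero config for Next/Nuxt/Astro
2	Kamal	Deploy (containers)	Zero-downtime Docker deploys to any VPS; Basecamp's in-house tool
3	Coolify	Deploy (self-hosted PaaS)	Open-source alternative to Vercel/Heroku that runs on your own server
4	Sentry	Errors + APM	Exception capture / release health / performance tracing
5	OpenTelemetry Collector	Telemetry pipeline	Vendor-neutral aggregation layer; traces / metrics / logs all flow through it
6	Prometheus	Metrics	Pull-based time-series database; the industry default
7	Grafana Loki	Logs	Log storage in the spirit of Prometheus: cheap, indexed by label
8	Grafana	Dashboards	The single pane where all the data lands
9	Uptime Kuma	Uptime + status page	Self-hosted heartbeat checks that notify you on downtime; ships with a public status page
10	Prometheus Alertmanager	Alert routing	Dedup / grouping / routing to PagerDuty / Feishu / Slack / email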

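Trade-offs you will run into

Vercel vs Kamal vs Coolify: Vercel is zero-ops and scales to zero, but it gets expensive at scale and the stack is not yours. Kamal: you own the servers, Docker is the only abstraction, and costs are low and predictable. Coolify is the middle road: a self-hosted UI on top of Docker. Most teams run their MVP on Vercel and move to Kamal/Coolify once the bill reaches about $500/month.
Sentry SaaS vs self-hosted: Self-hosted Sentry means running half a dozen or more backing services (Kafka, Postgres, Redis, ClickHouse, and others). Below roughly 100k events/month, SaaS (the free tier first, then an entry-level paid plan) costs less than your own engineering time. Self-host only once you have outgrown the free tier and have ops capacity to spare.
Prometheus + Loki + Grafana vs Datadog: Datadog is the polished paid option. The open-source stack runs on a ~$20/month VPS, while Datadog charges $15-31 per host per month, so 10 hosts can run $300+/month. The price you pay is babysitting the stack yourself. Under 10 services, open source wins outright; above 50 services, Datadog's ergonomics start to pay for themselves.
Push vs pull metrics: Prometheus uses a pull model (it scrapes you). Serverless functions and short-lived jobs are gone before a scrape can reach them, so use Pushgateway, or switch to OpenTelemetry and push to the Collector. Don't fight the model.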
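As a minimal sketch of the push side for a short-lived job, the snippet below builds (but does not send) a Pushgateway request using only the Python standard library. The `/metrics/job/<job>` path and the text exposition body follow Pushgateway's documented HTTP API; the function name, host, job name, and metric names are hypothetical.

```python
import urllib.parse
import urllib.request


def build_pushgateway_request(
    base_url: str, job: str, metrics: dict[str, float]
) -> urllib.request.Request:
    """Build (but do not send) a request that pushes gauge values to a Pushgateway."""
    lines = []
    for name, value in sorted(metrics.items()):
        # Prometheus text exposition format: a TYPE hint, then "name value".
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    # The exposition format requires a trailing newline.
    body = ("\n".join(lines) + "\n").encode("utf-8")
    # Metrics are grouped under the job name given in the URL path.
    url = f"{base_url.rstrip('/')}/metrics/job/{urllib.parse.quote(job, safe='')}"
    # PUT replaces every metric previously pushed for this group.
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
```

A batch job would call `urllib.request.urlopen(request, timeout=5)` on the result as its last step; Prometheus then scrapes the Pushgateway like any other target.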

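Common pitfalls

Alerting on causes instead of symptoms. "CPU > 80%" wakes you at 3 a.m. when the workload is actually fine. "User-facing p95 > 2s" only pages you when something is really wrong. Tune symptom alerts first; investigate causes once you are awake.
No release markers in Grafana. About half of incidents are "it broke right after a deploy." Add a POST to the Grafana annotations API in your deploy script; that marker on the timeline saves you roughly 20 minutes per incident.
Indexing every log field. Loki's whole selling point is that it does not index everything. Put 50 labels on every log line and cardinality explodes, turning cheap log storage into expensive log storage. Label by service / env / level and grep for the rest.
Sending every alert through one channel. P1 (site down) → phone call. P2 (degraded) → Slack @channel. P3 (anomaly) → daily digest. Mix them together and people ignore either the calls or the digest, so both fail.
No external uptime probe. Your internal Prometheus thinks the service is up while Cloudflare or your CDN is dropping 30% of requests in eu-west. An Uptime Kuma instance on a different network catches that, and it takes about 5 minutes to set up.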
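The release-marker pitfall above is the easiest one to fix in code. The sketch below builds the annotation request a deploy script could send, assuming Grafana's `POST /api/annotations` endpoint with a service-account token. The function name, URL, token, tags, and message text are hypothetical placeholders.

```python
import json
import time
import urllib.request


def build_release_annotation(grafana_url, api_token, service, version, when=None):
    """Build (but do not send) a Grafana annotation request marking a release."""
    # Grafana expects the annotation time as epoch milliseconds.
    seconds = time.time() if when is None else when
    payload = {
        "time": int(seconds * 1000),
        "tags": ["release", service],
        "text": f"Deployed {service} {version}",
    }
    return urllib.request.Request(
        f"{grafana_url.rstrip('/')}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )
```

Sending it with `urllib.request.urlopen(request, timeout=5)` at the end of a deploy should be enough for dashboards that show annotations tagged `release`; omitting a dashboard or panel reference keeps the marker visible across dashboards.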

Install · one command

$ tokrepo install pack/deploy-monitor-observability

Hand it to your agent, or paste it into a terminal

What's in the pack

10 assets, packaged and ready

Script#01

Vercel CLI — Preview Deployments from Terminal

Vercel CLI runs dev servers, pulls project env, and creates preview or production deployments from the terminal. Useful for agent-built web changes.

by Vercel·237 views

$ tokrepo install vercel-cli-preview-deployments-from-terminal

Skill#02

Kamal — Zero-Downtime Docker Deploys to Any Server

Kamal is Basecamp's deploy tool that ships Docker containers to bare metal or cloud VMs with a single command, giving you Heroku-like workflows on servers you actually own.

by Script Depot·261 views

$ tokrepo install kamal-zero-downtime-docker-deploys-any-server-5211d45c

Skill#03

Coolify — Self-Hosted Vercel & Netlify Alternative

Deploy apps, databases, and services on your own server with one click. No vendor lock-in. 52K+ GitHub stars.

by AI Open Source·229 views

$ tokrepo install coolify-self-hosted-vercel-netlify-alternative-202dfab1

Skill#04

Sentry — Open Source Error Tracking & Performance Monitoring

Sentry is the developer-first error tracking and performance monitoring platform. Capture exceptions, trace performance issues, and debug production errors across all languages.

by AI Open Source·329 views

$ tokrepo install sentry-open-source-error-tracking-performance-monitoring-ece57add

Skill#05

OpenTelemetry Collector — Vendor-Neutral Telemetry Pipeline

The OpenTelemetry Collector is the vendor-neutral pipeline from the CNCF OpenTelemetry project for receiving, processing, and exporting metrics, logs, and traces across any observability backend, replacing per-vendor agents with one portable binary.

by AI Open Source·265 views

$ tokrepo install opentelemetry-collector-vendor-neutral-telemetry-pipeline-1e161adc

Skill#06

Prometheus — Open Source Monitoring & Alerting Toolkit

Prometheus is the CNCF-graduated monitoring system and time series database. Pull-based metrics collection, powerful PromQL queries, and built-in alerting for cloud-native infrastructure.

by AI Open Source·248 views

$ tokrepo install prometheus-open-source-monitoring-alerting-toolkit-ed3a8de4

Skill#07

Grafana Loki — Prometheus-Inspired Log Aggregation System

Loki is a horizontally scalable, multi-tenant log aggregation system by Grafana Labs. Unlike other log systems, Loki indexes metadata about logs, not log content itself.

by Grafana Labs·404 views

$ tokrepo install grafana-loki-prometheus-inspired-log-aggregation-system-92fa7c1f

Skill#08

Grafana — Open Source Data Visualization & Observability

Grafana is the leading open-source platform for monitoring and observability. Visualize metrics, logs, and traces from Prometheus, Loki, Elasticsearch, and 100+ data sources.

by Grafana Labs·348 views

$ tokrepo install grafana-open-source-data-visualization-observability-ed1a524f

MCP#09

Uptime Kuma — Self-Hosted Uptime Monitoring

Monitor HTTP, TCP, DNS, Docker services with notifications to 90+ channels. Beautiful dashboard. 84K+ GitHub stars.

by MCP Hub·320 views

$ tokrepo install uptime-kuma-self-hosted-uptime-monitoring-88e260be

Skill#10

Prometheus Alertmanager — Alert Routing and Notification Hub

Alertmanager handles alerts sent by Prometheus, deduplicating, grouping, and routing them to the right notification channel such as email, Slack, PagerDuty, or webhooks.

by Script Depot·230 views

$ tokrepo install prometheus-alertmanager-alert-routing-notification-hub-51f92d7e

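FAQ

Do I really need all 10 tools? It looks like a lot.

You need one tool per layer, not all 10 installed. The pack lists alternatives within the same layer (3 deploy targets; metrics can go through Prometheus or through OTel), so pick according to your scale. The minimum viable stack for a solo developer is Vercel CLI + Sentry + Uptime Kuma. Add Prometheus + Grafana + Alertmanager when the second engineer joins. Add Loki + OpenTelemetry Collector once you pass 10 services. Don't install ahead of need.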
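The sizing rule in that answer can be written down as a small lookup. This is only an illustration of the thresholds quoted above; the function name is hypothetical, and treating the tiers as cumulative (a larger setup keeps everything from the smaller ones) is an assumption.

```python
# Tools grouped by the stage at which the answer above suggests adding them.
TIERS = [
    ["Vercel CLI", "Sentry", "Uptime Kuma"],                # solo developer
    ["Prometheus", "Grafana", "Prometheus Alertmanager"],   # second engineer joins
    ["Grafana Loki", "OpenTelemetry Collector"],            # more than 10 services
]


def recommended_stack(engineers: int, services: int) -> list[str]:
    """Return the tools suggested for a given team size and service count."""
    tier = 1
    if engineers >= 2:
        tier = 2
    if services > 10:
        tier = 3
    # Assumption: tiers are cumulative, so tier 3 includes tiers 1 and 2.
    return [tool for group in TIERS[:tier] for tool in group]
```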

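Roughly what does this stack cost per month?

Small team: Vercel is free or $20/month; Sentry is on the free tier (5k events/month) or $26/month; everything else shares one $5-20 VPS running Prometheus + Loki + Grafana + Uptime Kuma + Alertmanager (all light on RAM at this scale). Total: roughly $25-60/month for observability that actually catches outages. Compare Datadog at $15-31 per host per month, where 10 hosts often runs $300+/month.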
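The arithmetic behind these figures can be checked with a short sketch. The prices are the ones quoted above, treated as assumptions that will drift over time. The strict bounds it computes bracket the quoted $25-60 typical range: the floor is the VPS alone, and the ceiling has both paid tiers plus the larger VPS.

```python
# Monthly prices in USD, taken from the answer above (assumptions, not live pricing).
CHEAPEST = {"Vercel": 0, "Sentry": 0, "VPS": 5}
PRICIEST = {"Vercel": 20, "Sentry": 26, "VPS": 20}


def monthly_total(prices: dict[str, int]) -> int:
    """Sum the monthly cost of every line item."""
    return sum(prices.values())


def datadog_range(hosts: int, low: int = 15, high: int = 31) -> tuple[int, int]:
    """Return the (low, high) monthly Datadog bill for a number of hosts."""
    return hosts * low, hosts * high


floor = monthly_total(CHEAPEST)     # all free tiers plus the smallest VPS
ceiling = monthly_total(PRICIEST)   # both paid tiers plus the largest VPS
dd_low, dd_high = datadog_range(10)
print(f"open-source stack: ${floor}-{ceiling}/month")
print(f"Datadog, 10 hosts: ${dd_low}-{dd_high}/month")
```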

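Does this overlap with the LLM Observability pack?

LLM Observability (Langfuse / Phoenix / AgentOps) is the application-semantics layer: prompt traces, token cost, eval scores. This Deploy + Monitor + Observability pack is the infrastructure layer: is the container alive, is HTTP p95 acceptable, did the last release blow up the error rate. You need both. The OpenTelemetry Collector in this pack can receive the LLM traces that Langfuse/Phoenix instrumentation emits and forward them downstream, so on-call sees both layers in the same Grafana dashboard.

Why recommend Kamal over Docker Swarm or Nomad?

Kamal is strongly opinionated and boringly simple, which is exactly what a deploy tool should be. It does only zero-downtime container rollouts plus proxy routing (Traefik in Kamal 1, kamal-proxy in Kamal 2), with no scheduler, no service mesh, and no cathedral of YAML. For 1-10 servers it is the simplest option that just works. Swarm is in maintenance mode; Nomad is good, but its operational cost is high for a small team. Move to k8s once someone on the team does k8s full time.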

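Can I use this with a serverless backend (AWS Lambda / Cloudflare Workers)?

Yes, but the pull model no longer works. On serverless, use the OpenTelemetry SDK in push mode: traces and metrics are pushed over OTLP to the OpenTelemetry Collector. The Collector then remote-writes metrics into Prometheus and ships logs to Loki, so nothing downstream has to change. Uptime Kuma still pings the public URL, the Sentry SDK works normally in the Lambda/Workers runtimes, and Grafana dashboards don't care where the data came from.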
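To make the push path concrete, the sketch below hand-builds one OTLP/HTTP JSON request carrying a single gauge data point. It assumes a Collector with the OTLP HTTP receiver on its default port 4318 and the standard `/v1/metrics` path. In practice the OpenTelemetry SDK produces this payload for you; the function name, host, and metric name here are hypothetical.

```python
import json
import time
import urllib.request


def build_otlp_gauge_request(collector_url, service_name, metric_name, value, now_ns=None):
    """Build (but do not send) an OTLP/HTTP JSON request with one gauge point."""
    timestamp_ns = time.time_ns() if now_ns is None else now_ns
    payload = {
        "resourceMetrics": [{
            # The resource identifies which service emitted the metric.
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeMetrics": [{
                "scope": {"name": "manual-sketch"},
                "metrics": [{
                    "name": metric_name,
                    "gauge": {"dataPoints": [{
                        "asDouble": float(value),
                        # OTLP JSON encodes 64-bit integers as strings.
                        "timeUnixNano": str(timestamp_ns),
                    }]},
                }],
            }],
        }],
    }
    return urllib.request.Request(
        f"{collector_url.rstrip('/')}/v1/metrics",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

A function handler would send this with `urllib.request.urlopen(request, timeout=2)` before returning, since nothing can scrape the process after it exits.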
