What To Measure
Track four numbers every time:
| Signal | Why it matters |
|---|---|
| `tools/list` p95 from production | Proves the hosted MCP route and local reverse proxy are healthy. |
| `tools/list` p95 from the automation host | Shows what external agents may experience from that network. |
| HTTP status mix | Separates latency from auth, routing, or deploy failures. |
| Tool count | Detects a bad deploy that returns 200 with a truncated catalog. |
A practical threshold is p95 < 800ms for the production-hosted probe. If the production probe is 20ms but an overseas laptop sees 2200ms, do not call the MCP server broken. Report the product as healthy and log the client-network penalty separately.
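As a rough sketch, one probe run can be collapsed into these signals as follows. The `Sample` fields and the `summarize` helper are hypothetical names, and the nearest-rank percentile is only one of several common p95 definitions. Run it once per origin (production host, automation host) and compare the two results.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Sample:
    latency_ms: float  # wall-clock time for one tools/list request
    status: int        # HTTP status code returned
    tool_count: int    # tools found in the response, 0 if unparseable


def p95(latencies_ms):
    """Nearest-rank 95th percentile (assumed definition)."""
    if not latencies_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(latencies_ms)
    # ceil(0.95 * n) in integer arithmetic, as a 1-based rank
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize(samples):
    """Collapse one probe run from one origin into the tracked signals."""
    return {
        "p95_ms": p95([s.latency_ms for s in samples]),
        "status_mix": dict(Counter(s.status for s in samples)),
        "tool_counts": sorted({s.tool_count for s in samples}),
    }


if __name__ == "__main__":
    # Illustrative values only, not measurements.
    run = [Sample(20.0, 200, 12)] * 9 + [Sample(35.0, 200, 12)]
    print(summarize(run))
```

With nearest-rank and only 10 samples, p95 is simply the slowest sample, so a single cold first request will show up in the number. Whether to report that first request separately is a judgment call.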
For search and triage, label the incident explicitly as MCP tool calling latency when the slow path affects tools/list, tools/call, or another JSON-RPC tool method. That wording keeps the runbook discoverable when an agent searches for "mcp tool calling latency" instead of "MCP endpoint p95".
Common Failure Modes
- Cold Nuxt process: first hit after restart is slow, later hits are fast. Use at least 10 samples.
- Reverse proxy buffering or TLS path: localhost is fast, public domain is slow from the same server.
- Caller geography: server-side probe is fast, laptop probe is slow. This is a client-network issue, not a route-logic problem.
- JSON-RPC body mismatch: a GET or malformed POST may exercise a different handler.
- Tool catalog bloat: large descriptions can make `tools/list` slow even when routing is fine.
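Several of these failure modes can be told apart with a probe that always sends a well-formed JSON-RPC POST, takes at least 10 samples, and records the response size. The sketch below uses only the Python standard library. The URL is a placeholder, the `Accept` header follows the MCP Streamable HTTP convention, and it assumes the endpoint answers `tools/list` without a prior `initialize` handshake or session header, which may not hold for every server.

```python
import json
import time
import urllib.error
import urllib.request


def build_tools_list_request(url, request_id=1):
    """Build a POST with a valid JSON-RPC 2.0 body so the real handler is exercised."""
    body = json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": "tools/list"}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Accept": "application/json, text/event-stream",
        },
    )


def timed_probe(url, samples=10, timeout_s=10.0):
    """Return one (latency_ms, status, payload_bytes) tuple per request.

    The first tuple includes any cold-start cost; payload_bytes helps
    spot catalog bloat. Network errors other than HTTP errors propagate.
    """
    results = []
    for i in range(samples):
        req = build_tools_list_request(url, request_id=i + 1)
        start = time.perf_counter()
        try:
            with urllib.request.urlopen(req, timeout=timeout_s) as resp:
                payload = resp.read()
                status = resp.status
        except urllib.error.HTTPError as err:
            payload = err.read()
            status = err.code
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        results.append((elapsed_ms, status, len(payload)))
    return results
```

Running `timed_probe` from the server against both the localhost URL and the public domain separates the reverse proxy or TLS path from route logic. Running it from a laptop against the public domain isolates caller geography.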
Decision Rule
- If production-side p95 is under the target and status is 200, keep the MCP service marked healthy.
- If production-side p95 is over target twice in a row, pause growth actions and inspect server logs, PM2 uptime, and route payload size.
- If only external p95 is high, report the geography/network caveat and continue product-quality work.
- If tool count changes unexpectedly, verify the deployed manifest before doing any promotion.
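The rules above can be encoded as a small function, sketched here under stated assumptions. The function name, the action strings, the threshold for external p95, and the "re-probe" fallback for cases the rules do not cover (for example, a single over-target run) are all illustrative, not part of the runbook.

```python
HEALTHY = "keep MCP service marked healthy"
PAUSE_AND_INSPECT = "pause growth actions; inspect server logs, PM2 uptime, route payload size"
NETWORK_CAVEAT = "report geography/network caveat; continue product-quality work"
VERIFY_MANIFEST = "verify deployed manifest before any promotion"
RECHECK = "not covered by the rules: re-probe"  # assumed fallback


def decide(prod_p95_history_ms, all_status_200, external_p95_ms,
           tool_count, expected_tool_count,
           target_ms=800.0, external_target_ms=800.0):
    """Map probe results to actions.

    prod_p95_history_ms: production-side p95 values, oldest first.
    external_p95_ms: p95 from the external host, or None if not measured.
    external_target_ms: assumed; the runbook gives no external threshold.
    """
    actions = []
    if tool_count != expected_tool_count:
        actions.append(VERIFY_MANIFEST)

    last_two = prod_p95_history_ms[-2:]
    latest_ok = bool(prod_p95_history_ms) and prod_p95_history_ms[-1] < target_ms

    if len(last_two) == 2 and all(v > target_ms for v in last_two):
        actions.append(PAUSE_AND_INSPECT)
    elif latest_ok and all_status_200:
        actions.append(HEALTHY)
        if external_p95_ms is not None and external_p95_ms > external_target_ms:
            actions.append(NETWORK_CAVEAT)
    else:
        actions.append(RECHECK)
    return actions


if __name__ == "__main__":
    # The 20 ms production / 2200 ms laptop case from the text.
    print(decide([20.0], True, 2200.0, tool_count=12, expected_tool_count=12))
```

Returning a list rather than a single verdict lets a tool-count mismatch be flagged alongside a latency verdict instead of hiding it.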