Mesh observability
Source: docs/MESH_OBSERVABILITY.md
Hive already emits structured logs (mesh.a2a.*, mesh.federation.*). This document describes the optional OpenTelemetry (OTel) layer, which sends traces and metrics to an OTLP collector over HTTP for production diagnostics (latencies, 429s, Redis window saturation).
Activation
Environment variables (see also app/.env.example):
The SDK is started from src/instrumentation.ts (Node runtime only) and is shut down gracefully when the process receives SIGTERM or SIGINT.
Traces
The aliases /api/a2a/federated/jsonrpc and the main JSON-RPC path already go through handleA2AJsonRpcPost, so the same traceparent extraction as for mesh.a2a.jsonrpc applies to them.
Routes wrapped by withHttpServerSpan (non-exhaustive — grep withHttpServerSpan in app/src/app/api and app/src/app/.well-known): health, federation (status, directory), setup, agents (CRUD + start/stop/pause/resume/restart, SSE logs, stats, usage, MCP), models, import (parse + deploy), export, users/tokens/secrets (including [id]), backups (list + [id]/status|restore|download), config (root GET, reload, [key]), stats / network / update / inference / system audit, audit-logs, auth (register, registration status, forgot/reset password), A2A status, public discovery (/.well-known/agent-card.json, /.well-known/hive-mesh.json). Excluded from this helper: NextAuth (auth/[...nextauth]), and the JSON-RPC mounts already instrumented (mesh.a2a.jsonrpc, federation proxy).
Common attributes:
- hive.entrypoint: main | public_alias | federated_alias (physical mount)
- rpc.method: JSON-RPC method
- http.status_code: HTTP response status
- mesh.a2a.outcome: ok | jsonrpc_error | exception | invalid_json
SSE requests: the span ends when the stream closes (duration includes streaming).
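As an illustration of how the common attributes fit together, the sketch below derives mesh.a2a.outcome for a single request. The attribute keys and outcome values come from the list above; the function name classifyRpc, the handler shape, the example method name, and the HTTP status codes chosen for each outcome are assumptions made for this example, not the actual logic in jsonrpc-next.ts.

```typescript
// Attribute keys and allowed values are taken from the list above.
type MeshA2AOutcome = "ok" | "jsonrpc_error" | "exception" | "invalid_json";
type HiveEntrypoint = "main" | "public_alias" | "federated_alias";

interface RpcSpanAttributes {
  "hive.entrypoint": HiveEntrypoint;
  "rpc.method"?: string;
  "http.status_code": number;
  "mesh.a2a.outcome": MeshA2AOutcome;
}

// Hypothetical handler result: a JSON-RPC response carries either result or error.
interface RpcResponse {
  result?: unknown;
  error?: { code: number; message: string };
}

// Hypothetical helper: classify one request into the span attributes.
// The status codes (400 / 200 / 500) are assumptions for illustration.
function classifyRpc(
  entrypoint: HiveEntrypoint,
  rawBody: string,
  handle: (method: string) => RpcResponse,
): RpcSpanAttributes {
  let request: { method?: unknown } | null;
  try {
    request = JSON.parse(rawBody) as { method?: unknown } | null;
  } catch {
    // The body is not valid JSON, so no rpc.method can be recorded.
    return {
      "hive.entrypoint": entrypoint,
      "http.status_code": 400,
      "mesh.a2a.outcome": "invalid_json",
    };
  }
  const method = typeof request?.method === "string" ? request.method : undefined;
  try {
    const response = handle(method ?? "");
    const outcome: MeshA2AOutcome = response.error ? "jsonrpc_error" : "ok";
    return {
      "hive.entrypoint": entrypoint,
      "rpc.method": method,
      "http.status_code": 200,
      "mesh.a2a.outcome": outcome,
    };
  } catch {
    // The handler threw instead of returning a JSON-RPC error object.
    return {
      "hive.entrypoint": entrypoint,
      "rpc.method": method,
      "http.status_code": 500,
      "mesh.a2a.outcome": "exception",
    };
  }
}
```

Keeping the outcome as a closed string union means a dashboard or alert can enumerate every value it has to handle.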
W3C correlation (traceparent)
On POST A2A JSON-RPC (handleA2AJsonRpcPost), Hive extracts the traceparent and tracestate headers (W3C Trace Context) and records the mesh.a2a.jsonrpc span as a child of the upstream trace when traceparent is present. Configure the reverse proxy / API gateway so that it does not strip these headers on the way to the Hive origin.
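To make the header format concrete, here is a minimal sketch of validating a traceparent value in its version-00 layout (version, trace-id, parent-id, flags). It is not Hive's code: extraction in Hive presumably goes through the OTel propagator, and the names parseTraceparent and TraceParent are hypothetical. The sketch only accepts the four-field layout and rejects anything with extra fields.

```typescript
// Hypothetical result type for this example.
interface TraceParent {
  version: string;
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean; // lowest bit of the trace-flags byte
}

// Four dash-separated lowercase-hex fields: 2, 32, 16 and 2 characters.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string | null | undefined): TraceParent | null {
  if (!header) return null;
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // Version ff is forbidden, and all-zero identifiers are invalid.
  if (version === "ff") return null;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return {
    version,
    traceId,
    parentId,
    sampled: (parseInt(flags, 16) & 0x01) === 0x01,
  };
}
```

A check like this can help when debugging a gateway that forwards a malformed or truncated header: if the value does not parse, the mesh.a2a.jsonrpc span will start a new trace rather than join the upstream one.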
The SDK starts with W3C trace context + baggage propagation (CompositePropagator in otel-bootstrap.ts). AsyncLocalStorageContextManager is registered at boot in instrumentation.ts so that the OTel context survives await calls in the JSON-RPC handler (with or without OTLP export).
Hive does not apply the @opentelemetry/instrumentation-http auto-instrumentation across all of Next. Instead, the handlers listed above go through withHttpServerSpan (src/lib/otel-http-route.ts), and outbound federation fetch calls outside the JWT proxy use meshOutboundFetch (src/lib/otel-client-fetch.ts).
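On the outbound side, injecting W3C context amounts to adding a traceparent header built from the current span. The sketch below shows only that header shape. It is not the implementation of meshOutboundFetch, which presumably delegates to the OTel propagation API; SpanContextLike and withTraceparent are hypothetical names.

```typescript
// Hypothetical stand-in for the fields of a span context used here.
interface SpanContextLike {
  traceId: string; // 32 lowercase hex characters
  spanId: string;  // 16 lowercase hex characters
  sampled: boolean;
}

// Return a copy of the headers with a version-00 traceparent added.
// If the identifiers are malformed, the headers are returned unchanged.
function withTraceparent(
  headers: Record<string, string>,
  ctx: SpanContextLike,
): Record<string, string> {
  const validTrace = /^[0-9a-f]{32}$/.test(ctx.traceId);
  const validSpan = /^[0-9a-f]{16}$/.test(ctx.spanId);
  if (!validTrace || !validSpan) return { ...headers };
  const flags = ctx.sampled ? "01" : "00";
  return {
    ...headers,
    traceparent: `00-${ctx.traceId}-${ctx.spanId}-${flags}`,
  };
}
```

The resulting record can be passed as the headers option of a fetch call, so the remote instance can attach its server span to the caller's trace.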
Metrics
Logical prefix hive.mesh.*:
Attribute mesh.rate_limit.tier:
- public_a2a: prefix hive:rl:public_a2a
- federation: prefix hive:rl:federation
- a2a_jsonrpc: prefix hive:rl:a2a (Redis A2A middleware)
- other: other prefixes
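The tier mapping above can be sketched as a prefix lookup. The prefixes and tier names are the ones listed. The function name tierForKey and the assumption that a full key is the prefix followed by ":" and an identifier are hypothetical, and the actual logic in rate-limit.ts may differ.

```typescript
type RateLimitTier = "public_a2a" | "federation" | "a2a_jsonrpc" | "other";

// Prefix-to-tier pairs taken from the list above.
const TIER_BY_PREFIX: ReadonlyArray<readonly [string, RateLimitTier]> = [
  ["hive:rl:public_a2a", "public_a2a"],
  ["hive:rl:federation", "federation"],
  ["hive:rl:a2a", "a2a_jsonrpc"],
];

// Assumption: keys look like "<prefix>:<identifier>" (or the bare prefix).
function tierForKey(key: string): RateLimitTier {
  for (const [prefix, tier] of TIER_BY_PREFIX) {
    if (key === prefix || key.startsWith(prefix + ":")) return tier;
  }
  return "other";
}
```

Matching on the prefix plus a separator, rather than a bare startsWith, keeps a key such as hive:rl:a2a_extra from being counted under a2a_jsonrpc.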
RPC histogram metrics notably carry mesh.a2a.outcome, hive.entrypoint, rpc.method, and mesh.a2a.streaming when relevant.
Typical SLO queries (Prometheus)
- P99 RPC latency: histogram quantile on hive.mesh.a2a.rpc.duration_ms
- P99 latency per HTTP route: no dedicated mesh histogram; use http.server traces + http.route attribute (tracer hive.http), or a span-metrics pipeline; see observability/README.md
- RL block rate: rate(hive_mesh_rate_limit_blocked_total[5m]) by mesh_rate_limit_tier (exact name depends on OTLP → Prometheus exposition)
- Saturation: distribution or threshold on hive.mesh.rate_limit.window_utilization_ratio
Exported names may be normalized by the backend (dots vs underscores); verify in the collector UI.
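The sketch below shows one common normalization (characters outside the Prometheus name alphabet become underscores) and builds the two queries described above as strings. The helper names are hypothetical, and the _bucket suffix and 5m window are conventional assumptions. Some exporters also append unit or _total suffixes, so the exact series names still need to be checked against the backend.

```typescript
// Replace every character outside [a-zA-Z0-9_:] with an underscore.
function toPrometheusName(otelName: string): string {
  return otelName.replace(/[^a-zA-Z0-9_:]/g, "_");
}

// P99 over a classic Prometheus histogram derived from an OTel histogram.
function p99Query(otelHistogram: string, window: string = "5m"): string {
  const base = toPrometheusName(otelHistogram);
  return `histogram_quantile(0.99, sum by (le) (rate(${base}_bucket[${window}])))`;
}

// Rate-limit block rate per tier, using the exposed names quoted above.
function blockRateQuery(window: string = "5m"): string {
  return `sum by (mesh_rate_limit_tier) (rate(hive_mesh_rate_limit_blocked_total[${window}]))`;
}
```

For example, p99Query("hive.mesh.a2a.rpc.duration_ms") yields a query over hive_mesh_a2a_rpc_duration_ms_bucket, which can be pasted into the Prometheus UI to confirm that the series exists under that name.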
Code files
- app/src/lib/otel-bootstrap.ts: SDK loading + OTLP exporters
- app/src/lib/mesh-otel.ts: mesh instruments + span helpers
- app/src/lib/otel-http-route.ts: withHttpServerSpan (tracer hive.http)
- app/src/app/(dashboard)/observability/page.tsx: native UI (Recharts + traces)
- app/src/app/api/observability/prometheus/route.ts: Prometheus proxy
- app/src/app/api/observability/tempo/*: Tempo proxy (TEMPO_OBSERVABILITY_URL)
- app/src/lib/otel-client-fetch.ts: meshOutboundFetch (CLIENT + W3C inject)
- app/src/lib/rate-limit.ts: centralized RL observation
- app/src/lib/a2a/jsonrpc-next.ts: RPC span + histogram
- app/src/lib/mesh-federation-manifest.ts: signed manifest (span mesh.federation.manifest_fetch)
- app/src/lib/mesh-federation-probe.ts: agent card probes (span mesh.federation.probe.agent_card)
- docs/observability/otel-collector-spanmetrics.example.yaml: OTLP collector + spanmetrics (namespace: hive_span) → Prometheus
- docs/observability/otel-collector-docker.local.yaml: same + trace export to Tempo (stack docker compose --profile otel)
- docs/observability/tempo-local.yaml, prometheus-otel-local.yml: local observability stack (Docker otel profile)
- app/scripts/check-api-routes-otel.cjs: CI guardrail (API routes + .well-known must use withHttpServerSpan, excluding exemptions)
- docs/observability/prometheus-rules.hive.yml: Prometheus alerts (see ALERTING.md)
See also
- observability/README.md: stack compose + reference PromQL + native /observability UI
- observability/ALERTING.md: Prometheus alerts
- PRE_PUBLIC_BETA_CHECKLIST.md: pre-opening checklist
- MESH_GATEWAY_WAN.md: dedicated federation reverse proxy
- MESH_MTLS.md: inter-instance mTLS (PKI)
- TECH_VISION.md: long-term observability layer