Federation runbook
Source: docs/MESH_FEDERATION_RUNBOOK.md
Concise procedure for pairing two or more Hive deployments over the transport documented in MESH_V2_GLOBAL.md. Prerequisites: valid TLS (or explicit HTTP in a lab); Redis for federated rate limits and, when JWT federation runs with MESH_FEDERATION_JWT_REQUIRE_JTI=true (the default), for jti consumption (JWT anti-replay); A2A enabled on both sides.
1. Variables to Align
Also verify: AUTH_URL, A2A_ENABLED=true, outbound network allowed to peer origins (firewall).
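The checks named above can be scripted before pairing. The following is a minimal sketch, not the project's own tooling: it inspects an environment mapping for AUTH_URL and A2A_ENABLED, and it assumes a variable called REDIS_URL for the Redis prerequisite, which is a hypothetical name not taken from this runbook. Outbound network reachability is not covered.

```python
def preflight_issues(env: dict) -> list:
    """Return human-readable problems found in one node's environment mapping."""
    issues = []
    if not env.get("AUTH_URL"):
        issues.append("AUTH_URL is not set")
    if env.get("A2A_ENABLED") != "true":
        issues.append("A2A_ENABLED must be 'true'")
    # REDIS_URL is a hypothetical variable name used only for this sketch.
    if not env.get("REDIS_URL"):
        issues.append("Redis is not configured (REDIS_URL unset)")
    return issues


if __name__ == "__main__":
    import os
    for issue in preflight_issues(dict(os.environ)):
        print("preflight:", issue)
```

Running it on each node before pairing gives a quick list of what still needs aligning.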
2. Verifications (Without UI)
- Indexed directory (session or token, viewer+): GET /api/mesh/federation/directory → peers[].peerIndex, origin, agentCardUrl (list = static + manifest merged, same as the proxy; no HTTP call is made from the directory endpoint).
- Operator probe (cookie or token, operator+): GET /api/mesh/federation/status?probe=1 → probe[] with HTTP / latency to /.well-known/agent-card.json of each peer (origins = static + manifest merged). Optional: GET /api/mesh/federation/status?debug_manifest=1 (operator+) → manifestDebug.manifestLastError: safe token (http_404, fetch_timeout, …) or unknown, no free-text network/DNS data.
- Local agent card: GET /.well-known/agent-card.json on each instance.
- Public mesh descriptor (WAN discovery): GET /.well-known/hive-mesh.json (no auth): JSON hive-mesh-descriptor-v1 with A2A links + federation / wanMesh summary.
- Postgres audit (audit_logs, retention per your policy):
  - mesh.federation.proxy_jsonrpc: each relay via POST /api/mesh/federation/proxy/jsonrpc (operator user if userId known, resourceId = target peer origin, upstream HTTP status, correlationId).
  - mesh.federation.inbound_jsonrpc: each response returned after an authenticated federation POST on /api/a2a/jsonrpc, /api/a2a/jsonrpc/public or /api/a2a/federated/jsonrpc: JSON-RPC method (preview), HTTP status, auth mode, jwtIss (peer origin in Ed25519; constant issuer in HS256), entrypoint (main / public_alias / federated_alias), normalized client IP.
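To run the HTTP checks above against several instances, it can help to generate the full URLs once. This is a small sketch under the assumption that every endpoint is served directly under the instance origin; the helper name and the example origin are illustrative.

```python
# Paths copied from the checklist above, in the same order.
VERIFICATION_PATHS = [
    "/api/mesh/federation/directory",
    "/api/mesh/federation/status?probe=1",
    "/api/mesh/federation/status?debug_manifest=1",
    "/.well-known/agent-card.json",
    "/.well-known/hive-mesh.json",
]


def verification_urls(base_url: str) -> list:
    """Build the verification URLs for one Hive instance origin."""
    base = base_url.rstrip("/")
    return [base + path for path in VERIFICATION_PATHS]


if __name__ == "__main__":
    # hive-a.example is a placeholder origin.
    for url in verification_urls("https://hive-a.example/"):
        print(url)
```

The first three URLs still need a session cookie or token with the role stated in the checklist; the last two do not.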
3. Peer-to-Peer JSON-RPC Test (Lab)
Recommended: go through the operator proxy (second example below). The server mints a short-lived X-Hive-Federation-JWT carrying aud and jti, and sends the legacy secret only if MESH_FEDERATION_PROXY_SEND_SECRET=true.
Direct to B with the legacy secret (hive-federated identity on B's side). Do not send a JWT and the secret on the same request. POST /api/a2a/federated/jsonrpc uses the same handler as POST /api/a2a/jsonrpc; target it when you want a dedicated ingress policy (firewall / docs).
curl -sS -X POST "$HIVE_B/api/a2a/jsonrpc" \
-H "Content-Type: application/json" \
-H "X-Hive-Federation-Secret: $MESH_FEDERATION_SHARED_SECRET" \
-d '{"jsonrpc":"2.0","id":1,"method":"tasks/list","params":{}}'
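For scripted lab tests, the same request can be assembled in Python. The sketch below only builds the headers and body and enforces the rule that a JWT and the legacy secret never travel on the same request; the function name is illustrative, and nothing here mints or validates a JWT.

```python
import json
from typing import Optional


def build_federation_request(method: str, params: dict, *,
                             secret: Optional[str] = None,
                             jwt: Optional[str] = None):
    """Return (headers, body) for a federated JSON-RPC POST.

    Exactly one credential must be supplied, because the JWT and the
    legacy secret must not be sent on the same request.
    """
    if (secret is None) == (jwt is None):
        raise ValueError("provide exactly one of secret or jwt")
    headers = {"Content-Type": "application/json"}
    if secret is not None:
        headers["X-Hive-Federation-Secret"] = secret
    else:
        headers["X-Hive-Federation-JWT"] = jwt
    payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params}
    return headers, json.dumps(payload).encode("utf-8")
```

The returned pair can be passed to any HTTP client, for example `urllib.request.Request(url, data=body, headers=headers, method="POST")`.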
Operator proxy on A (session or operator token, JSON body) — peerIndex = rank of B in A's MESH_FEDERATION_PEERS:
curl -sS -X POST "$HIVE_A/api/mesh/federation/proxy/jsonrpc" \
-H "Content-Type: application/json" \
-H "Cookie: …" \
-d '{"peerIndex":0,"rpc":{"jsonrpc":"2.0","id":1,"method":"tasks/list","params":{}}}'
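Finding the right peerIndex by hand is error-prone once the list grows. The sketch below derives it from a MESH_FEDERATION_PEERS value and wraps an RPC in the proxy envelope used above. It assumes the variable is a comma-separated list of origins and that indexes are zero-based (the example above uses 0); both are assumptions, and the directory endpoint remains the authoritative source because it also includes manifest peers.

```python
def peer_index_of(peers_env: str, origin: str) -> int:
    """Return the zero-based rank of origin in a comma-separated peer list."""
    peers = [p.strip().rstrip("/") for p in peers_env.split(",") if p.strip()]
    # list.index raises ValueError when the origin is not listed.
    return peers.index(origin.rstrip("/"))


def proxy_payload(peer_index: int, method: str, params: dict) -> dict:
    """Wrap a JSON-RPC call in the operator proxy envelope."""
    return {
        "peerIndex": peer_index,
        "rpc": {"jsonrpc": "2.0", "id": 1, "method": method, "params": params},
    }


if __name__ == "__main__":
    # Placeholder origins for illustration only.
    peers = "https://hive-b.example, https://hive-c.example"
    idx = peer_index_of(peers, "https://hive-c.example/")
    print(proxy_payload(idx, "tasks/list", {}))
```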
4. Common Incidents
5. Shared Secret Rotation
- Generate a new value ≥ 32 characters.
- Update simultaneously (short window) on all paired nodes.
- Calls with the old secret fail immediately after switchover — plan automatic retry on the client side if applicable.
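One way to produce a value that satisfies the length rule in the first step is Python's standard secrets module. This is a sketch, not a mandated procedure; any cryptographically secure generator that yields at least 32 characters works equally well.

```python
import secrets


def new_federation_secret(min_chars: int = 32) -> str:
    """Generate a URL-safe random secret of at least min_chars characters."""
    # token_urlsafe(n) encodes n random bytes as roughly 1.3 characters each,
    # so asking for min_chars bytes always yields at least min_chars characters.
    return secrets.token_urlsafe(min_chars)


if __name__ == "__main__":
    print(new_federation_secret())
```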
JWT: rotating MESH_FEDERATION_SHARED_SECRET immediately invalidates any unexpired JWTs signed with the old key. Already-consumed jti entries remain in Redis until their key TTL expires (aligned with the token's exp), so no manual purge is needed during secret rotation. After switchover, only new JWTs (minted by the operator proxy, or externally with the new key) are accepted.
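The reason no purge is needed follows from how consumed jti entries expire. The sketch below models that behaviour with an in-memory dictionary standing in for Redis; the class and method names are illustrative, and the real key layout and Redis commands are not documented here.

```python
class JtiStore:
    """In-memory stand-in for Redis keys whose TTL ends at the token's exp."""

    def __init__(self):
        self._expires_at = {}  # jti -> exp (epoch seconds)

    def consume(self, jti: str, exp: float, now: float) -> bool:
        """Return True the first time a live token's jti is seen, else False."""
        # Drop entries whose TTL has elapsed, as Redis would do on its own.
        self._expires_at = {j: e for j, e in self._expires_at.items() if e > now}
        if exp <= now:
            return False  # token already expired, nothing to record
        if jti in self._expires_at:
            return False  # replay of an already consumed jti
        self._expires_at[jti] = exp
        return True

    def size(self) -> int:
        return len(self._expires_at)
```

Because every entry disappears at its token's exp, entries recorded before a rotation age out on their own while tokens signed with the old key are already rejected at signature verification.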
Clock: verification uses MESH_FEDERATION_JWT_CLOCK_SKEW_LEEWAY_SECONDS (default 60, max 300, 0 = strict) for exp / iat; aim for NTP on VMs if possible.
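The effect of the leeway on exp and iat can be sketched as follows. The comparison rules (a token is accepted up to leeway seconds after exp, and an iat up to leeway seconds in the future is tolerated) follow common JWT practice and are an assumption about this implementation; clamping out-of-range values to 0..300 is likewise an assumption.

```python
MAX_LEEWAY_SECONDS = 300  # documented maximum


def clamp_leeway(seconds: int) -> int:
    """Keep the leeway inside the documented 0..300 range (assumed clamping)."""
    return max(0, min(MAX_LEEWAY_SECONDS, seconds))


def times_valid(iat: float, exp: float, now: float, leeway: int = 60) -> bool:
    """Check exp and iat against the local clock with a symmetric leeway."""
    leeway = clamp_leeway(leeway)
    not_expired = now <= exp + leeway
    not_from_future = iat <= now + leeway
    return not_expired and not_from_future
```

With the default of 60, a token that expired 50 seconds ago by the local clock still passes, while the strict setting of 0 rejects it.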
6. Revoking a Link (Without Redeploying the Entire Mesh)
- Statically listed peer: remove its origin from MESH_FEDERATION_PEERS on the nodes that should no longer trust it, then apply the config (standard process: targeted restart / redeploy).
- Peer from the manifest: remove the entry from payload.peers on the manifest publication side; after at most MESH_FEDERATION_PEERS_MANIFEST_REFRESH_SECONDS (+ Redis / L1 caches), the effective roster updates, with no need to touch other instances for the "whole" product.
- Ed25519: remove the peer's public key from MESH_FEDERATION_PEER_ED25519_PUBLIC_KEYS (same order as origins) to cut off signature verification without changing the rest of the mesh.
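Since the effective roster is the static list merged with the manifest, it helps to see what a revocation changes. The sketch below assumes static peers come first, manifest peers follow, and duplicates are dropped by origin; the actual merge order and deduplication rules are not specified in this runbook.

```python
def effective_roster(static_peers, manifest_peers):
    """Merge static and manifest origins, keeping first occurrences in order."""
    seen, roster = set(), []
    for origin in [*static_peers, *manifest_peers]:
        key = origin.rstrip("/")
        if key not in seen:
            seen.add(key)
            roster.append(key)
    return roster


if __name__ == "__main__":
    # Placeholder origins for illustration only.
    static = ["https://hive-b.example"]
    manifest = ["https://hive-c.example", "https://hive-b.example/"]
    print(effective_roster(static, manifest))
    # Revoking hive-c on the manifest side removes it after the next refresh.
    print(effective_roster(static, ["https://hive-b.example/"]))
```

Under these assumptions, a peer that appears in both sources stays in the roster until it is removed from both, which is worth checking before declaring a link revoked.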
7. Centralized Policy (Optional, Org-Wide)
For fleets that want policy (roster, publish rules, gateway posture) versioned above raw env vars, see MESH_CENTRALIZED_POLICY.md — MVP = GitOps env; later = signed bundles / OPA. Federation pairing steps in this runbook are unchanged.
8. References
- PRODUCTION.md: RBAC / routes matrix.
- A2A_INTEGRATION.md: A2A routes.
- PETIT_GROS_AUDIT.md: federation surface + rate limit.
- MESH_OBSERVABILITY.md: OpenTelemetry (mesh traces / metrics).
- MESH_GATEWAY_WAN.md: WAN reverse-proxy.
- MESH_MTLS.md: inter-instance mTLS (PKI / SPIFFE).
- MESH_CENTRALIZED_POLICY.md: policy distribution + audit (architecture).
9. Design Notes (Audit — Non-Blocking)
- req.clone() + body peek: each JSON-RPC POST clones the body for method preview / public branch before the handler. Negligible impact as long as A2A_JSONRPC_MAX_BODY_BYTES stays in the MB range.
- Public allowlist cache (publicA2aAllowedMethodSet): in-memory singleton per env string, consistent with a process-wide env that is immutable between restarts; a hot-reload of env without a restart could theoretically serve a stale set in dev.
- Postgres fire-and-forget audit (mesh.federation.*): does not block the RPC response; insert failure → error log only. Acceptable for latency; OTel metrics (hive.mesh.*, see MESH_OBSERVABILITY.md) cover RPC latency and Redis rate limit saturation; no dedicated span per audit insert yet.