Production
Source: docs/PRODUCTION.md
This document supplements the code and environment variables. Long-term vision remains in TECH_VISION.md; inference optimization in llm-optimization.md.
Install on a server (step-by-step, French): SERVER_INSTALL.md — Docker Compose prod, DNS, Let’s Encrypt, first admin wizard, migrations.
Operator config guide (env vs dashboard): OPERATOR_CONFIG.md.
1. Prerequisites
- Node 22 (see app/package.json → engines).
- PostgreSQL and Redis accessible from the app.
- Linux for Firecracker (agent microVMs); otherwise paths depend on your deployment matrix.
2. Required Environment Variables
Full reference: app/.env.example.
Optional but Recommended in Production
2.1 Stripe webhooks
- Endpoint: POST /api/webhooks/stripe (full URL: https://<your-public-host>/api/webhooks/stripe).
- Configuration: set STRIPE_WEBHOOK_SECRET in the app env; configure the same URL in the Stripe Dashboard (test mode first).
- Behavior: verifies Stripe-Signature, records an idempotency hint in Redis (stripe:webhook:event:*), applies wallet updates in Postgres (billing_ledger_entries, unique on stripe_event_id, plus user_wallet_balances), writes audit_logs (billing.stripe.webhook), and emits structured billing.stripe.* logs.
- Crediting users: set Stripe metadata on the PaymentIntent / Checkout Session so webhooks can resolve a Hive user:
  - hive_user_id (preferred) or user_id: must be a UUID matching users.id.
  - Alternatively, set users.stripe_customer_id to the Stripe Customer id (cus_…) and use that customer on the payment. Resolution is in app/src/lib/stripe/stripe-wallet-handlers.ts.
- Events handled:
  - payment_intent.succeeded: credit for non-invoiced payments. If the PaymentIntent has an invoice, credit is applied via invoice.paid instead to avoid double-counting.
  - invoice.paid: credit for invoiced charges, including subscriptions, resolved by users.stripe_customer_id.
  - refund.created: debit. The user is resolved via metadata or a prior ledger row with the same payment_intent.
  - checkout.session.completed: sets users.stripe_customer_id when metadata.hive_user_id matches a user and session.customer is a cus_… id.
- Read balance: GET /api/billing/wallet (session or API token) returns { balanceMinor, currency, updatedAt, billingUsageMinorPer1kTokens, stripe: { checkoutEnabled, customerPortalEnabled, subscriptionCheckoutEnabled } } (billingUsageMinorPer1kTokens mirrors BILLING_USAGE_MINOR_PER_1K_TOKENS; 0 means usage debits are off).
- Ledger: GET /api/billing/ledger?limit=20&offset=0 returns paginated entries for the current user (signedAmountMinor for display; includes optional usage_debit lines when inference metering is enabled). Requires migration 0022_billing_ledger_user_created_idx for efficient sorting.
- Inference metering (optional): set BILLING_USAGE_MINOR_PER_1K_TOKENS (integer minor units per 1,000 total tokens). When > 0, the token sync daemon debits the agent owner (agents.created_by) after each inference_usage row, idempotent per row via stripe_event_id = hive_usage:{inference_usage.id}. The debit is skipped if the wallet balance is below the charge (usage is still recorded).
- Checkout: POST /api/billing/stripe/checkout-session. One-time: { "mode": "payment", "amountMinor": 1000, "currency": "usd" } (default mode is payment). Subscription: { "mode": "subscription" } with STRIPE_SUBSCRIPTION_PRICE_ID set, or pass "priceId": "price_…". Returns { url, sessionId, mode }. Requires STRIPE_SECRET_KEY.
- Customer Portal: POST /api/billing/stripe/customer-portal returns { url } when the user has users.stripe_customer_id set (after Checkout). Enable the Customer portal in the Stripe Dashboard (Billing → Customer portal) or the API may return an error.
- If STRIPE_WEBHOOK_SECRET is unset: the webhook route returns 503 stripe_webhooks_not_configured (intentional, so no endpoint is left open by accident).
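The inference metering rule above (rate per 1,000 tokens, per-row idempotency key, skip when the balance is too low) can be illustrated with a small decision function. This is a minimal sketch, not the daemon's actual code: the names UsageRow, DebitDecision, and decideUsageDebit are hypothetical, and rounding partial minor units up is an assumption the source does not state.

```typescript
// Sketch of the per-row usage debit decision. All names are hypothetical;
// the rounding rule (round up) is an assumption, not documented behavior.

interface UsageRow {
  id: string; // inference_usage.id
  totalTokens: number; // total tokens recorded for the row
}

type DebitDecision =
  | { kind: "disabled" }
  | { kind: "skipped_insufficient_balance"; chargeMinor: number }
  | { kind: "debit"; chargeMinor: number; idempotencyKey: string };

function decideUsageDebit(
  row: UsageRow,
  ratePer1kTokensMinor: number, // value of BILLING_USAGE_MINOR_PER_1K_TOKENS
  walletBalanceMinor: number,
): DebitDecision {
  // A rate of 0 (or a non-integer / negative value) turns usage debits off.
  if (!Number.isInteger(ratePer1kTokensMinor) || ratePer1kTokensMinor <= 0) {
    return { kind: "disabled" };
  }
  // Assumption: a fractional charge is rounded up to the next minor unit.
  const chargeMinor = Math.ceil((row.totalTokens * ratePer1kTokensMinor) / 1000);
  // Usage stays recorded either way; only the debit is skipped.
  if (walletBalanceMinor < chargeMinor) {
    return { kind: "skipped_insufficient_balance", chargeMinor };
  }
  // The key is stored in stripe_event_id, so a retry cannot debit the row twice.
  return { kind: "debit", chargeMinor, idempotencyKey: `hive_usage:${row.id}` };
}
```

For example, a row with 2,500 tokens at a rate of 4 minor units per 1,000 tokens yields a charge of 10 minor units, keyed as hive_usage:<row id>.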
3. Health / Load Balancers
- GET /api/health: unauthenticated, for LB/orchestrators.
  - Default: minimal response { "ok": true } if the process responds.
  - With HEALTH_CHECK_DEEP=true: checks the database; does not disclose the failure origin beyond ok: false.
- GET /api/system/health: authenticated (viewer+), internal detail (Docker, Firecracker, Cloud Hypervisor, Postgres, Redis, etc.), for the UI or tooled operations.
  - HTTP 200 when Docker, Postgres, and Redis are healthy. Firecracker and Cloud Hypervisor are reported under services but are optional: missing KVM, binaries, or bridge shows those entries as unhealthy and the top-level status as degraded, not a failed probe.
  - HTTP 503 only if a required dependency (Docker, Postgres, Redis) is down.
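A monitoring client might interpret the two endpoints above along these lines. This is a minimal sketch: isLive, classifySystemHealth, and the Verdict type are hypothetical helpers, and the only response fields read are ok and the top-level status described above. Treating any non-200 status from /api/system/health (including 401) as "down" is an assumption.

```typescript
// Hypothetical client-side helpers for interpreting the health endpoints.

type Verdict = "healthy" | "degraded" | "down";

// GET /api/health: the process is considered live only when it answers
// HTTP 200 with { "ok": true }. With HEALTH_CHECK_DEEP=true a database
// failure surfaces as ok: false, which is treated as not live here.
function isLive(httpStatus: number, body: unknown): boolean {
  return (
    httpStatus === 200 &&
    typeof body === "object" &&
    body !== null &&
    (body as { ok?: unknown }).ok === true
  );
}

// GET /api/system/health: 503 means a required dependency is down;
// 200 with status "degraded" means only optional services are unhealthy.
function classifySystemHealth(httpStatus: number, body: unknown): Verdict {
  if (httpStatus !== 200) return "down"; // assumption: any non-200 counts as failure
  const status =
    typeof body === "object" && body !== null
      ? (body as { status?: unknown }).status
      : undefined;
  return status === "degraded" ? "degraded" : "healthy";
}
```

A degraded verdict is best surfaced as a warning rather than a page, since the source treats Firecracker and Cloud Hypervisor as optional.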
3.1 Registry and mesh (operational dependency)
Planetary / WAN features (registry, gateway, federation) depend on your deployment of those services and on DNS/TLS; there is no single vendor SLA. Treat the registry (and any shared catalog) as a dependency you operate or federate: document who runs it, backups, and failure modes in your runbook. Complete assembly guide (Helm order, secrets, checks): MESH_WAN_COMPLETE_DEPLOYMENT.md. See also MESH_PLANETARY_V1_ADOPTION.md and MESH_FEDERATION_RUNBOOK.md. Staged enablement / kill switches: MESH_ROLLOUT_PLAYBOOK.md.
4. TLS and Reverse Proxy
- Terminate TLS in front of Next (Traefik, nginx, Caddy, cloud LB).
- The middleware sends HSTS; it is only effective behind HTTPS.
- The app/docker-compose.yml file configures Traefik without the insecure dashboard/API: do not re-enable --api.insecure=true in production.
4.1 Client IP, mesh, and public JSON-RPC
Do not expose the Next process directly to the Internet if you rely on per-IP controls: public A2A rate limits, MESH_FEDERATION_INBOUND_ALLOWLIST, or reputation counters. Untrusted clients could spoof X-Forwarded-For or X-Client-Ip unless your edge replaces those headers.
- Set HIVE_CLIENT_IP_SOURCE in app (see app/.env.example):
  - real_ip: only trust X-Real-IP (configure nginx/Traefik to set it from the TCP client and strip inbound values).
  - xff_first / xff_last: derive the key from X-Forwarded-For when your proxy appends and does not trust client-supplied chains (see MESH_PLANETARY_DEV_STACK.md § gateway X-Forwarded-For).
  - auto (default): uses x-client-ip from Hive middleware, then first XFF hop, then validated X-Real-IP.
- Redis (rate limits, federation jti, bus): use TLS (rediss://) and ACLs in production; see MESH_V1_REDIS_BUS.md.
- Mesh bus integrity: set MESH_BUS_HMAC_SECRET (≥ 32 characters) so subscribers can verify meshSig on pub/sub payloads.
- Multi-replica WAN gateway (services/gateway): set GATEWAY_RATE_LIMIT_REDIS_URL to the same Redis as Hive so per-IP JSON-RPC limits are shared across pods (Helm: gateway.rateLimitRedisUrl).
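The HIVE_CLIENT_IP_SOURCE modes listed above can be pictured as a single header-resolution function. This is a minimal sketch under stated assumptions, not Hive's implementation: resolveClientIp, HeaderMap, and the helper names are hypothetical; xff_first is taken to mean the leftmost X-Forwarded-For entry and xff_last the rightmost; and every candidate is validated as an IP literal, although the source only says X-Real-IP is validated in auto mode.

```typescript
import { isIP } from "node:net";

// Hypothetical types: header names are assumed to be lower-cased already.
type ClientIpSource = "real_ip" | "xff_first" | "xff_last" | "auto";
type HeaderMap = Record<string, string | undefined>;

// Split X-Forwarded-For into trimmed, non-empty hops.
function xffHops(headers: HeaderMap): string[] {
  return (headers["x-forwarded-for"] ?? "")
    .split(",")
    .map((hop) => hop.trim())
    .filter((hop) => hop.length > 0);
}

// Accept only syntactically valid IPv4/IPv6 literals.
function validIp(value: string | undefined): string | undefined {
  const trimmed = value?.trim();
  return trimmed && isIP(trimmed) !== 0 ? trimmed : undefined;
}

function resolveClientIp(headers: HeaderMap, source: ClientIpSource): string | undefined {
  const hops = xffHops(headers);
  switch (source) {
    case "real_ip":
      return validIp(headers["x-real-ip"]);
    case "xff_first":
      return validIp(hops[0]); // assumption: leftmost entry
    case "xff_last":
      return validIp(hops[hops.length - 1]); // assumption: rightmost entry
    case "auto":
      // Order from the text: x-client-ip, then first XFF hop, then X-Real-IP.
      return (
        validIp(headers["x-client-ip"]) ??
        validIp(hops[0]) ??
        validIp(headers["x-real-ip"])
      );
  }
}
```

None of these modes is safe unless the edge proxy overwrites the headers; the function only decides which header to read, not whether it can be trusted.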
4.2 Outbound fetches (SSRF guard) and workflows
Server-side HTTP(S) from the app is gated so RFC1918 / loopback / metadata targets are blocked unless the URL’s host is allowlisted. This applies to agent import (manifest / agent card), system update checks, marketplace agent-card resolution, inference budget webhooks, mesh WAN delivery webhooks, and workflow HTTP steps (redirects are capped and unsafe redirects fail).
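The gating rule in the paragraph above can be sketched as a pure URL check. This is an illustration only, not the app's guard: isEgressAllowed and its helpers are hypothetical names, the blocked ranges are limited to RFC1918, loopback, and the 169.254.0.0/16 link-local range (which contains the common cloud metadata address), and hostnames are not resolved, so a DNS name that points at a private address would pass this check. A production guard also needs to validate resolved addresses and each redirect hop.

```typescript
// Hypothetical sketch of an egress check on IP-literal hosts.

type Ipv4 = [number, number, number, number];

function parseIpv4(host: string): Ipv4 | null {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!m) return null;
  const octets: Ipv4 = [Number(m[1]), Number(m[2]), Number(m[3]), Number(m[4])];
  return octets.every((o) => o <= 255) ? octets : null;
}

function isBlockedIpv4([a, b]: Ipv4): boolean {
  return (
    a === 10 || // 10.0.0.0/8
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12
    (a === 192 && b === 168) || // 192.168.0.0/16
    a === 127 || // loopback
    (a === 169 && b === 254) // link-local, includes 169.254.169.254
  );
}

function isEgressAllowed(rawUrl: string, allowlist: ReadonlySet<string>): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // unparsable URLs are rejected
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  const host = url.hostname.toLowerCase();
  // An allowlisted host bypasses the private-range checks.
  if (allowlist.has(host)) return true;
  if (host === "localhost" || host.endsWith(".localhost")) return false;
  if (host === "[::1]") return false; // IPv6 loopback literal
  const v4 = parseIpv4(host);
  return v4 === null ? true : !isBlockedIpv4(v4);
}
```

With an empty allowlist, http://10.0.0.5/hook is rejected; adding 10.0.0.5 to the allowlist admits it, which mirrors how a mesh WAN webhook on a private host would need an allowlist entry.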
Mesh WAN worker: if you run MESH_WAN_REDIS_WORKER_MODE=webhook, MESH_WAN_REDIS_WORKER_WEBHOOK_URL must point at either a public URL or a host covered by HIVE_EGRESS_FETCH_HOST_ALLOWLIST.
Compose: docker/docker-compose.prod.yml passes these through from your shell / docker/.env. Full comments: app/.env.example.
In-app (admin): Settings → Security stores an additional egress host allowlist (merged with env) and optional workflow code node overrides (inherit / force off / allow). Values apply within ~15s (in-process cache); run DB migrations so instance_ui_settings has the new columns.
5. Bootstrap (First Admin)
- Database migrated: the hive-app Docker image runs bundled migrations on container start when DATABASE_URL is set (skip with HIVE_SKIP_MIGRATE=1). Immediately after, it rebuilds the Postgres marketplace index (marketplace_catalog_rows) via marketplace-index.cjs (skip on secondary replicas with HIVE_SKIP_MARKETPLACE_INDEX=1). Compose defaults to MARKETPLACE_CATALOG_SOURCE=db so list/search use that index. Outside Docker, use npm run db:migrate:run then npm run marketplace:index-sync. Prefer migrations over ad-hoc drizzle-kit push in production so fresh installs and CI stay reproducible.
- If HIVE_SETUP_TOKEN is set: provide the token to the /setup wizard ("Setup token" field) or as a header for an API call.
- POST /api/setup is rate-limited (Redis); in prod, setting a strong token limits abuse before admin creation.
6. User Accounts
- Public registration: /api/auth/register + /auth/register page if ALLOW_PUBLIC_REGISTRATION=true (default).
- Invite-only: ALLOW_PUBLIC_REGISTRATION=false; account creation by a logged-in admin ("Invite user" modal → POST /api/auth/register with role).
- API: Authorization: Bearer <token> header with tokens stored hashed (API tokens table).
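Storing API tokens hashed means the server compares a digest of the presented token with the stored digest, never the raw token. The sketch below shows that general pattern only; the hash algorithm (SHA-256), the hex encoding, and the helper names extractBearerToken, hashToken, and tokenMatches are all assumptions, since the source does not say how Hive hashes tokens.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Pull the token out of an "Authorization: Bearer <token>" header value.
function extractBearerToken(authorization: string | undefined): string | null {
  const m = /^Bearer\s+(\S+)$/i.exec(authorization ?? "");
  return m?.[1] ?? null;
}

// Assumption: tokens are hashed with SHA-256 and stored as lowercase hex.
function hashToken(token: string): string {
  return createHash("sha256").update(token, "utf8").digest("hex");
}

// Compare digests in constant time to avoid leaking how many bytes matched.
function tokenMatches(presentedToken: string, storedHashHex: string): boolean {
  const presented = Buffer.from(hashToken(presentedToken), "hex");
  const stored = Buffer.from(storedHashHex, "hex");
  return presented.length === stored.length && timingSafeEqual(presented, stored);
}
```

In practice the stored digest would be looked up in the API tokens table by the hash itself, so the raw token never needs to be persisted.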
7. Key Rotation (Summary)
Procedural details: RUNBOOK.md.
8. Backups
- Backup API under app/src/app/api/backups/; BACKUP_DIR directory.
- Archive-side encryption option (ENCRYPTION_KEY key).
- Include Postgres (dump), Hive backup files, and secrets (vault / secrets manager) in your RPO/RTO policy.
9. Observability
- Structured logs via LOG_LEVEL (see app/src/lib/logger.ts).
9A. OpenTelemetry (OTLP, Optional)
In practice: to enable OTel in prod, set at minimum OTEL_EXPORTER_OTLP_ENDPOINT (base URL of the OTLP/HTTP collector, e.g. http://alloy:4318) and ideally OTEL_SERVICE_NAME (application default: hive). Without an endpoint, the SDK is not loaded — behavior identical to before: structured logs only, no OTel overhead.
- Explicit disable: OTEL_SDK_DISABLED=true.
- Details on mesh spans/metrics (hive.mesh.*): MESH_OBSERVABILITY.md; commented variables in app/.env.example.
- docker/docker-compose.prod.yml accepts OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SERVICE_NAME, OTEL_METRIC_EXPORT_INTERVAL_MS (empty = no export, same as local without variables).
- Network: never publicly expose OTLP ports / Prometheus scrape / Tempo API; use a collector on an internal network (same docker network, VPC, private mesh).
- Operator UI: Hive /observability (admin) when PROMETHEUS_OBSERVABILITY_URL / TEMPO_OBSERVABILITY_URL are set; plus Prometheus :9090 for ad-hoc queries and Alerts. See observability/README.md.
- Before public beta: checklist PRE_PUBLIC_BETA_CHECKLIST.md.
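The enablement rules in this section (no endpoint means no SDK, OTEL_SDK_DISABLED=true forces it off, service name defaults to hive) reduce to a small decision over the environment. The sketch below is illustrative: resolveOtelSettings and OtelSettings are hypothetical names, and treating whitespace-only values as unset is an assumption.

```typescript
// Hypothetical helper: decide whether the OTel SDK should be loaded.

interface OtelSettings {
  enabled: boolean;
  endpoint: string | null;
  serviceName: string;
}

function resolveOtelSettings(env: Record<string, string | undefined>): OtelSettings {
  // Application default from the text: "hive".
  const serviceName = env["OTEL_SERVICE_NAME"]?.trim() || "hive";
  // Empty or missing endpoint means the SDK is not loaded at all.
  const endpoint = env["OTEL_EXPORTER_OTLP_ENDPOINT"]?.trim() || null;
  // Explicit kill switch wins even when an endpoint is configured.
  const explicitlyDisabled =
    env["OTEL_SDK_DISABLED"]?.trim().toLowerCase() === "true";
  return {
    enabled: endpoint !== null && !explicitlyDisabled,
    endpoint,
    serviceName,
  };
}
```

Called with process.env at startup, a result with enabled set to false would mean structured logs only, matching the behavior described above.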
9B. Observability stack in prod (Docker observability profile)
The docker/docker-compose.prod.yml file defines an observability profile: Tempo, Prometheus, OpenTelemetry Collector — not mixed into the Hive app image. Grafana is not included; use Prometheus UI, Alertmanager (see observability/ALERTING.md), and Hive /observability.
cd docker
docker compose -f docker-compose.prod.yml --profile observability up -d
- Internal URLs (same Docker network): Prometheus http://prometheus:9090, Tempo http://tempo:3200, OTLP HTTP collector http://otel-collector:4318. Do not expose scrape or Tempo ports on the public Internet without authentication.
- To receive traces from Hive: OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318 on hive-app.
9C. Native observability (Hive UI)
- PROMETHEUS_OBSERVABILITY_URL: Prometheus URL reachable from the hive-app container (e.g. http://prometheus:9090 with the observability profile).
- TEMPO_OBSERVABILITY_URL: Tempo URL (e.g. http://tempo:3200) for trace listing and JSON detail; service.name filter = OTEL_SERVICE_NAME (environment variable of the Hive process, application default hive in the OTel SDK).
- /observability UI (admin role): predefined PromQL charts; traces via Tempo API without arbitrary client-side TraceQL.
10. A2A SDK
The packages/a2a-sdk package is documented in packages/a2a-sdk/docs/ARCHITECTURE.md. Next integration is described in A2A_INTEGRATION.md; multi-worker operations in A2A_OPS_AUDIT.md.
Variables: app/.env.example (prefixes A2A_*, MESH_*, MESH_FEDERATION_*).
11. API / RBAC Matrix (Summary)
Routes under /api/* generally call authorize("viewer" | "operator" | "admin"); common exceptions:
- GET /api/health: public.
- GET /api/auth/registration-status, GET /api/setup/status: public (rate-limited when applicable).
- POST /api/setup: public until first admin, then rejected.
- POST /api/auth/* (login, register, reset…): dedicated auth flows.
- GET /api/a2a/status: viewer+ (A2A config summary for the dashboard).
- GET /api/mesh/federation/status: viewer+; with ?probe=1, operator+ (probes /.well-known/agent-card.json on MESH_FEDERATION_PEERS origins).
- GET /api/mesh/federation/directory: viewer+; indexed JSON of peers (origins + agent card URL) without a server-side network call (derived from env).
- POST /api/mesh/federation/proxy/jsonrpc: operator+; relays JSON-RPC to a peer (peerIndex + MESH_FEDERATION_PEERS) with X-Hive-Federation-JWT (minted on the fly) and, if enabled, X-Hive-Federation-Secret; requires MESH_FEDERATION_SHARED_SECRET (≥32 chars, identical on paired nodes).
- POST /api/mesh/wan/ingress: operator+ (session, API token, or HIVE_INTERNAL_TOKEN); ingests a WAN envelope and publishes to the Redis bus; no Internet exposure.
- POST /api/a2a/jsonrpc: see A2A matrix; if X-Hive-Federation-JWT or X-Hive-Federation-Secret is present (only one of the two), federated auth applies (no Bearer fallback on the same request). If MESH_FEDERATION_INBOUND_ALLOWLIST is set, the client IP must match the list.
For per-file detail, grep authorize( in app/src/app/api.
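The "viewer+" / "operator+" notation used in the matrix implies an ordered role hierarchy, and the /api/a2a/jsonrpc entry implies a header-based choice between federated and standard auth. The sketch below illustrates both ideas with hypothetical helpers (hasAtLeast, requiredRoleForFederationStatus, selectJsonRpcAuth); it does not reproduce the real authorize() function, and rejecting a request that carries both federation headers is an assumption drawn from the "only one of the two" wording.

```typescript
// Hypothetical role ordering implied by "viewer+" and "operator+".
type Role = "viewer" | "operator" | "admin";
const ROLE_RANK: Record<Role, number> = { viewer: 0, operator: 1, admin: 2 };

function hasAtLeast(actual: Role, required: Role): boolean {
  return ROLE_RANK[actual] >= ROLE_RANK[required];
}

// GET /api/mesh/federation/status: ?probe=1 raises the requirement to operator.
function requiredRoleForFederationStatus(query: URLSearchParams): Role {
  return query.get("probe") === "1" ? "operator" : "viewer";
}

// Header names are assumed to be lower-cased already.
type HeaderMap = Record<string, string | undefined>;
type JsonRpcAuth = "federation_jwt" | "federation_secret" | "standard" | "reject";

// POST /api/a2a/jsonrpc: a federation header switches to federated auth
// with no Bearer fallback on the same request.
function selectJsonRpcAuth(headers: HeaderMap): JsonRpcAuth {
  const hasJwt = Boolean(headers["x-hive-federation-jwt"]);
  const hasSecret = Boolean(headers["x-hive-federation-secret"]);
  if (hasJwt && hasSecret) return "reject"; // assumption: both at once is invalid
  if (hasJwt) return "federation_jwt";
  if (hasSecret) return "federation_secret";
  return "standard"; // falls through to the regular A2A auth matrix
}
```

Keeping the rank table in one place makes it easy to audit that no route accepts a weaker role than the matrix states.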