LLM optimization
Source: docs/llm-optimization.md
Covers the technical architecture, implementation, Redis schema, API, security hardening, and the async proxy. Goal: run 3-5x more agents on the same hardware.
This document applies to the Docker deployment: the “host inference service” (Ollama/vLLM) runs alongside Hive, and agent runtimes (Docker / microVM) connect to it.
Some sections mention appliance/systemd paths (e.g. hive-redis.service, hive.env.template, first-boot.sh). For a Docker-only install, treat those as implementation examples and configure the equivalent via Compose/Kubernetes env + container args instead.
Table of Contents
- Overview
- Phase 1: Agent Sleeping (VM Pause/Resume)
- Phase 2: Smart Proxy (Async + Redis)
- Phase 3: Quantization Routing
- Phase 4: Prompt Caching / Pre-warming
- Phase 5: vLLM Integration
- Phase 6: Observability
- Full Redis Schema
- API Reference
- DB Schema (New Tables/Columns)
- Security & Hardening
- Configuration
- Modified/Created Files
- Startup & Lifecycle
1. Overview
Architecture
Agent VM (Firecracker)
└─ socat: localhost:11434 → vsock CID 2:11434
└─ hive-vsock-proxy (asyncio, vsock :11434)
├─ Endpoint allowlist
├─ Rate limiting (120 req/60s/VM)
├─ Tier concurrency (low:2, medium:5, high:10)
├─ Auto-resume (agent paused → wake + forward)
├─ Activity tracking (Redis)
├─ Token counting (NDJSON parse)
└─ Forward → Ollama/vLLM (127.0.0.1:11434)
Gains per Phase
2. Phase 1: Agent Sleeping
Concept
Firecracker natively supports PATCH /vm with {"state":"Paused"} / {"state":"Resumed"}: the vCPUs are frozen while RAM stays resident, and resume takes ~125ms.
An agent idle for 5min (configurable) is auto-paused. When an inference request arrives, the proxy wakes it up automatically.
Firecracker API
// app/src/lib/firecracker.ts
export async function pauseVM(vmId: string): Promise<void> {
assertSafeId(vmId, "vmId");
const socketPath = path.join(JAILER_CHROOT_BASE, "firecracker", vmId, "root", "firecracker.sock");
await firecrackerAPI(socketPath, "PATCH", "/vm", { state: "Paused" });
}
export async function resumeVM(vmId: string): Promise<void> {
assertSafeId(vmId, "vmId");
const socketPath = path.join(JAILER_CHROOT_BASE, "firecracker", vmId, "root", "firecracker.sock");
await firecrackerAPI(socketPath, "PATCH", "/vm", { state: "Resumed" });
}
Runtime Abstraction
// app/src/lib/runtime.ts
export const pauseInstance = pauseVM;
export const resumeInstance = resumeVM;
API Routes
POST /api/agents/{id}/pause (operator)
- Verify status === "running"
- pauseInstance(agent.instanceId)
- DB: status → "paused"
- Redis: SET hive:agent:paused:{id} 1 EX 86400
- Publish events (agent.paused)
- Audit log
POST /api/agents/{id}/resume (operator)
- Verify status === "paused"
- resumeInstance(agent.instanceId)
- DB: status → "running"
- Redis: DEL hive:agent:paused:{id}
- Publish events (agent.resumed)
- Audit log
Idle Detector
// app/src/lib/idle-detector.ts
export function startIdleDetector(): void // Called at startup
export function stopIdleDetector(): void // Called at shutdown
// Constants
CHECK_INTERVAL_MS = 30_000 // Checks every 30s
DEFAULT_IDLE_THRESHOLD_S = 300 // 5 minutes without activity → pause
Logic:
- Query DB: all agents with status = "running"
- For each agent: read hive:agent:activity:{id} from Redis
- If no activity record → grace period (set timestamp, skip)
- If idle > threshold → pauseVM() + update DB + Redis flag + events
Env config: AUTO_SLEEP_ENABLED=true|false, AUTO_SLEEP_IDLE_SECONDS=300
Auto-Resume (Proxy Side)
When the proxy receives a request from a paused agent:
- Read hive:agent:paused:{agentId} from Redis
- POST {HIVE_API_URL}/api/agents/{id}/resume with Bearer HIVE_INTERNAL_TOKEN
- Wait 150ms (Firecracker wake time)
- Forward the request normally
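The steps above as an asyncio sketch (ensure_awake is a hypothetical name; it assumes a redis.asyncio client r and uses stdlib urllib off the event loop, as the real proxy does for sync HTTP):

```python
import asyncio
import urllib.request

RESUME_WAIT_S = 0.150  # Firecracker wake time

async def ensure_awake(r, agent_id: str, api_url: str, token: str) -> bool:
    """Resume the agent's VM if it is flagged paused; return True if a
    resume was triggered, False if the agent was already awake."""
    if not await r.get(f"hive:agent:paused:{agent_id}"):
        return False  # not paused → forward immediately
    req = urllib.request.Request(
        f"{api_url}/api/agents/{agent_id}/resume",
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
        data=b"{}",
    )
    # Sync HTTP moved off the event loop (asyncio.to_thread)
    await asyncio.to_thread(urllib.request.urlopen, req)
    await asyncio.sleep(RESUME_WAIT_S)  # give Firecracker time to wake
    return True
```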
Frontend
3 button states on the agent detail page:
- running → Pause (yellow) + Stop (red)
- paused → Resume (green) + Stop (red)
- stopped/created/error → Start (green)
Paused badge: yellow dot #EAB308, bg #EAB3081A.
CID → agentId Mapping
The proxy sees CIDs (vsock integers), not UUIDs. The mapping is done via Redis:
// At agent start:
const vmMeta = await getVMMetadata(agent.instanceId);
await r.set(`hive:vm:cid:${vmMeta.vsockCID}`, agentId, "EX", 86400);
// At stop:
const vmMeta = await getVMMetadata(agent.instanceId);
await r.del(`hive:vm:cid:${vmMeta.vsockCID}`);
3. Phase 2: Smart Proxy
Async Architecture
The proxy is rewritten in pure asyncio (zero threads). Handles 200+ concurrent connections on a single event loop.
# os/scripts/hive-vsock-proxy.py
import asyncio
import socket
from socket import AF_VSOCK, SOCK_STREAM, VMADDR_CID_ANY

async def main():
    await init_redis()  # redis.asyncio
    server = socket.socket(AF_VSOCK, SOCK_STREAM)
    server.bind((VMADDR_CID_ANY, VSOCK_PORT))
    server.listen(256)
    server.setblocking(False)
    loop = asyncio.get_running_loop()
    while not shutdown_event.is_set():
        # vsock peer addresses are (cid, port); the CID identifies the VM
        client, (peer_cid, _peer_port) = await loop.sock_accept(server)
        asyncio.create_task(handle_connection(client, peer_cid))
Connection Handler Flow
1. loop.sock_recv() — Read HTTP headers
2. Parse request line (method, path)
3. Endpoint allowlist check → 403
4. Rate limit check → 429
5. Body size check → 413
6. Read remaining body
7. Parse JSON body (model, prompt)
8. Auto-resume if paused
9. Track activity in Redis
10. Tier concurrency check → 503
11. asyncio.open_connection() → upstream
12. Stream response + capture for token counting
13. Count tokens (NDJSON parse)
14. Track tokens in Redis
15. Release tier slot
Allowed Endpoints
OLLAMA_ENDPOINTS = {
("POST", "/api/generate"),
("POST", "/api/chat"),
("POST", "/api/embeddings"),
("POST", "/api/embed"),
("GET", "/api/tags"),
("POST", "/api/show"),
("GET", "/api/version"),
("GET", "/"),
}
VLLM_ENDPOINTS = {
("POST", "/v1/chat/completions"),
("POST", "/v1/completions"),
("POST", "/v1/embeddings"),
("GET", "/v1/models"),
}
Everything else is blocked (no DELETE /api/delete, no POST /api/pull, etc.).
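The check itself reduces to parsing the request line and a set-membership test; a minimal sketch using the Ollama allowlist above (function names are illustrative):

```python
OLLAMA_ENDPOINTS = {
    ("POST", "/api/generate"), ("POST", "/api/chat"),
    ("POST", "/api/embeddings"), ("POST", "/api/embed"),
    ("GET", "/api/tags"), ("POST", "/api/show"),
    ("GET", "/api/version"), ("GET", "/"),
}

def parse_request_line(head: bytes) -> tuple[str, str]:
    """Extract (method, path) from raw HTTP bytes; the query string is
    stripped so "/api/tags?verbose=1" still matches the allowlist."""
    line = head.split(b"\r\n", 1)[0].decode("latin-1")
    method, target, _version = line.split(" ", 2)
    return method, target.split("?", 1)[0]

def is_allowed(method: str, path: str, endpoints=OLLAMA_ENDPOINTS) -> bool:
    return (method, path) in endpoints
```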
Rate Limiting
Sliding window, 120 requests / 60 seconds per CID. In-memory (not Redis) for zero latency.
Tier Concurrency
TIER_CONCURRENCY = {
"low": 2, # Max 2 simultaneous inference requests
"medium": 5,
"high": 10,
}
The tier is read from Redis: hive:agent:tier:{cid}.
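For the same single-event-loop reason, the tier gate can be a plain counter; a sketch (TierGate is a hypothetical name, and rejecting instead of queueing matches the proxy's 503 behavior):

```python
TIER_CONCURRENCY = {"low": 2, "medium": 5, "high": 10}

class TierGate:
    """Per-tier in-flight request counter; over-limit requests are
    rejected immediately (→ 503) rather than queued."""

    def __init__(self, limits: dict[str, int]):
        self._limits = dict(limits)
        self._in_flight = {tier: 0 for tier in limits}

    def _tier(self, tier: str) -> str:
        # Unknown tiers fall back to "medium" (an assumption)
        return tier if tier in self._limits else "medium"

    def try_acquire(self, tier: str) -> bool:
        tier = self._tier(tier)
        if self._in_flight[tier] >= self._limits[tier]:
            return False
        self._in_flight[tier] += 1
        return True

    def release(self, tier: str) -> None:
        tier = self._tier(tier)
        self._in_flight[tier] = max(0, self._in_flight[tier] - 1)
```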
Token Counting
Parses the NDJSON stream from Ollama to extract prompt_eval_count and eval_count:
def count_tokens_in_response(response_data: bytes) -> tuple[int, int]:
    tokens_in = tokens_out = 0
    for line in response_data.split(b"\n"):
        if not line.strip():
            continue  # skip blank lines between NDJSON records
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate partial/non-JSON chunks
        tokens_in += obj.get("prompt_eval_count", 0)
        tokens_out += obj.get("eval_count", 0)
    return tokens_in, tokens_out
Stored in Redis hash: HINCRBY hive:agent:tokens:{agentId} input/output.
Token Sync Daemon
// app/src/lib/token-sync.ts
// Every 60s: Redis → PostgreSQL
async function syncTokens(): Promise<void> {
  const keys = await scanKeys("hive:agent:tokens:*"); // SCAN, not KEYS
  for (const key of keys) {
    const agentId = key.split(":").pop()!; // hive:agent:tokens:{agentId}
    const counters = await r.hgetall(key);
    await r.del(key); // Atomic reset
    const tokensIn = Number(counters.input ?? 0);
    const tokensOut = Number(counters.output ?? 0);
    await db.insert(inferenceUsage).values({ agentId, model, tokensIn, tokensOut });
    await db
      .update(agents)
      .set({
        totalTokensIn: sql`COALESCE(total_tokens_in, 0) + ${tokensIn}`,
        totalTokensOut: sql`COALESCE(total_tokens_out, 0) + ${tokensOut}`,
      })
      .where(eq(agents.id, agentId));
  }
}
4. Phase 3: Quantization Routing
Concept
A "low" tier agent does not need FP16. It is automatically routed to Q4_0 (2GB VRAM).
// app/src/lib/model-router.ts
const TIER_QUANT_MAP: Record<string, string[]> = {
low: ["q4_0", "q4_K_M", "q4_K_S"], // 2GB VRAM
medium: ["q8_0", "q5_K_M", "q5_K_S"], // 4-5GB VRAM
high: ["f16", "q8_0"], // 8GB+ VRAM
};
Functions
export function resolveModel(
requestedModel: string, // "llama3.2"
tier: string, // "low"
availableModels: string[] // ["llama3.2:q4_0", "llama3.2:q8_0", ...]
): string // → "llama3.2:q4_0"
export async function getAvailableModels(): Promise<string[]>
// Redis cache: hive:models:available (TTL 300s)
export async function refreshAvailableModels(): Promise<string[]>
// Fetch /api/tags, cache result
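A Python rendering of the same routing rule (the real resolveModel is TypeScript; splitting the base name on ":" is an assumption about Ollama-style model tags):

```python
TIER_QUANT_MAP = {
    "low": ["q4_0", "q4_K_M", "q4_K_S"],      # ~2GB VRAM
    "medium": ["q8_0", "q5_K_M", "q5_K_S"],   # 4-5GB VRAM
    "high": ["f16", "q8_0"],                  # 8GB+ VRAM
}

def resolve_model(requested: str, tier: str, available: list[str]) -> str:
    """Try each preferred quantization of the requested base model in
    order; fall back to the requested name if none is available."""
    base = requested.split(":", 1)[0]
    for quant in TIER_QUANT_MAP.get(tier, TIER_QUANT_MAP["medium"]):
        candidate = f"{base}:{quant}"
        if candidate in available:
            return candidate
    return requested
```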
5. Phase 4: Prompt Caching / Pre-warming
Prompt Cache
Agents with the same system prompt share the same prefix KV cache in vLLM.
// app/src/lib/prompt-cache.ts
export function hashPrompt(systemPrompt: string): string
// SHA-256, truncated to 16 hex chars
export async function registerAgentPrompt(agentId: string, systemPrompt: string): Promise<string>
// Redis: agent → hash, hash → prompt, agents set
export async function unregisterAgentPrompt(agentId: string): Promise<void>
// Cleanup on stop/delete
export async function getPromptShareCount(systemPrompt: string): Promise<number>
// How many agents share this prompt
Redis keys:
- hive:agent:prompt:{agentId} → hash (24h TTL)
- hive:prompt:{hash} → full prompt text (24h TTL)
- hive:prompt:agents:{hash} → SET of agentIds
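The hash itself is tiny; a Python sketch of hashPrompt as the doc describes it (SHA-256, truncated to 16 hex chars):

```python
import hashlib

def hash_prompt(system_prompt: str) -> str:
    """Stable 16-hex-char key for grouping agents that share a system
    prompt (and therefore a vLLM prefix KV cache)."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()[:16]
```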
Prompt Warmer
At agent start/resume, pre-loads the system prompt into the KV cache:
// app/src/lib/prompt-warmer.ts
export async function prewarmAgent(agentId: string, config: Record<string, unknown>): Promise<void>
// Sends: POST /api/generate { model, system: systemPrompt, prompt: ".", options: { num_predict: 1 } }
// Non-blocking, errors logged but not propagated
// Result: first real request ~10x faster
Agent config:
{
"systemPrompt": "You are a customer service agent...",
"prewarmOnStart": true,
"model": { "name": "llama3.2:q8_0" }
}
6. Phase 5: vLLM Integration
systemd Service
# os/config/.../hive-inference-vllm.service
ExecStart=/usr/local/bin/python3 -m vllm.entrypoints.openai.api_server \
  --model /var/lib/hive/models/default \
  --host 127.0.0.1 \
  --port 11434 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.90
# --enable-prefix-caching   → shared KV cache
# --enable-chunked-prefill  → better batching
# --max-num-seqs 64         → continuous batching
# --gpu-memory-utilization  → use 90% of VRAM
Security: NoNewPrivileges=true, ProtectSystem=strict, ProtectHome=true, PrivateTmp=true.
Backend Switcher
// app/src/lib/inference-backend.ts
export type InferenceBackend = "ollama" | "vllm";
const SERVICE_MAP = {
ollama: "hive-inference",
vllm: "hive-inference-vllm",
};
export async function getActiveBackend(): Promise<InferenceBackend>
// systemctl is-active --quiet hive-inference-vllm → "vllm", otherwise "ollama"
export async function getBackendStatus(): Promise<{
backend: InferenceBackend;
running: boolean;
models: string[];
}>
export async function switchBackend(target: InferenceBackend): Promise<boolean>
// 1. systemctl stop current
// 2. systemctl disable current
// 3. systemctl enable target
// 4. systemctl start target
// 5. Health check (30 retries × 2s = 60s timeout)
// 6. Redis: SET hive:inference:backend target
// 7. On failure → restore previous backend
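Step 5's health check is a plain retry loop; a sketch (wait_healthy and the injectable _probe are illustrative, the defaults match the 30 × 2s = 60s budget):

```python
import time
import urllib.request
import urllib.error

def wait_healthy(url: str, retries: int = 30, delay_s: float = 2.0,
                 _probe=None) -> bool:
    """Poll the inference port until it answers 200, up to
    retries × delay_s; _probe is injectable for testing."""
    probe = _probe or (lambda: urllib.request.urlopen(url, timeout=2).status == 200)
    for attempt in range(retries):
        try:
            if probe():
                return True
        except (urllib.error.URLError, OSError):
            pass  # not up yet
        if attempt < retries - 1:
            time.sleep(delay_s)
    return False
```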
API
GET /api/system/inference → { backend, running, models }
POST /api/system/inference (admin) → { backend: "vllm" } → switch
The proxy and agents see no difference: same port 11434.
7. Phase 6: Observability
Stats API
GET /api/system/inference/stats (viewer)
{
backend: "vllm" | "ollama",
tokensLastHour: { totalIn, totalOut, requests },
tokensLastDay: { totalIn, totalOut, requests },
topAgents: [{ id, name, tokensIn, tokensOut, tier, status }], // Top 10
vram: {
gpus: [{ index, name, totalMB, usedMB, freeMB }],
totalMB, usedMB
},
activeAgents: number, // running + paused
concurrentRequests: number // activity keys count
}
VRAM: nvidia-smi --query-gpu=index,name,memory.total,memory.used,memory.free --format=csv.
Per-Agent Usage
GET /api/agents/{id}/usage?period=24h&limit=100 (viewer)
{
agentId, period,
totals: { tokensIn, tokensOut },
byModel: [{ model, totalIn, totalOut, count }],
recent: [{ id, model, tokensIn, tokensOut, durationMs, createdAt }]
}
Periods: 1h, 6h, 24h, 7d, 30d.
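Mapping those period strings to a lookback window is a plain table (sketch; falling back to 24h for unknown values is an assumption):

```python
PERIODS_S = {
    "1h": 3_600,
    "6h": 21_600,
    "24h": 86_400,
    "7d": 604_800,
    "30d": 2_592_000,
}

def period_to_seconds(period: str) -> int:
    """Lookback window in seconds for the usage query."""
    return PERIODS_S.get(period, PERIODS_S["24h"])
```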
8. Full Redis Schema
Pub/Sub Channels:
- hive:agent:status — AgentStatusEvent
- hive:agent:logs — Agent log events
- hive:system:events — SystemEvent
9. API Reference
Agent Lifecycle
System Inference
Internal Auth
The proxy uses Bearer HIVE_INTERNAL_TOKEN for service-to-service calls (auto-resume). This token is recognized in authorize.ts as the operator role without a DB lookup.
10. DB Schema
New agents Fields
ALTER TABLE agents ADD COLUMN inference_tier inference_tier DEFAULT 'medium';
ALTER TABLE agents ADD COLUMN preferred_model VARCHAR(255);
ALTER TABLE agents ADD COLUMN total_tokens_in INTEGER DEFAULT 0;
ALTER TABLE agents ADD COLUMN total_tokens_out INTEGER DEFAULT 0;
ALTER TABLE agents ADD COLUMN last_active_at TIMESTAMP;
New inference_usage Table
CREATE TABLE inference_usage (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
agent_id UUID NOT NULL REFERENCES agents(id),
model VARCHAR(255) NOT NULL,
tokens_in INTEGER NOT NULL DEFAULT 0,
tokens_out INTEGER NOT NULL DEFAULT 0,
duration_ms INTEGER,
created_at TIMESTAMP NOT NULL DEFAULT NOW()
);
CREATE INDEX inference_usage_agent_id_idx ON inference_usage(agent_id);
CREATE INDEX inference_usage_created_at_idx ON inference_usage(created_at);
Enum
CREATE TYPE inference_tier AS ENUM ('low', 'medium', 'high');
All migrations are additive, nullable/default. Zero downtime.
11. Security & Hardening
Critical Bugs Fixed
Bug 1: CID → agentId Mapping Broken
Problem: The proxy looked up hive:vm:cid:{cid} but the start route set hive:vm:instance:{instanceId}. CID (vsock integer) ≠ instanceId (VM string).
Fix:
- Added getVMMetadata() export in firecracker.ts
- createAgentVM returns vsockCID in addition to vmId and ipAddress
- Start route: SET hive:vm:cid:{vmMeta.vsockCID} agentId EX 86400
- Stop route: DEL hive:vm:cid:{vmMeta.vsockCID}
Bug 2: Internal Token Auth 401
Problem: The proxy sent Bearer HIVE_INTERNAL_TOKEN for auto-resume, but authorize("operator") only knew about JWT sessions and API tokens via DB (SHA-256 hash lookup). The internal token was not in apiTokens → always 401.
Fix in authorize.ts:
async function authorizeByToken(token, minimumRole, ip) {
// Internal check BEFORE DB lookup
const internalToken = process.env.HIVE_INTERNAL_TOKEN;
if (internalToken && token === internalToken) {
return {
authorized: true,
session: null,
user: { id: "system", name: "Hive Internal", email: null },
role: "operator",
ip,
};
}
// ... then standard DB lookup
}
Redis KEYS → SCAN
Problem: redis.keys("hive:agent:tokens:*") blocks Redis on large datasets (atomic O(N)).
Fix: scanKeys() helper with iterative SCAN:
export async function scanKeys(pattern: string, count = 100): Promise<string[]> {
const result: string[] = [];
let cursor = "0";
do {
const [nextCursor, keys] = await r.scan(cursor, "MATCH", pattern, "COUNT", count);
cursor = nextCursor;
result.push(...keys);
} while (cursor !== "0");
return result;
}
Used in: token-sync.ts, stats/route.ts.
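The proxy's Python side can avoid KEYS the same way via redis-py's scan_iter, which wraps iterative SCAN (sketch):

```python
def scan_keys(r, pattern: str, count: int = 100) -> list[str]:
    """Collect matching keys with iterative SCAN (redis-py scan_iter),
    never the blocking KEYS command."""
    return list(r.scan_iter(match=pattern, count=count))
```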
TTL on All Redis Keys
Problem: App crash → permanent orphaned keys (hive:agent:paused:*, hive:vm:cid:*, etc.).
Fix: All lifecycle keys have a 24h TTL (EX 86400):
- hive:vm:cid:{cid} — start route
- hive:vm:instance:{instanceId} — start route
- hive:agent:paused:{id} — pause route + idle detector
- hive:agent:prompt:{agentId} — prompt cache
Redis Auth (Password) — Docker
Problem: Redis without a password → any process on the host can read/write.
Fix:
- Run Redis with --requirepass ${REDIS_PASSWORD} (or ACLs), and pass credentials via env/secrets.
- Set REDIS_URL=redis://:PASSWORD@redis:6379 (parsed by ioredis and redis-py).
- Generate a strong password once (e.g. openssl rand -hex 32) and store it in your secrets manager / docker/.env (chmod 600).
Async Proxy (Threading → Asyncio)
Problem: threading caps out at ~200 concurrent connections (GIL + stack memory per thread).
Fix: Complete rewrite in asyncio:
- loop.sock_accept/recv/sendall for vsock
- asyncio.open_connection for upstream TCP
- redis.asyncio for non-blocking Redis
- asyncio.to_thread for sync HTTP (auto-resume)
- server.listen(256) — doubled backlog
- Single event loop → thousands of concurrent connections
Other Security Measures
- Endpoint allowlist: only inference endpoints are allowed through the proxy
- Rate limiting: 120 req/60s per VM
- Body size limit: 16MB max
- vLLM service hardening: NoNewPrivileges, ProtectSystem=strict, ProtectHome=true, PrivateTmp=true
- Redis localhost-only: -p 127.0.0.1:6379:6379
- Audit logs: all pause/resume/start/stop actions are traced
12. Configuration
hive.conf [inference]
[inference]
enabled = auto
backend = ollama
port = 11434
vsock_port = 11434
default_model = llama3.2
auto_sleep_enabled = true
auto_sleep_idle_seconds = 300
token_tracking = true
quantization_routing = false
tier_low_models =
tier_medium_models =
tier_high_models =
vllm_max_sequences = 64
vllm_gpu_memory_util = 0.90
vllm_prefix_caching = true
vllm_speculative_model =
Docker / Compose env (example)
# Redis
REDIS_URL=redis://:password@redis:6379
# Inference
INFERENCE_BACKEND=ollama
INFERENCE_PORT=11434
# Internal auth (service-to-service)
HIVE_INTERNAL_TOKEN=<generated>
# Agent sleeping
AUTO_SLEEP_ENABLED=true
AUTO_SLEEP_IDLE_SECONDS=300
Proxy Environment Variables
VSOCK_PORT=11434
INFERENCE_HOST=127.0.0.1
INFERENCE_PORT=11434
REDIS_URL=redis://:password@localhost:6379
HIVE_INTERNAL_TOKEN=<auto-generated>
HIVE_API_URL=http://localhost:3000
INFERENCE_BACKEND=ollama
13. Modified/Created Files
Created Files (12)
Modified Files (14)
14. Startup & Lifecycle
Startup Sequence (conceptual)
1. Docker Compose / Kubernetes (one-time setup + boot order)
├─ Generate and store secrets (AUTH_SECRET, ENCRYPTION_KEY, HIVE_INTERNAL_TOKEN, Redis password)
├─ Start Redis (auth enabled) + PostgreSQL
├─ Run migrations (init job or app entrypoint)
├─ Start inference service (Ollama/vLLM)
└─ Start `hive-app` behind a reverse proxy
2. hive-app startup
├─ startIdleDetector() → check every 30s
├─ startTokenSync() → flush every 60s
└─ refreshAvailableModels() → cache every 5min
3. Agent lifecycle
├─ POST /api/agents/{id}/start
│ ├─ startVM()
│ ├─ Redis: CID mapping, activity timestamp
│ └─ prewarmAgent() (if config.prewarmOnStart)
│
├─ (idle 5min)
│ └─ Idle detector → pauseVM() → status: paused
│
├─ (inference request arrives at proxy)
│ ├─ Proxy: check hive:agent:paused:{id}
│ ├─ Proxy: POST /api/agents/{id}/resume
│ ├─ Proxy: wait 150ms
│ └─ Proxy: forward to Ollama/vLLM
│
└─ POST /api/agents/{id}/stop
├─ stopVM()
└─ Redis: cleanup CID, instance, activity, paused keys