VPTQ + TurboQuant (70B on consumer GPU)
Source: docs/VPTQ_TURBOQUANT.md
Pilox integrates VPTQ (Microsoft) and TurboQuant (Google, ICLR 2026) to run large language models on consumer hardware that previously could not host them.
The Problem
A 70B parameter model requires ~140GB in FP16 — far beyond any consumer GPU. Even Q4 quantization needs ~35GB. This forces teams to either pay for cloud A100s or settle for smaller, less capable models.
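The figures above follow directly from parameter count times bit width; a quick sketch (decimal GB):

```python
GB = 1e9  # decimal gigabytes, as used in the text

def weight_bytes(params: int, bits_per_weight: int) -> float:
    """Bytes needed to store `params` weights at the given bit width."""
    return params * bits_per_weight / 8

params_70b = 70_000_000_000
fp16_gb = weight_bytes(params_70b, 16) / GB  # 140.0 -- FP16 baseline
q4_gb = weight_bytes(params_70b, 4) / GB     # 35.0  -- 4-bit quantization
```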
The Solution: Dual Compression
VPTQ compresses the model weights to ~2 bits; TurboQuant compresses the KV cache to 3 bits at runtime. Combined, a 70B model shrinks from ~140GB to roughly 18GB, so an RTX 3080 (10GB) plus 32GB of system RAM can run it with GPU offloading.
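A back-of-envelope estimate of the combined footprint, using the bit widths above and assuming Llama-3.1-70B's architecture (80 layers, 8 KV heads, head dim 128 — these shape numbers are our assumption, not stated in this doc):

```python
GB = 1e9

# Weights: 70B parameters at VPTQ's ~2 bits each.
weights_gb = 70_000_000_000 * 2 / 8 / GB  # 17.5

# KV cache bits per token: K and V, per layer, per KV head, per head dim,
# at TurboQuant's 3 bits. Shape constants are an assumed Llama-3.1-70B config.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
kv_bits_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * 3

kv_gb_8k = kv_bits_per_token / 8 * 8192 / GB  # ~0.5 GB at an 8K context

total_gb = weights_gb + kv_gb_8k  # ~18 GB: fits in 10 GB VRAM + 32 GB RAM
```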
How It Works in Pilox
Architecture
```
User clicks "Pull" on 70B model
        |
        v
Models API detects VPTQ variant on HuggingFace
        |   (e.g. VPTQ-community/Llama-3.1-70B-VPTQ-2bit)
        v
vLLM loads model with --quantization vptq --cpu-offload-gb auto
        |
        v
TurboQuant compresses KV cache at runtime (3-bit)
        |
        v
Agent workflow calls vLLM -> inference on 10GB GPU + RAM offload
```
Supported Models (pre-quantized on HuggingFace)
VRAM Calculator
The Models page includes a real-time VRAM calculator:
- Green: Model fits entirely in GPU VRAM
- Yellow: Model fits with CPU RAM offload (slower but functional)
- Red: Insufficient total memory
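The traffic-light logic reduces to two comparisons; a minimal sketch (the function name and signature are illustrative, not the calculator's actual code):

```python
def vram_status(model_gb: float, vram_gb: float, ram_gb: float) -> str:
    """Classify a model against available memory, mirroring the calculator."""
    if model_gb <= vram_gb:
        return "green"   # fits entirely in GPU VRAM
    if model_gb <= vram_gb + ram_gb:
        return "yellow"  # fits with CPU RAM offload (slower but functional)
    return "red"         # insufficient total memory
```

A 2-bit 70B model (~17.5 GB) on an RTX 3080 (10 GB) with 32 GB RAM lands in yellow.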
Configuration
Environment variables in docker-compose.local.yml:
```yaml
vllm:
  environment:
    VLLM_QUANTIZATION: auto           # Detects VPTQ automatically
    VLLM_CPU_OFFLOAD_GB: auto         # Auto-detect from available RAM
    VLLM_ENABLE_PREFIX_CACHING: true  # Long-context efficiency
    VLLM_KV_CACHE_DTYPE: turboquant   # TurboQuant 3-bit KV cache
```
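How the `auto` offload value could be resolved — a sketch under assumptions (the headroom reserve and this resolution logic are illustrative, not vLLM's or Pilox's actual behavior):

```python
def resolve_cpu_offload_gb(setting: str, model_gb: float,
                           vram_gb: float, free_ram_gb: float,
                           headroom_gb: float = 4.0) -> float:
    """Resolve VLLM_CPU_OFFLOAD_GB: numeric values pass through; 'auto'
    offloads only what exceeds VRAM, capped by free RAM minus a headroom
    reserve for the OS and other processes (4 GB here is an assumption)."""
    if setting != "auto":
        return float(setting)
    needed = max(0.0, model_gb - vram_gb)
    budget = max(0.0, free_ram_gb - headroom_gb)
    return min(needed, budget)
```

An 18 GB model on a 10 GB GPU with 32 GB of free RAM would resolve to 8 GB of CPU offload.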
Performance Expectations
On an RTX 3080 (10GB VRAM) with 32GB RAM:
Canvas Copilot
The built-in Canvas Copilot can use any loaded model. With a 70B model via VPTQ:
- Better JSON format compliance
- More accurate node suggestions
- Deeper reasoning about workflow architecture
Enable in Settings > LLM Providers > Canvas Copilot or during the setup wizard.
Why This Matters
No other self-hosted agent platform offers 70B inference on consumer hardware. This means:
- Privacy: Enterprise data never leaves your infrastructure
- Cost: No cloud GPU bills ($2-8/hr for A100)
- Latency: Local inference, no network round-trips
- Autonomy: No vendor lock-in, no API rate limits