Enabling 8B Bitwise Autoregressive Image Generation on Edge GPUs
Authors: Vezzali, Enrico; Bolelli, Federico; Grana, Costantino; Benini, Luca; Li, Yawei
Visual Autoregressive (VAR) models face a severe "Memory Wall" on edge devices due to large model size and substantial KV-cache … (Read full abstract)
Visual Autoregressive (VAR) models face a severe "Memory Wall" on edge devices due to large model size and substantial KV-cache requirements. In this work, we analyze the Infinity VAR family (2B and 8B) and propose a compression pipeline for deployment on constrained NVIDIA Jetson systems. We diagnose critical bottlenecks: activation outliers reaching 353x the median and channel-skewed cache variance. To address this, we propose a hybrid pipeline combining SVDQuant—to structurally decouple weight outliers—and Asymmetric Per-Channel KV8 quantization. Our approach reduces the Infinity-8B footprint by 64% (37.1GB →13.3GB), fitting it on the mid-range Orin NX with a 4.1x speedup over Flux.1-dev (W4A4), while achieving superior aesthetic alignment (ImageReward 1.13 vs 0.935). Crucially, we also unlock entry-level feasibility for the Infinity-2B, compressing it from 16.0 to 7.71 GB to enable deployment on the Orin Nano. These results establish a new efficiency standard for high-fidelity generative AI at the edge. The code is available at https://github.com/Henvezz95/deepcompressor.