Engineering Blog

Engineering: Dynamic KV Cache Memory Offloading

Analyzing the state-tracking adjustments required to preserve context windows when VRAM constraints throttle concurrency.

The KV Cache Multiplier

During concurrent local model execution (multiplexing multiple streams on a single local GPU), the key-value (KV) cache grows linearly with both sequence length and batch size, so doubling either one doubles the cache.

We dissect the memory formula that determines the actual VRAM footprint to pre-allocate. Without cache pruning or flash attention, running two Llama-3 instances side by side quickly triggers VRAM fragmentation warnings.
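As a rough illustration of that footprint, the standard KV cache estimate is 2 (keys and values) x layers x KV heads x head dimension x sequence length x batch size x bytes per element. The sketch below applies it under the assumption that "Llama-3" means the 8B variant (32 layers, 8 KV heads, head dimension 128) stored in fp16. The names `KVCacheShape` and `kv_cache_bytes` are hypothetical helpers for this example, not part of any library, and the result excludes model weights, activations, and allocator overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVCacheShape:
    """Model dimensions that determine KV cache size (hypothetical helper)."""
    n_layers: int
    n_kv_heads: int
    head_dim: int


def kv_cache_bytes(shape, seq_len, batch_size, bytes_per_element=2):
    """Estimate KV cache size in bytes.

    Each layer stores one K and one V tensor of shape
    [batch_size, n_kv_heads, seq_len, head_dim], hence the leading factor of 2.
    """
    return (
        2
        * shape.n_layers
        * shape.n_kv_heads
        * shape.head_dim
        * seq_len
        * batch_size
        * bytes_per_element
    )


# Assumption: Llama-3 8B dimensions (32 layers, 8 KV heads, head dim 128).
LLAMA3_8B = KVCacheShape(n_layers=32, n_kv_heads=8, head_dim=128)

# One stream vs. two concurrent streams at an 8K context, fp16 (2 bytes).
one_stream = kv_cache_bytes(LLAMA3_8B, seq_len=8192, batch_size=1)
two_streams = kv_cache_bytes(LLAMA3_8B, seq_len=8192, batch_size=2)

print(one_stream / 2**30, two_streams / 2**30)  # 1.0 2.0 (GiB)
```

Under these assumptions each 8K-token stream needs about 1 GiB of cache in fp16, so two streams claim roughly a quarter of an 8 GB card before weights are counted. Passing `bytes_per_element=1` models an 8-bit cache and halves the figure.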

Flash Attention and Layer Offloading Dynamics

By implementing 8-bit, layer-by-layer offloading, Duplex maintains context windows of up to 8K tokens on typical 8 GB consumer GPUs, preserving performance without having to cold-start inference.
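One way to picture layer-by-layer offloading is as a placement decision: keep as many layers' caches in VRAM as a budget allows and mark the rest for host memory. The sketch below is only that planning step, not Duplex's actual implementation. `plan_layer_offload` is a hypothetical function, the 512 MiB cache budget is an invented figure for illustration, and the per-layer size assumes Llama-3 8B dimensions (8 KV heads, head dimension 128) with 8-bit cache entries.

```python
def plan_layer_offload(n_layers, per_layer_bytes, vram_budget_bytes):
    """Split layer indices into GPU-resident and host-offloaded lists.

    Hypothetical planner: the leading layers stay resident while they fit in
    the budget; every remaining layer is marked for offload to host memory.
    """
    if per_layer_bytes <= 0:
        raise ValueError("per_layer_bytes must be positive")
    fits = max(0, vram_budget_bytes // per_layer_bytes)
    resident_count = min(n_layers, fits)
    resident = list(range(resident_count))
    offloaded = list(range(resident_count, n_layers))
    return resident, offloaded


# Assumed per-layer KV size: K and V, 8 KV heads, head dim 128,
# 8K tokens, 1 byte per element (8-bit), 2 concurrent streams.
per_layer = 2 * 8 * 128 * 8192 * 1 * 2   # 32 MiB per layer

# Hypothetical cache budget left over after weights: 512 MiB.
budget = 512 * 2**20

resident, offloaded = plan_layer_offload(32, per_layer, budget)
print(len(resident), len(offloaded))  # 16 16
```

A real system would also need to schedule the transfers so that an offloaded layer's cache arrives back on the GPU before that layer runs. This sketch leaves that part out.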