Engineering: Analyzing Q4_K_M vs Q8_0 Formats

Evaluating the perplexity cost of aggressive weight quantization for local inference endpoints.

Bits over Bytes

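Ollama typically distributes models in 4-bit (Q4, such as Q4_K_M) or 8-bit (Q8, such as Q8_0) quantized formats. A Q4 model fits a large parameter count into a small memory budget and usually generates tokens faster, because local inference is often limited by memory bandwidth. The trade-off is that storing each weight with fewer bits introduces rounding error, which tends to surface in long-tail tasks that require complex, multi-step reasoning.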
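To make the rounding error concrete, here is a minimal sketch of block-wise symmetric quantization at 4 and 8 bits. It is a simplification for illustration, not the actual Q4_K_M or Q8_0 storage layout: the function names, the block size of 32, the single per-block scale, and the Gaussian test weights are all assumptions.

```python
import random


def quantize_roundtrip(weights, bits, block_size=32):
    """Quantize each block to signed integers sharing one scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    restored = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        amax = max(abs(w) for w in block)
        # One scale per block maps the largest magnitude onto qmax.
        scale = amax / qmax if amax > 0 else 1.0
        for w in block:
            q = max(-qmax, min(qmax, round(w / scale)))
            restored.append(q * scale)
    return restored


def rms_error(original, restored):
    """Root-mean-square difference between original and restored weights."""
    total = sum((a - b) ** 2 for a, b in zip(original, restored))
    return (total / len(original)) ** 0.5


random.seed(0)
# Hypothetical weights: small zero-mean Gaussian values.
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

err4 = rms_error(weights, quantize_roundtrip(weights, bits=4))
err8 = rms_error(weights, quantize_roundtrip(weights, bits=8))
print(f"4-bit RMS error: {err4:.6f}")
print(f"8-bit RMS error: {err8:.6f}")
```

With only 15 representable levels per block instead of 255, the 4-bit round trip leaves a noticeably larger error per weight. Real formats reduce this with finer-grained scales and mixed precision, so the gap in practice is smaller than this naive sketch suggests.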

"For standard coding and text summarization, Q4 maintains ~95% of the accuracy of unquantized FP16, at 25% of the memory footprint."

— Machine Learning Collective
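The 25% figure matches the nominal ratio of 4 bits to 16 bits. As a rough back-of-the-envelope check, the sketch below estimates weight memory from bits per weight. The values are assumptions for illustration: Q8_0 stores 32 int8 weights plus one 16-bit scale per block (8.5 bits per weight), while the Q4_K_M figure is an approximate average that varies by model. The 7B parameter count is hypothetical, and the estimate ignores the KV cache and runtime overhead.

```python
# Approximate storage cost per weight, in bits (illustrative values).
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,     # 32 int8 weights + one fp16 scale = 34 bytes per 32 weights
    "Q4_K_M": 4.85,  # assumed average; the exact value depends on the model
}


def weight_memory_gb(n_params, fmt):
    """Estimate the memory needed for the weights alone, in gigabytes."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


n_params = 7e9  # hypothetical 7B-parameter model
for fmt in BITS_PER_WEIGHT:
    gb = weight_memory_gb(n_params, fmt)
    ratio = BITS_PER_WEIGHT[fmt] / BITS_PER_WEIGHT["FP16"]
    print(f"{fmt:7s} {gb:6.2f} GB ({ratio:.0%} of FP16)")
```

Under these assumptions, the per-block scales and mixed-precision tensors push Q4_K_M somewhat above the nominal 25%, to roughly 30% of the FP16 size.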

We recommend Q4_K_M for machines with less than 16 GB of RAM, and stepping up to Q8_0 or FP16 only when mathematical rigor or syntax parsing is the primary objective.