Evaluating the direct perplexity costs of aggressive parameter quantization for local endpoints.
Ollama typically bundles models in 4-bit (Q4) or 8-bit (Q8) quantized formats. Q4 offers extreme speed and fits massive parameter arrays into tiny memory envelopes, but it introduces arithmetic rounding errors in long-tail complex reasoning.
"For standard coding and text summarization, Q4 maintains ~95% of the accuracy of unquantized FP16, at 25% of the memory footprint."
— Machine Learning Collective
We recommend Q4_K_M for machines with less than 16GB of RAM, and strictly jumping to Q8_0 or FP16 when mathematical rigor or syntax parsing is the primary objective.