Long-context KV-cache quantization is producing counter-intuitive results in private LLM inference, flipping expectations about FP16 versus quantized caches. In a focused benchmark, the best-performing combination was a K=f16 / V=q5_0 cache on Qwen3-30B-A3B-Instruct-2507 with Unsloth Dynamic 2.0 Q4_K_XL quantization, scoring 16.79% against a 13.74% FP16/FP16 baseline [1]. The test covers 16k–51k tokens across 131 samples from LongBench-v2. The discrepancy, a full-precision cache underperforming a quantized one, has readers questioning the setup, dataset filtering, and answer extraction; the author says the process is detailed in the repo README and is asking for feedback [1]. The test ran on the llama.cpp server across 16 cache-type-k and cache-type-v combinations [1].
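For readers who want to run a similar sweep, here is a minimal sketch of driving llama.cpp's server across K/V cache-type pairs. The --cache-type-k, --cache-type-v, -m, -c, and --port flags are real llama-server options; the model filename, the 4x4 grid of cache types, the fixed startup wait, and the benchmark step are illustrative assumptions, not the author's harness.

```python
# Minimal sketch, assuming llama-server is on PATH and a local GGUF model file.
# The 4 x 4 grid of cache types is an assumption matching the "16 combos" in [1].
import itertools
import subprocess
import time

MODEL_PATH = "Qwen3-30B-A3B-Instruct-2507-UD-Q4_K_XL.gguf"  # hypothetical filename
CACHE_TYPES = ["f16", "q8_0", "q5_1", "q5_0"]  # 4 x 4 = 16 combinations

for k_type, v_type in itertools.product(CACHE_TYPES, CACHE_TYPES):
    # Depending on the llama.cpp build, a quantized V cache may also require
    # flash attention to be enabled (-fa).
    server = subprocess.Popen([
        "llama-server",
        "-m", MODEL_PATH,
        "-c", "65536",              # room for the 16k-51k token samples
        "--cache-type-k", k_type,   # K cache precision
        "--cache-type-v", v_type,   # V cache precision
        "--port", "8080",
    ])
    time.sleep(60)  # crude startup wait; polling /health would be more robust
    try:
        # A real harness would POST the LongBench-v2 prompts to
        # http://localhost:8080/v1/chat/completions and score the answers.
        print(f"K={k_type} V={v_type}: run the LongBench-v2 subset here")
    finally:
        server.terminate()
        server.wait()
```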
Separately, hype around LongCat-Flash-Thinking spotlights memory-aware, open-source models. The model's Hugging Face card claims it needs 64.5% fewer tokens to reach top-tier accuracy on AIME25 with native tool use, plus a 3x speedup for asynchronous reinforcement learning over synchronous frameworks [2]. The chatter underlines that practical local deployments benefit from memory-friendly approaches, not just raw speed [2].
Closing thought: these threads show that benchmarking isn't optional when you run private LLMs with long contexts. The choice between FP16 and quantized KV caches isn't just about fidelity; it also comes down to memory, tooling, and how you test in real workloads [1][2].
References
[1] Getting counter-intuitive results with local KV Cache Quantization Benchmark - am I doing something wrong?
Discusses long-context KV-cache quantization benchmarks (16k–51k tokens) across models, debating FP16 vs quantized caches along with statistical-significance and methodology concerns.
[2] LongCat-Flash-Thinking
Promotes LongCat-Flash-Thinking as a fast, efficient, open-source model; discusses quantization, memory needs, and comparisons to DeepSeek, Qwen, and GLM.