The economics of LLMs is shifting fast: smaller memory footprints, smarter quantization, and Mixture-of-Experts (MoE) designs can beat raw parameter count on price. MiniMax-M2 sits at the center of the current chatter, with a price-to-performance story that is still evolving [1].
• MiniMax-M2: reportedly 230B-A10B (230B total parameters, roughly 10B active per token); it scores 58.3% on SVGBench, ranking 10th overall and 2nd among open-weight models, and observers say it's less benchmaxxed than M1, hinting at real gains [2].
• gpt-oss-120B on an RX 6900 XT 16GB reaches about 19 tokens/sec with CPU offloading via llama.cpp; the MoE architecture activates only 5.1B parameters per token. 64GB of system RAM helps, and 58-67GB of combined memory is close to the practical minimum (a back-of-envelope throughput sketch follows this list) [3].
• GLM 4.5/4.6 are at the center of a quantization-sensitivity debate; in at least one reported case, a docker-compose misconfiguration served the wrong model/quant and caused the apparent quality drop, and the corrected setup performed in line with expectations (a quick way to verify the served model is sketched below) [4].
• Throughput vs. size: on small GPUs, a 4B model at FP16 needs roughly 8 GB of weights while an 8B model at Q4 fits in roughly 4 GB, so a bigger model with heavier quantization can beat a smaller model with lighter quantization, depending on quality goals (the arithmetic is sketched right after this list) [5].
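The size-versus-quant arithmetic in the last bullet is just parameter count times bits per weight. A minimal sketch, ignoring KV cache, activations, and quantization metadata (the ~4.8 bits/weight figure for Q4_K_M is an approximation):

```python
# Rough weight-memory footprint: params * bits_per_weight / 8 bytes.
# Ignores KV cache, activations, and quant metadata overhead.

def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB for a given quantization."""
    return params_billions * bits_per_weight / 8

print(f"4B @ FP16 (16 bit): {weights_gb(4, 16):.1f} GB")   # 8.0 GB
print(f"8B @ Q4 (~4 bit):   {weights_gb(8, 4):.1f} GB")    # 4.0 GB
print(f"8B @ Q4_K_M (~4.8): {weights_gb(8, 4.8):.1f} GB")  # 4.8 GB
```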
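The ~19 tok/s figure for gpt-oss-120B also falls out of simple arithmetic if you assume decode is memory-bandwidth bound, since only the 5.1B active parameters are streamed per token. A back-of-envelope sketch: the 4.25 bits/weight matches 4-bit MXFP4 with one shared 8-bit scale per 32 weights, and the 50 GB/s effective bandwidth is an illustrative assumption, not a measurement:

```python
# Memory-bandwidth-bound decode: tokens/sec ~= effective bandwidth
# divided by bytes streamed per token (the *active* parameters only).

def decode_tok_per_sec(active_params_b: float, bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    gb_per_token = active_params_b * bits_per_weight / 8  # GB read per token
    return bandwidth_gb_s / gb_per_token

# gpt-oss-120B activates 5.1B params/token; MXFP4 is ~4.25 bits/weight.
# 50 GB/s is an ASSUMED effective bandwidth for a mixed GPU+CPU setup.
print(f"{decode_tok_per_sec(5.1, 4.25, 50):.1f} tok/s")  # ~18.5 tok/s
```

At roughly 2.7 GB streamed per token, anything that raises effective bandwidth, such as keeping the hottest tensors in the 16GB of VRAM and offloading the rest, moves throughput directly.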
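And on the GLM misconfiguration: when a container stack silently serves the wrong checkpoint or quant, the cheapest sanity check is to ask the server what it thinks it is running before debugging quality. A minimal sketch against an OpenAI-compatible endpoint such as vLLM's (the localhost URL and port are assumed local defaults):

```python
# List the model IDs an OpenAI-compatible server (vLLM, llama.cpp's
# llama-server, etc.) is actually serving. A wrong docker-compose tag
# or volume mount shows up here before any quality debugging starts.
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local vLLM default

with urllib.request.urlopen(f"{BASE_URL}/v1/models") as resp:
    for model in json.load(resp)["data"]:
        print(model["id"])  # should name the intended model/quant
```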
Bottom line: LLM price-performance keeps moving as quantization, MoE, and hardware unlocks collide; more real-world experiments will decide the winners.
References
[1] MiniMax-M2: Intelligence, Performance and Price Analysis. Discusses MiniMax-M2's intelligence, performance, and price; compares it to other LLMs; evaluates suitability and cost.
[2] MiniMax M2 is 230B-A10B. Discusses MiniMax M2 benchmarks; compares it to Opus, Grok, and OSS 120b; debates benchmarks, open weights, pricing, and MoE approaches.
[3] Optimizing gpt-oss-120B on AMD RX 6900 XT 16GB: Achieving 19 tokens/sec. Describes running the gpt-oss-120B MoE on an RX 6900 XT with CPU offloading and AVX-512, reaching 19 tokens/second; compares llama.cpp and Ollama.
[4] Is GLM 4.5 / 4.6 really sensitive to quantisation? Or is vLLM stupifying the models? Discusses GLM quantization effects, vLLM usage, and Air versus full models; compares quality, precision, and pruning impacts in practice.
[5] 4B fp16 or 8B q4? Discusses 4B FP16 versus 8B Q4 quantization tradeoffs and model comparisons (Qwen3, LFM2, OSS-20B) for a basic chat replacement.