Promises collide with performance in the LLM world. OpenAI stands accused of changing GPT-4 without disclosure, and Anthropic faces similar questions over Claude 4.5 that it has not acknowledged, fueling calls for API version pinning and update transparency [1].
Degradation vs Transparency — Launch hype often gives way to shifts in capability as vendors tweak models. Updates to GPT-4 are alleged to slip in without notice, stirring frustration and demands for clearer change logs [1].
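API version pinning here means addressing a dated model snapshot instead of a floating alias, so a silent server-side update cannot change behavior under your feet. A minimal sketch using the OpenAI Python SDK; the snapshot name is illustrative, and the snapshots actually available should be checked against the models endpoint:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A floating alias like "gpt-4o" can be updated silently by the vendor;
# a dated snapshot (name illustrative) stays fixed until it is retired.
PINNED_MODEL = "gpt-4o-2024-08-06"

response = client.chat.completions.create(
    model=PINNED_MODEL,
    messages=[{"role": "user", "content": "Summarize this changelog in one line."}],
)

# Log the model string the server reports, so any regression can be tied
# to a specific snapshot rather than to an opaque, mutable alias.
print(response.model)
print(response.choices[0].message.content)
```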
Hype vs Reality: Minimax M2 vs GLM-4.6 — Minimax M2 has drawn praise for how impressive it looks at first pass, but real coding work exposes hallucinations and inconsistency on longer tasks [2]. In another take, GLM-4.6 is reported to beat M2 on some benchmarks, though hands-on projects with M2 remain scarce [3].
Local Benchmarks: Small Models on Real Tasks — On a local CLI agent, the lineup looks like this (a minimal tool-calling probe is sketched after the list):

- Qwen3:4b leads the pack for tool-using prompts [4]
- Llama3.2:3b is solid but needs careful prompting to pick the right tools [4]
- Granite3.3:8b can be excellent when it works, but reliability varies [4]
- Qwen3:0.6b struggles with complex toolchains [4]
- Phi4:14b can’t use tools in this setup [4]
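For readers who want to run the same kind of check locally, here is a rough sketch of a single tool-calling probe against a default Ollama install; the endpoint is Ollama's standard chat API, but the prompt, the get_file_size tool, and the pass/fail criterion are illustrative assumptions, not the author's harness.

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"  # default local Ollama endpoint

# A single illustrative tool; the benchmark post does not publish its toolset.
tools = [{
    "type": "function",
    "function": {
        "name": "get_file_size",
        "description": "Return the size of a file in bytes",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

def probe(model: str) -> bool:
    """Return True if the model emits a tool call for a prompt that needs one."""
    resp = requests.post(OLLAMA_URL, json={
        "model": model,
        "messages": [{"role": "user", "content": "How large is README.md?"}],
        "tools": tools,
        "stream": False,
    }, timeout=300)
    resp.raise_for_status()
    message = resp.json().get("message", {})
    return bool(message.get("tool_calls"))

for model in ["qwen3:4b", "llama3.2:3b", "granite3.3:8b", "qwen3:0.6b", "phi4:14b"]:
    print(model, "tool call emitted" if probe(model) else "no tool call")
```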
Open Source Wins: 86% SimpleQA with gpt-4.1-mini — SGR Deep Research reports 86.1% on SimpleQA using gpt-4.1-mini, with ~500 LOC, no LangChain or CrewAI bloat, and a cost of around $0.03 per query [5]. It demonstrates that a small-model architecture can compete when structured reasoning is part of the design [5].
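The structured-reasoning idea is roughly this: constrain the model's output to a schema that forces intermediate findings before the final answer, so a small model cannot skip straight to a guess. Below is a sketch of that pattern with the OpenAI SDK's structured-output parsing and Pydantic; the schema fields are assumptions for illustration, not SGR Deep Research's actual code:

```python
from pydantic import BaseModel
from openai import OpenAI

class ResearchStep(BaseModel):
    question: str  # sub-question the model decides to investigate
    finding: str   # what it concluded for that sub-question

class StructuredAnswer(BaseModel):
    steps: list[ResearchStep]  # forced intermediate reasoning
    final_answer: str          # only filled after the steps

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": "Who discovered the element radium?"}],
    response_format=StructuredAnswer,
)

answer = completion.choices[0].message.parsed
for step in answer.steps:
    print(f"- {step.question} -> {step.finding}")
print("Answer:", answer.final_answer)
```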
Bottom line: hype is loud, but real-world, transparent benchmarking is what actually matters, and that debate is only going to get louder as updates keep rolling out.
References
[1] Argues launches outperform at release and then degrade; accuses vendors of undisclosed model changes; calls for API pinning and transparency, citing Claude 4.5 and GPT-4.
[2] "The performance of Minimax-m2 is truly impressive!" M2 looks impressive visually but fails real coding, hallucinates, and is unreliable for serious development; MiniMax touts Claude Code integration.
[3] "GLM-4.6 vs Minimax-M2." Users compare GLM-4.6 and Minimax M2 on coding tasks, benchmarks, and real-world use; opinions vary.
[4] "Tested a few small models on a local CLI agent. I was surprised by the results." Benchmarks small LLMs for a local CLI agent; compares Qwen3:4b, Llama3.2:3b, Granite3.3:8b, and Phi4:14b on tool calling and latency.
[5] "86% accuracy on SimpleQA with gpt-4.1-mini. Open-source deep research agent." The open-source framework SGR Deep Research enables structured reasoning with small LLMs; benchmarks show 86.1% on SimpleQA at low cost; ROMA dominates.