
Open Weights vs Hosted LLMs: Trust, Quantization, and the Benchmark Quandary

1 min read
229 words
Topics: Opinions on LLMs, Open Weights, Hosted Providers

Open weights vs hosted LLMs collide over trust, quantization, and the benchmark mess. The big question: will third-party hosting lobotomize a model to chase speed or cost, or can you rely on open-weight tinkering to stay in control? The debate pops up in threads about DeepInfra, AtlasCloud, and Moonshot AI.

Trust and Lobotomy Fears

Trust is the thinnest tightrope. People worry that providers can change a model behind your back, and the line "providers can make silent changes at any point" captures the fear. One post even jokes about Achmed Al-Jibani from Qatar.[1]

Quantization Snapshot Across Providers

- DeepInfra: FP4
- Baseten: FP4
- Novita: FP8
- SiliconFlow: FP8
- Fireworks: FP8
- AtlasCloud: FP8
- Together: not stated
- OpenRouter: shows precision when the provider states it
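To see why providers reach for FP4 or FP8 at all, a rough memory estimate helps. A minimal sketch of the arithmetic, assuming a hypothetical 1-trillion-parameter model and idealized bytes-per-weight (ignoring quantization scales, activations, and KV cache, which add real overhead):

```python
# Rough VRAM needed just to hold model weights at different precisions.
# Bytes per parameter: FP16 = 2, FP8 = 1, FP4 = 0.5 (idealized).
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}

def weight_gb(num_params: float, precision: str) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes) at the given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# Illustrative 1e12-parameter model (not a vendor figure):
for p in ("FP16", "FP8", "FP4"):
    print(f"{p}: {weight_gb(1e12, p):,.0f} GB")
# FP4 halves the footprint of FP8, which halves FP16 -- hence the
# temptation for hosts optimizing for speed and cost.
```

Halving the weight footprint also roughly halves the memory bandwidth per token, which is where the speed and cost incentive to quantize comes from.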

Kimi K2 0905 results from third-party vendors vary with quantization; Moonshot AI discussions frame the debate around how close providers' 8-bit (FP8) serving really is to the model's native FP8 weights.[2]

Benchmarks Under Fire

The current state of LLM benchmarks is polluted: cherry-picking, contamination, and non-reproducible methods plague the scene, and API results fluctuate as providers' quants shift.[3]
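One practical mitigation the threads point toward is recording provider and precision metadata alongside every benchmark score, so a result can be discarded when the serving stack changes underneath it. A minimal sketch, where the field names, provider, model ID, and score are all hypothetical illustrations:

```python
import datetime
import hashlib
import json

def benchmark_record(provider: str, model: str, precision: str,
                     score: float, eval_set: str = "evals.jsonl") -> dict:
    """Bundle a benchmark score with the serving metadata needed to
    reproduce it, or invalidate it when the provider changes quants."""
    return {
        "provider": provider,        # e.g. "DeepInfra" (hypothetical run)
        "model": model,
        "precision": precision,      # as stated by the provider, or "unknown"
        "score": score,
        "eval_set": eval_set,        # pin the exact prompt set used
        "run_date": datetime.date.today().isoformat(),
        # Hash the serving config: two runs are comparable only if it matches.
        "config_hash": hashlib.sha256(
            f"{provider}:{model}:{precision}:{eval_set}".encode()
        ).hexdigest()[:12],
    }

rec = benchmark_record("DeepInfra", "kimi-k2-0905", "FP4", 0.71)
print(json.dumps(rec, indent=2))
```

The hash is deliberately coarse: it does not prove what the provider actually ran, but it does prevent silently averaging scores across different quants of the "same" model.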

Cost vs Reliability and Takeaway

- "Fast, cheap, accurate: you can only pick two." The trade-off is real in practice.[1]
- Self-hosting remains appealing for tinkering, but most users can't sustain reasonable token rates on their own hardware.[1]

Watch for independent benchmarks and provider transparency as 2025 unfolds.

References

[1] Reddit: "How am I supposed to know which third party provider can be trusted not to completely lobotomize a model?" Discussion comparing open-weight LLMs with hosted providers; concerns about quantization, reliability, and trust in third-party hosting for performance and cost trade-offs.

[2] Reddit: "Apparently all third party providers downgrade, none of them provide a max quality model" Debate about model quantization (FP4/FP8), OpenRouter vs direct providers, accuracy versus cost, transparency, and benchmark validity in the LLM ecosystem.

[3] Reddit: "The current state of LLM benchmarks is so polluted" Discusses polluted benchmarks and advocates independent, open benchmarks and real-world performance tracking across LLMs and APIs.
