
Benchmarking Myths: The Real-World Tradeoffs of Speed, Cost, and Usefulness in LLMs


Benchmark chaos in LLMs is real: official benchmarks clash with everyday usefulness. Real work happens in tokens, costs, and hardware quirks, not glossy numbers.

Chaos in Benchmarks — From the claim that benchmarks are a “bad joke” to wild variations across prompts and models, the landscape feels like the Wild West. Prompts become tightly coupled to models, upgrades break porting, and human ratings can be gamed [1].

Token Cost Metrics — Tokuin is a CLI that estimates tokens and costs across OpenAI, Gemini, and Anthropic-style models, and can run load tests with dry runs [2]. It auto-detects providers; tracks latencies, histograms, and costs; and bases token estimation on tiktoken-rs plus a simple pricing registry [2]. Tooling like this pushes cost visibility beyond buzzwords and into real testing signals.
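Tokuin itself is a Rust tool, but the core idea (estimate tokens, look up a price, multiply) is easy to illustrate. Below is a minimal Python sketch; the chars-per-token heuristic and the prices in `PRICING_PER_1K` are illustrative assumptions, not Tokuin's actual tokenizer or real provider rates.

```python
# Minimal token/cost estimator sketch (illustrative, not Tokuin itself).
PRICING_PER_1K = {  # USD per 1,000 input tokens -- made-up example rates
    "gpt-4o-mini": 0.00015,
    "claude-haiku": 0.00025,
}

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose.
    # Real tools (tiktoken-rs in Tokuin's case) use the model's actual BPE.
    return max(1, len(text) // 4)

def estimate_cost(text: str, model: str) -> float:
    # Tokens scaled by the per-1K price from the registry.
    return estimate_tokens(text) / 1000 * PRICING_PER_1K[model]

prompt = "Summarize the latest LLM benchmark debate in three bullet points."
print(estimate_tokens(prompt), f"${estimate_cost(prompt, 'gpt-4o-mini'):.6f}")
```

A real pricing registry also has to distinguish input from output tokens and track per-model context limits, which is where a dedicated tool earns its keep over a one-liner.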

Hardware Reality — A thread reports Qwen3-1.7B delivering roughly the same ~5 t/s on a 3090 as on a 3050, despite the 3050's tiny 6GB memory and far lower bandwidth (168GB/s vs the 3090's 936GB/s) [3]. One commenter adds a blunt take: “Dont use Transformers, it is basically the slowest way to run. It is so slow that gpu doesnt matter..” [3]. The takeaway: hardware speed is real, but how you run the model (libraries, deployment) can swamp raw bandwidth.
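The commenter's point survives a back-of-envelope check. Single-stream decode is typically memory-bandwidth-bound: each token requires streaming every weight from VRAM once, so bandwidth divided by model size gives a rough throughput ceiling. The sketch below uses the bandwidth figures from the thread and an assumed FP16 weight size (1.7B params × 2 bytes ≈ 3.4 GB); it ignores KV cache and activation traffic, so it is an optimistic upper bound.

```python
# Rough roofline ceiling for memory-bandwidth-bound LLM decode.
def ceiling_tokens_per_sec(bandwidth_gbps: float, params_billions: float,
                           bytes_per_param: int = 2) -> float:
    # Each decoded token reads every weight from VRAM once (FP16 = 2 bytes).
    model_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_gbps * 1e9 / model_bytes

# Bandwidth numbers as cited in the thread [3]; model size is an assumption.
for name, bw in [("RTX 3090", 936.0), ("RTX 3050", 168.0)]:
    print(f"{name}: ~{ceiling_tokens_per_sec(bw, 1.7):.0f} t/s ceiling")
```

Both ceilings (roughly 275 t/s and 49 t/s under these assumptions) sit far above the observed ~5 t/s, which is consistent with the thread's diagnosis: the software stack, not the GPU, is the bottleneck.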

Bottom line: speed, cost, and usefulness don’t always align. The real signal comes from direct testing and practical tooling, not glossy benchmarks.


References

[1]
HackerNews

AI benchmarks are a bad joke – and LLM makers are the ones laughing

LLM benchmarks are chaotic and prompts vary by model; the author prefers Gemini, criticizes A/B testing and human evals, and calls for better, causal benchmarking methods.

[2]
HackerNews

Rust CLI that estimates tokens and costs and runs load tests against OpenAI, Anthropic, and others; reports latency, costs, and retries, and provides a detailed overview.

[3]
Reddit

How come my 3090 is just as fast as my 3050 for Qwen3-1.7B?

User compares 3090 vs 3050 throughput for Qwen3-1.7B; responses discuss GPU setup, engines, FP16/BF16, batching, and vLLM.

