Benchmark battles are reshaping how we compare LLMs in the wild. Real-world constraints—speed, cost, and context windows—are guiding debates from agentic coding to multi-turn chats.
Agentic Coding Benchmarks — One post flags that up-to-date benchmarks for agentic coding are scarce, noting that gso-bench has fallen behind and Grok is missing. [1]
JanitorBench — A newly proposed benchmark for evaluating LLMs on multi-turn chats. [2]
GPT-4o vs GPT-4o-Mini — A quick experiment shows GPT-4o and GPT-4o-Mini rank the same content differently, with the smaller model applying subtly different evaluation criteria. [3]
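To make that setup concrete, here is a minimal sketch of how such a ranking comparison could be run with the OpenAI Python SDK; the prompt, article titles, and decoding settings are illustrative assumptions, not the original experiment's code.

```python
# Sketch: ask two models to rank the same items and compare the orderings.
# Assumes the OpenAI Python SDK (`pip install openai`) and OPENAI_API_KEY set;
# the articles and prompt below are placeholders, not the experiment's data.
from openai import OpenAI

client = OpenAI()

articles = [
    "A: Continuous batching in vLLM",
    "B: KV-cache paging explained",
    "C: Benchmarking agentic coding models",
]
prompt = (
    "Rank the following articles from most to least useful for an ML engineer. "
    "Reply with the letters only, best first:\n" + "\n".join(articles)
)

def rank_with(model: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce run-to-run noise so differences reflect the model
    )
    return resp.choices[0].message.content.strip()

for model in ("gpt-4o", "gpt-4o-mini"):
    print(model, "->", rank_with(model))
```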
GLM-4.5-Air — Full-context benchmarks of GLM-4.5-Air (Q4) compare a Strix Halo machine against a dual RTX 3090 setup, with logs of startup, eval, and total times showing how hardware and context size shift throughput. [4]
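For readers who want to reduce such logs to comparable numbers, the sketch below converts startup, prompt-eval, and generation timings into tokens-per-second figures; the timing values are placeholders, not the post's measurements.

```python
# Sketch: turn the kind of timings the logs report (startup, prompt eval,
# generation) into throughput figures for comparing two rigs.
# All numbers below are placeholders, not the post's actual results.
from dataclasses import dataclass

@dataclass
class Run:
    name: str
    prompt_tokens: int
    prompt_eval_ms: float  # prefill: time spent processing the prompt
    gen_tokens: int
    gen_eval_ms: float     # decode: time spent generating new tokens
    startup_ms: float      # model load / warmup before any tokens

    def summary(self) -> str:
        prefill_tps = self.prompt_tokens / (self.prompt_eval_ms / 1000)
        decode_tps = self.gen_tokens / (self.gen_eval_ms / 1000)
        total_s = (self.startup_ms + self.prompt_eval_ms + self.gen_eval_ms) / 1000
        return (f"{self.name}: prefill {prefill_tps:.1f} tok/s, "
                f"decode {decode_tps:.1f} tok/s, total {total_s:.1f} s")

# Placeholder runs, illustrating how full-context prefill dominates total time.
for run in (
    Run("Strix Halo",  prompt_tokens=32768, prompt_eval_ms=240_000,
        gen_tokens=512, gen_eval_ms=64_000, startup_ms=20_000),
    Run("2x RTX 3090", prompt_tokens=32768, prompt_eval_ms=60_000,
        gen_tokens=512, gen_eval_ms=25_000, startup_ms=15_000),
):
    print(run.summary())
```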
Concurrency and fast batching — Static batching inflates tail (p99) latency, and naive KV-cache allocation wastes GPU memory. The fixes point to asynchronous request handling plus continuous batching with PagedAttention, a pattern seen in vLLM, Hugging Face TGI, and TensorRT-LLM. [5]
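As a rough illustration of firing concurrent requests, the sketch below sends a batch of simultaneous requests to an OpenAI-compatible endpoint (such as one served by vLLM or TGI) and reports the latency spread; the base URL, model name, and prompt are assumptions about a local deployment, not details from the post.

```python
# Sketch: fire N concurrent requests at an OpenAI-compatible serving endpoint
# and look at the latency spread. base_url, model, and prompt are placeholders.
import asyncio
import statistics
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def one_request(i: int) -> float:
    start = time.perf_counter()
    await client.chat.completions.create(
        model="my-local-model",  # placeholder: whatever the server is serving
        messages=[{"role": "user", "content": f"Summarize request {i} in one line."}],
        max_tokens=64,
    )
    return time.perf_counter() - start

async def main(n: int = 32) -> None:
    latencies = sorted(await asyncio.gather(*(one_request(i) for i in range(n))))
    p50 = statistics.median(latencies)
    p99 = latencies[min(len(latencies) - 1, int(0.99 * len(latencies)))]
    print(f"p50 {p50:.2f}s  p99 {p99:.2f}s over {n} concurrent requests")

asyncio.run(main())
```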
Closing thought: as benchmarks migrate from raw numbers to real-world constraints, speed, cost, and context window management will keep steering which models teams actually choose.
References
[1] Ask HN: What are most up-to-date LLM Benchmarks for Agentic Coding
A Hacker News user seeks current benchmarks comparing LLMs on speed, quality, and cost for coding and tool use.
[2] JanitorBench: A new LLM benchmark for multi-turn chats
JanitorBench proposes a benchmark for multi-turn chats to evaluate the performance and capabilities of language models.
[3] Comparing GPT-4o vs. GPT-4o-Mini: How Different AI Models Rank the Same Content
An experiment compares the two models' article rankings, revealing differing evaluation criteria between GPT-4o and GPT-4o-mini; includes code and logs.
[4] Benchmark Results: GLM-4.5-Air (Q4) at Full Context on Strix Halo vs. Dual RTX 3090
GLM-4.5-Air benchmarks compare Strix Halo to dual RTX 3090, covering offload strategies, PCIe bandwidth, VRAM limits, Vulkan vs. CUDA, and the ongoing llama.cpp vs. vLLM debate.
[5] Firing concurrent requests at LLM
Discusses concurrency, static vs. continuous batching, and PagedAttention; tools like vLLM, HF TGI, and TensorRT-LLM substantially improve latency, throughput, and resource utilization.