Standardized evaluation is breaking out of the lab and into real-world LLM work. Scorecard and SWE-rebench-style benchmarks are making concrete metrics the default, not an afterthought. These tools promise reproducible scoring and faster iteration. [1]
Drawing on Waymo’s drive-sim approach, Scorecard lets you run LLM-as-judge evals on agent workflows, checking tool usage, multi-step reasoning, and task completion in CI/CD or in a playground. OpenTelemetry traces let you follow a failure back to where the reasoning loop went wrong, and the platform is built for collaboration on shared datasets and evaluation metrics. [1]
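The underlying pattern is simple even without Scorecard's SDK: a judge model grades an agent transcript against the task and the expected tool call, and the CI step fails on a negative verdict. The sketch below is a generic illustration using the OpenAI Python SDK; the prompt, judge model, and JSON fields are assumptions for the example, not Scorecard's actual API.

```python
# Minimal LLM-as-judge check wired into CI.
# Illustrative sketch only; NOT Scorecard's API. Assumes the OpenAI Python SDK
# and an OPENAI_API_KEY in the environment.
import json
import sys

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an agent transcript.
Task: {task}
Required tool: {required_tool}
Transcript:
{transcript}
Respond with JSON: {{"task_completed": true or false, "used_required_tool": true or false, "reason": "short explanation"}}"""


def judge(task: str, transcript: str, required_tool: str) -> dict:
    """Ask a judge model whether the agent finished the task and used the required tool."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical judge model for the example
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                task=task, transcript=transcript, required_tool=required_tool
            ),
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)


if __name__ == "__main__":
    # Toy single-case run; a real CI job would loop over a shared eval dataset.
    verdict = judge(
        task="Look up the weather in Paris and report it in Celsius.",
        transcript="Agent called get_weather(city='Paris') -> 18C; replied '18 degrees Celsius in Paris.'",
        required_tool="get_weather",
    )
    print(verdict)
    # Fail the pipeline step if the judge says the task was not completed.
    sys.exit(0 if verdict.get("task_completed") else 1)
```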
On the SWE-rebench front, the leaderboard now covers 49 fresh PR bug-fix tasks from September 2025. Claude Sonnet 4.5 tops pass@5 at 55.1%, including some instances no other model solves. GPT-5-Codex, GLM, Qwen, and Kimi models also appear, and Qwen3-Coder stands out as the strongest open-source performer, a sign of momentum for open-source entrants. [2]
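For context, pass@5 is the fraction of tasks solved by at least one of five attempts. The standard unbiased estimator from the Codex/HumanEval paper computes it from n sampled attempts per task, of which c passed; the snippet below is a generic sketch of that formula, not SWE-rebench's own scoring code.

```python
# Generic pass@k estimator (unbiased formula from the Codex/HumanEval paper).
# Sketch for context; not SWE-rebench's actual scoring pipeline.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes,
    given n total samples of which c passed."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws without including a pass
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 attempts per task, 3 passed -> estimated pass@5
print(round(pass_at_k(n=10, c=3, k=5), 3))  # ~0.917
```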
Together, these efforts push practitioners to judge model outputs against real tasks, not just marketing claims. Watch how formal evals evolve and shape deployments before big launches. [2]
References
[1] Show HN: Scorecard – Evaluate LLMs like Waymo simulates cars. Scorecard for automated LLM evaluation: judge workflows, trace failures, share datasets and metrics; early customers; mitigating non-deterministic bugs in practice.
[2] We tested Claude Sonnet 4.5, GPT-5-Codex, Qwen3-Coder, GLM and 25+ other models on fresh SWE-Bench-like tasks from September 2025. SWE-rebench leaderboard updates; multiple models tested; Claude Sonnet 4.5 leads; discussion of GLM 4.6, Qwen3-Coder, and agent systems.