
Benchmarking LLMs Across Tasks: Tax, OCR, and More Highlight Strengths and Blind Spots


LLMs face real-world tests: taxes, OCR, and multi-system benchmarks reveal sharp strengths and stubborn blind spots.

TaxCalcBench shows that state-of-the-art models correctly calculate fewer than a third of federal returns, even on a simplified sample. The takeaway: basic arithmetic and policy reasoning still trip up models; commenters say scaffolding and tool access, such as a calculator, could lift accuracy [1].
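The "calculator tool" suggestion boils down to delegating bracket math to deterministic code instead of asking the model to generate the arithmetic. A minimal sketch of such a tool follows; the bracket thresholds and rates here are illustrative, not real IRS figures:

```python
# Sketch of the kind of deterministic calculator tool commenters suggest
# exposing to a model, so marginal-rate math is exact rather than generated.
# Thresholds and rates below are illustrative, not actual IRS brackets.
BRACKETS = [(0, 0.10), (11_000, 0.12), (44_725, 0.22), (95_375, 0.24)]

def progressive_tax(taxable_income: float) -> float:
    """Apply each marginal rate to its slice of income and sum the slices."""
    tax = 0.0
    for i, (lower, rate) in enumerate(BRACKETS):
        upper = BRACKETS[i + 1][0] if i + 1 < len(BRACKETS) else float("inf")
        if taxable_income <= lower:
            break  # income never reaches this bracket
        tax += (min(taxable_income, upper) - lower) * rate
    return round(tax, 2)
```

A model wired to a tool like this only has to decide *which* figures to pass in; the arithmetic itself is no longer a failure mode.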

PaddleOCR-VL is touted as a top OCR engine, but results hinge on image quality: inputs around 1080p tend to work best, while at 4K it misses text and can hallucinate. It does support vertical text and about 100 languages, though caveats remain [2].

SGLang vs. TabbyAPI and vLLM benchmarks show big speed gaps: on one rig, SGLang hits 32–37 tokens/sec on a 123B model while TabbyAPI runs slower, and an 8-bit KAT-Dev 32B tops out around 61.5 tokens/sec. The practical takeaway: the inference framework matters as much as model size [3].
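For readers wanting to reproduce figures like these on their own rigs, the measurement is usually just tokens counted over wall-clock time against a streaming endpoint. A framework-agnostic sketch:

```python
import time

# Framework-agnostic sketch of how tokens/sec figures like those cited are
# typically measured: count streamed tokens over elapsed wall-clock time.
def measure_throughput(stream) -> float:
    """`stream` is any iterable yielding one generated token per item."""
    start = time.perf_counter()
    n_tokens = sum(1 for _ in stream)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed if elapsed > 0 else float("inf")
```

Hook `stream` up to whatever your serving framework exposes (an SSE token stream, a generator, etc.); comparing frameworks fairly also means matching quantization, batch size, and context length.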

A Gemma thread debates whether an LLM can generate novel hypotheses from next-token prediction alone. Some see value in the process when cross-checked with tools; others warn the results can be gibberish without careful validation [4].

BankToBudget demonstrates a practical use: it turns messy bank exports into a monthly budget, with GPT-5 parsing and categorizing transactions behind the scenes [5].
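The core of that pipeline is unglamorous aggregation: parse export rows, assign a category, and sum per month. A minimal sketch is below; the column names and keyword rules are hypothetical, and in the actual tool an LLM does the categorizing:

```python
import csv
import io
from collections import defaultdict

# Hypothetical keyword rules standing in for the LLM categorization step.
RULES = {"grocer": "Groceries", "rent": "Housing", "uber": "Transport"}

def categorize(description: str) -> str:
    """Pick the first matching category, defaulting to 'Other'."""
    d = description.lower()
    return next((cat for kw, cat in RULES.items() if kw in d), "Other")

def monthly_budget(csv_text: str) -> dict:
    """Sum amounts per (month, category); assumes date/description/amount columns."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        month = row["date"][:7]  # "YYYY-MM"
        totals[(month, categorize(row["description"]))] += float(row["amount"])
    return dict(totals)
```

Swapping the rule table for a model call is the interesting part; the accuracy questions raised in the thread live entirely in that swap.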

Closing thought: task-specific tooling and robust evaluation will shape what actually scales in 2025 and beyond.

References

[1]
HackerNews

TaxCalcBench: Evaluating Frontier Models on the Tax Calculation Task

TaxCalcBench compares frontier LLMs on tax calculations; Gemini beats Claude; discussions cover tools, risks, reliability, and policy implications.

[2]
Reddit

PaddleOCR-VL, is better than private models

PaddleOCR praised; peers compare Qwen3-VL, Gemini, and Mistral; discussion covers vertical text, handwriting, testing, GPU issues, and strengths relative to other models.

[3]
Reddit

SGLang vs TabbyAPI & vLLM Benchmark Increases (Multi-GPU + Single-GPU)

Benchmarks compare SGLang with TabbyAPI and vLLM across multi-GPU systems, highlighting speedups, quantization, NVLink, virtualization, and power tradeoffs in practice.

[4]
Reddit

but can someone correct me, I'm curious how an LLM can generate new hypotheses if it is based only on the prediction of the next token, isn't gemma a simple LLM trained on medical data ?

Debate on whether LLMs truly reason or merely predict next tokens; includes mechanisms, experiments, and novelty claims about real understanding.

[5]
HackerNews

Show HN: BankToBudget – Instantly turn your bank exports into a monthly budget

Shows practical use of GPT-5 in parsing bank data and categorizing transactions; seeks feedback on accuracy and features for improvement.

