Are LLMs truly reasoning, or just predicting next tokens? Real-world tests are fueling the debate as researchers push Gemma-based ideas into biology and medicine [1].
Gemma-based hypothesis testing
- C2S-Scale-Gemma-2-27B, a 27B-parameter model built on Gemma, which Google reports generated a novel cancer hypothesis; the model and resources live on Hugging Face and GitHub, and a bioRxiv preprint discusses the results [2].
- A companion blog post and the preprint lay out the approach and data, sparking both excitement and caution [2].
Counsel Health
- In the medical QA space, Counsel Health pairs LLMs with licensed physicians who supervise care, backed by a $25M Series A led by Andreessen Horowitz (a16z) and GV. Commenters note that the human-in-the-loop matters more for safety and practical care delivery than flashy AI chatter [3].
Broader critiques and testing
- Critics argue LLMs are often just statistical next-token guessers, even as other threads celebrate how the models riff on ideas when given a thread to pull on [1].
- Still, there are concrete levers to test: models can be asked for explicit confidence bounds, and verifying a proposed answer is often cheaper than generating one [2]. In tax work, TaxCalcBench shows that modern models struggle to file near-perfect returns, highlighting where scaffolding helps more than magic [5].
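The "verification is cheaper than generation" point can be made concrete with a minimal sketch. Everything below is hypothetical: the bracket numbers are invented, and the deterministic checker stands in for whatever reference computation a real scaffold (such as a tax engine behind TaxCalcBench-style evaluation) would use.

```python
# Toy illustration of the verify-vs-generate asymmetry: producing an answer
# may take an expensive LLM call, but checking a proposed answer against a
# deterministic rule is cheap. Brackets and names here are made up.

BRACKETS = [(0, 0.10), (10_000, 0.20), (40_000, 0.30)]  # (threshold, rate)

def toy_tax(income: float) -> float:
    """Deterministic reference: progressive tax over invented brackets."""
    tax = 0.0
    for i, (threshold, rate) in enumerate(BRACKETS):
        upper = BRACKETS[i + 1][0] if i + 1 < len(BRACKETS) else float("inf")
        if income > threshold:
            tax += (min(income, upper) - threshold) * rate
    return tax

def verify(income: float, proposed: float, tol: float = 0.01) -> bool:
    """Cheap check of a (possibly LLM-generated) answer against the rule."""
    return abs(toy_tax(income) - proposed) <= tol

# verify(50_000, toy_tax(50_000))  -> True
# verify(50_000, 9_000.0)          -> False
```

The scaffold pattern is then: let the model propose, let the checker accept or reject, and only escalate rejected answers for another attempt.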
Practical accuracy and tooling
- Tools and scaffolds matter. For example, BankToBudget uses GPT-5 under the hood to interpret messy bank exports and suggest a monthly budget, demonstrating tangible wins when you pair models with focused tasks [4].
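The "focused task" pattern can be sketched in a few lines. This is not BankToBudget's implementation; the merchant keywords, categories, and CSV layout below are invented for illustration. The idea is that deterministic rules handle the easy transactions, and only ambiguous rows would be escalated to a model.

```python
# Hypothetical sketch: rule-based categorization of a bank export, with an
# explicit "needs_review" bucket that a real tool might hand to an LLM.
import csv
import io

RULES = {
    "grocery": ["whole foods", "aldi", "kroger"],
    "transport": ["uber", "shell", "metro"],
    "subscriptions": ["netflix", "spotify"],
}

def categorize(description: str) -> str:
    """Match a transaction description against keyword rules."""
    d = description.lower()
    for category, keywords in RULES.items():
        if any(k in d for k in keywords):
            return category
    return "needs_review"  # would be escalated to a model in a real tool

def monthly_summary(csv_text: str) -> dict:
    """Sum spending per category from a simple 'description,amount' export."""
    totals: dict = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        cat = categorize(row["description"])
        totals[cat] = totals.get(cat, 0.0) + float(row["amount"])
    return totals

sample = "description,amount\nALDI STORE 42,35.20\nNetflix.com,15.99\n"
# monthly_summary(sample) -> {'grocery': 35.2, 'subscriptions': 15.99}
```

Splitting the work this way keeps the model's job narrow, which is exactly where the digest's "tangible wins" tend to show up.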
Takeaway: the line between prediction and reasoning is still being drawn, one real-world test at a time. Watch how benchmarks, guardrails, and clinician oversight evolve in 2025 and beyond.
References
[1] Ask HN: How can an LLM generate new hypotheses if it is based only on next-token prediction? Isn't Gemma a simple LLM trained on medical data?
    Debate on whether LLMs truly reason or merely predict next tokens; includes mechanisms, experiments, and novelty claims about real understanding.
[2] Google C2S-Scale 27B (based on Gemma), built with Yale, generated a novel hypothesis about cancer cellular behavior; model and resources are now on Hugging Face and GitHub.
    Discusses the Gemma-2 27B LLM used for cancer hypothesis generation and drug screening; debates novelty, validation, and usefulness of LLMs in biology.
[3] Show HN: Counsel Health ($25M Series A) – LLMs for Medical QA and Chat with MDs
    Launch of a health AI platform with physician oversight and rapid care; discusses moat, records, and post-training updates.
[4] Show HN: BankToBudget – Instantly turn your bank exports into a monthly budget
    Practical use of GPT-5 for parsing bank data and categorizing transactions; seeks feedback on accuracy and features for improvement.
[5] TaxCalcBench: Evaluating Frontier Models on the Tax Calculation Task
    Compares frontier LLMs on tax calculations; Gemini beats Claude; discussion covers tools, risks, reliability, and policy implications.