In-Field LLMs: What Domain-Specific Deployments Teach Us About Accuracy, Oversight, and Updates

In-field LLMs are teaching hard lessons about trust. Across finance, healthcare, tax, and cancer biology, accuracy feedback loops, physician oversight, and post-training updates shape how people rely on AI in real-world products ^[1]^[2]^[3]^[4].

BankToBudget uses GPT-5 to turn messy bank exports into a monthly budget, with a Laravel backend that cleans the data before categorization. Its author is asking for feedback on the accuracy of the categories and the clarity of the results ^[1].
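The clean-then-categorize step described above can be sketched as follows. This is a minimal illustration, not BankToBudget's actual code: the real backend is Laravel, and the column names (`date`, `description`, `amount`) and the function name `clean_rows` are assumptions made for this example.

```python
import csv
import io
from decimal import Decimal


def clean_rows(raw_csv: str) -> list[dict]:
    """Normalize rows from a hypothetical bank CSV export before categorization."""
    cleaned = []
    seen = set()
    for row in csv.DictReader(io.StringIO(raw_csv)):
        date = (row.get("date") or "").strip()
        # Collapse repeated whitespace and normalize case so duplicates match.
        description = " ".join((row.get("description") or "").split()).upper()
        # Strip currency symbols and thousands separators before parsing.
        amount_text = (row.get("amount") or "").replace("$", "").replace(",", "").strip()
        if not description or not amount_text:
            continue  # skip blank or incomplete rows
        amount = Decimal(amount_text)
        key = (date, description, amount)
        if key in seen:
            continue  # drop exact duplicates, common in overlapping exports
        seen.add(key)
        cleaned.append({"date": date, "description": description, "amount": amount})
    return cleaned
```

Only the cleaned rows would then be passed to the model for categorization, which keeps formatting noise out of the prompt and makes accuracy feedback easier to trace to a specific transaction.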

Counsel Health embeds LLMs into care under the oversight of licensed physicians, aiming for safer, faster care. The company raised a $25M Series A led by Andreessen Horowitz and GV, signaling a push to build a defensible moat in health AI ^[2].

TaxCalcBench tests frontier models on tax calculation; the researchers note that state-of-the-art models correctly calculate fewer than a third of returns. Commenters suggest that giving models a calculator tool could help. On this benchmark, Gemini models outperformed Claude ^[3].
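To make the calculator-tool idea concrete, here is a minimal sketch of a deterministic function that a model could call instead of doing the arithmetic itself. The brackets, rates, and the name `progressive_tax` are hypothetical and chosen for illustration only; they do not reflect any real tax schedule or anything used in TaxCalcBench.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical brackets for illustration only: (upper bound, marginal rate).
# An upper bound of None marks the top, open-ended bracket.
BRACKETS = [
    (Decimal("10000"), Decimal("0.10")),
    (Decimal("40000"), Decimal("0.20")),
    (None, Decimal("0.30")),
]


def progressive_tax(taxable_income: Decimal) -> Decimal:
    """Compute tax under the hypothetical marginal brackets above."""
    income = max(taxable_income, Decimal("0"))
    tax = Decimal("0")
    lower = Decimal("0")
    for upper, rate in BRACKETS:
        if upper is None or income < upper:
            # Income ends inside this bracket: tax only the portion above `lower`.
            tax += (income - lower) * rate
            break
        # Income covers the whole bracket: tax its full width and move on.
        tax += (upper - lower) * rate
        lower = upper
    return tax.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

In this setup the model would extract the inputs from the return and leave the arithmetic to exact decimal code, so any remaining errors come from interpretation rather than calculation.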

Google's C2S-Scale-Gemma-2-27B, a Gemma-based model built with Yale, is available on Hugging Face and GitHub alongside a bioRxiv preprint. The effort generated a novel hypothesis about a cancer therapy pathway, but experts stress that humans must validate the outputs; some note that explicit confidence bounds help with verification ^[4].

Real-world AI needs ongoing updates and human-in-the-loop oversight to stay trusted and defensible.

References

[1]

HackerNews

Show HN: BankToBudget – Instantly turn your bank exports into a monthly budget

Shows practical use of GPT-5 in parsing bank data and categorizing transactions; seeks feedback on accuracy and features for improvement.

[2]

HackerNews

Show HN: Counsel Health ($25M Series A) – LLMs for Medical QA and Chat with MDs

Launch of health AI platform with physician oversight, rapid care; discusses moat, records, post-training updates.

[3]

HackerNews

TaxCalcBench: Evaluating Frontier Models on the Tax Calculation Task

TaxCalcBench compares frontier LLMs on tax calculations; Gemini beats Claude; discussions cover tools, risks, reliability, and policy implications.

[4]

Google C2S-Scale 27B (based on Gemma) built with Yale generated a novel hypothesis about cancer cellular behavior - Model + resources are now on Hugging Face and GitHub

Discusses Gemma-2 27B LLM used for cancer hypothesis and drug screening; debates novelty, validation, and usefulness of LLMs in biology
