In-Field LLMs: What Domain-Specific Deployments Teach Us About Accuracy, Oversight, and Updates

In-field LLMs are teaching hard lessons about trust. Across finance, healthcare, tax, and cancer biology, accuracy feedback loops, physician oversight, and post-training updates shape how people rely on AI in real-world products [1][2][3][4].

BankToBudget uses GPT-5 under the hood to turn messy bank exports into a monthly budget; a Laravel backend cleans the data before categorization. Its author is asking for feedback on category accuracy, the clarity of results, and features to improve [1].
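To make the "clean before categorizing" step concrete, here is a minimal sketch in Python (not the project's Laravel code, which the source does not show). The column names `date`, `description`, and `amount`, the `DD/MM/YYYY` date format, and the function name `clean_rows` are all assumptions chosen for illustration.

```python
import csv
import io
from datetime import datetime
from decimal import Decimal


def clean_rows(raw_csv: str) -> list[dict]:
    """Normalize a messy export; rows that cannot be parsed are skipped."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        # Normalize header names and trim cell values.
        # DictReader stores overflow cells under the key None; drop those.
        row = {
            k.strip().lower(): (v or "").strip()
            for k, v in row.items()
            if k is not None
        }
        try:
            # Assumed date format; a real export may differ.
            date = datetime.strptime(row["date"], "%d/%m/%Y").date()
            # Strip currency symbol and thousands separators before parsing.
            amount = Decimal(row["amount"].replace("$", "").replace(",", ""))
        except (KeyError, ValueError, ArithmeticError):
            continue  # malformed row: leave it out rather than guess
        cleaned.append(
            {
                "date": date,
                # Collapse runs of whitespace inside the description.
                "description": " ".join(row.get("description", "").split()),
                "amount": amount,
            }
        )
    return cleaned
```

Only the cleaned rows would then be passed to the model for categorization, so category errors can be traced to the model rather than to parsing noise.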

Counsel Health embeds LLMs into care under the oversight of licensed physicians, aiming for safer, faster care. The company raised a $25M Series A led by Andreessen Horowitz and GV, signaling a push to build a moat in health AI [2].

TaxCalcBench tests frontier models on tax calculation; the researchers note that state-of-the-art models correctly calculate fewer than a third of returns. A calculator tool could help, and Gemini models outperformed Claude on this benchmark [3].
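One way to read the calculator-tool idea: the model extracts the inputs, and a deterministic function does the arithmetic. The sketch below is an illustration only. The bracket thresholds and rates are made up (not real tax law, and not from TaxCalcBench), and `tax_owed` is a hypothetical tool name.

```python
from decimal import Decimal

# Hypothetical progressive brackets: (upper bound, marginal rate).
# Illustrative numbers only; NOT real tax law.
BRACKETS = [
    (Decimal("10000"), Decimal("0.10")),
    (Decimal("40000"), Decimal("0.20")),
    (None, Decimal("0.30")),  # None marks the open-ended top bracket
]


def tax_owed(taxable_income: Decimal) -> Decimal:
    """Compute tax under the hypothetical brackets using exact decimal math."""
    income = max(taxable_income, Decimal("0"))  # negative income owes nothing
    owed = Decimal("0")
    lower = Decimal("0")
    for upper, rate in BRACKETS:
        if upper is None or income < upper:
            # Income ends inside this bracket: tax the remainder and stop.
            owed += (income - lower) * rate
            break
        # Income spans the whole bracket: tax the full width and move up.
        owed += (upper - lower) * rate
        lower = upper
    return owed.quantize(Decimal("0.01"))
```

With these made-up brackets, `tax_owed(Decimal("50000"))` returns `Decimal("10000.00")` (1,000 + 6,000 + 3,000). The design choice is to keep the model away from multi-step arithmetic, where small slips compound across a return.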

Gemma-based work, including Google's C2S-Scale-Gemma-2-27B built with Yale, is available on Hugging Face and GitHub, alongside a bioRxiv preprint. The effort generated a novel cancer therapy pathway hypothesis, but experts stress that humans must validate the outputs; some note that explicit confidence bounds help with verification [4].

Real-world AI needs ongoing updates and human-in-the-loop oversight to stay trusted and defensible.

References

[1]
HackerNews

Show HN: BankToBudget – Instantly turn your bank exports into a monthly budget

Shows practical use of GPT-5 in parsing bank data and categorizing transactions; seeks feedback on accuracy and features for improvement.

[2]
HackerNews

Show HN: Counsel Health ($25M Series A) – LLMs for Medical QA and Chat with MDs

Launch of health AI platform with physician oversight, rapid care; discusses moat, records, post-training updates.

[3]
HackerNews

TaxCalcBench: Evaluating Frontier Models on the Tax Calculation Task

TaxCalcBench compares frontier LLMs on tax calculations; Gemini beats Claude; discussions cover tools, risks, reliability, and policy implications.

[4]
Reddit

Google C2S-Scale 27B (based on Gemma) built with Yale generated a novel hypothesis about cancer cellular behavior - Model + resources are now on Hugging Face and GitHub

Discusses Gemma-2 27B LLM used for cancer hypothesis and drug screening; debates novelty, validation, and usefulness of LLMs in biology

