From ULMFiT to Claude Sonnet 4.5, today’s LLM debates ride a long arc. The latest FamilyBench results sharpen who’s ahead in complex reasoning tasks [2].
Foundations that still matter — Long before today’s giants, ULMFiT, ELMo, and BERT showed you could pretrain a model once and fine-tune it for downstream tasks. ULMFiT’s three-stage recipe—general-domain LM pretraining, target-task LM fine-tuning, then classifier fine-tuning—echoes through today’s methods [1]. Dai and Le (2015) had already fine-tuned a pretrained model for downstream tasks before ULMFiT, underscoring the shift toward general-purpose pretrained models [1]. The push to blend attention with memory, via Memory Networks and Neural Turing Machines, foreshadowed the transformer era [1]. That history helps explain why the embeddings-versus-bag-of-words debates mattered, and why training tricks from the early days still surface in modern work [1].
Today’s Benchmarks and Providers — The FamilyBench leaderboard tests relational reasoning over a 400-person, 10-generation family tree (~18k tokens of context) with 189 questions. Top performers include Gemini 2.5 Pro (81.48% accuracy), Claude Sonnet 4.5 (77.78%), DeepSeek R1 (75.66%), and GLM 4.6 (74.60%) [2]. Newcomers on the leaderboard show how quickly gains spread across model families [2].
Open vs Closed AI and Governance — Community chatter highlights a rift: the critique that OpenAI gates access contrasts with praise for OSS20B, which is touted as strong on real-world tasks, often eclipsing many 30B rivals [3]. The debate isn’t just about speed or scale; it’s about safety, openness, and who sets the rules [3].
The throughline is clear: foundational ideas built the scaffolding, while benchmarks and governance shape where the field goes next.
References
[1] A History of Large Language Models. Overview of LLM lineage; cites ULMFiT, CoVe, ELMo, and BERT; discusses embeddings, attention, training tricks, RLHF, transformers, and memory-network debates.
[2] [Update] FamilyBench: New models tested - Claude Sonnet 4.5 takes 2nd place, Qwen 3 Next breaks 70%, new Kimi weirdly below the old version, same for GLM 4.6. Tests LLMs on the FamilyBench family-tree task; reports Claude Sonnet 4.5 gains, Qwen 3 Next 80B progress, and strong GLM 4.6 results.
[3] Biggest Provider for the community for at moment thanks to them. Compares OpenAI and open-source models (OSS20B, Qwen, DeepSeek, Claude); open versus closed weights; geopolitical perspectives and market implications.