Attention-free designs are sparking a rethink of long-context LLMs. Brumby-14B-Base claims to sidestep the attention bottleneck entirely, replacing traditional attention with power retention layers and promising long-context speedups [1].
Attention-Free vs. Linear/Hybrid Approaches
Brumby-14B-Base swaps attention for power retention layers; the claim is a fully 'attention-free' model, with kernels reportedly hundreds of times faster on long contexts [1]. Still, reviewers note that a regular attention path appears to be used during decoding, which fuels skepticism about the label [2].
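To make the efficiency argument concrete, the sketch below contrasts a fixed-size recurrent state with the growing KV cache of softmax attention. It is a generic linear-recurrence illustration, not Manifest AI's actual power-retention kernel; the decay factor, dimensions, and function names are assumptions for illustration only.

```python
# Illustrative sketch only: a generic recurrent "retention"-style update, NOT
# Manifest AI's power-retention kernel. It shows why attention-free layers can
# decode long contexts cheaply: the whole history is folded into a fixed-size
# state S, so each decoding step costs O(d^2) regardless of context length,
# whereas softmax attention attends over all T cached keys/values per token.
import torch

d = 64                      # head dimension (assumed for illustration)
S = torch.zeros(d, d)       # fixed-size recurrent state replaces the KV cache

def decode_step(S, k, v, q, decay=0.99):
    """One decoding step against a fixed-size state.

    S      : (d, d) running summary of all past (key, value) pairs
    k, v, q: (d,) current key, value, query
    decay  : scalar forgetting factor (hypothetical; real kernels use learned gates)
    """
    S = decay * S + torch.outer(k, v)   # fold the new token into the state
    o = S.T @ q                         # read out with the query
    return S, o

# Per-step cost stays constant as the context grows.
for _ in range(5):
    k, v, q = torch.randn(3, d)
    S, o = decode_step(S, k, v, q)
```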
Kimi Linear and KDA: Long-Context by Efficiency
Moonshot AI's Kimi-Linear-48B-A3B-Instruct uses Kimi Delta Attention (KDA), a linear attention kernel that combines a gated DeltaNet-style update with finite-state memory to boost long-context throughput [3]. The approach cuts KV cache memory by up to 75% and delivers up to 6x decoding speed at context lengths up to 1M tokens, with two checkpoints publicly released [3]. The model is open-source under the MIT license, with the kernel supported in code by the FLA library [3].
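For readers unfamiliar with delta-rule recurrences, the sketch below shows the naive per-token form of a gated delta-rule update of the kind DeltaNet-style kernels build on. It is not the fused, chunked FLA kernel that Kimi Linear ships; the gating parameterization, shapes, and function name are assumptions chosen for clarity.

```python
# Schematic reference recurrence for a gated delta-rule update, in the spirit
# of DeltaNet-style linear attention. This is the naive per-token form, not the
# production kernel; it only illustrates that the memory is finite-size, which
# is why the KV cache (and long-context decoding cost) shrinks.
import torch

def gated_delta_step(S, q, k, v, beta, alpha):
    """
    S     : (d_k, d_v) finite-state memory
    q, k  : (d_k,) query and key (k assumed L2-normalized)
    v     : (d_v,) value
    beta  : scalar in (0, 1), write strength for the delta-rule correction
    alpha : (d_k,) per-channel forget gate in (0, 1); the exact gating
            parameterization here is an assumption, not KDA's
    """
    S = alpha.unsqueeze(-1) * S                  # decay old memory per channel
    pred = k @ S                                 # what the memory predicts for key k
    S = S + beta * torch.outer(k, v - pred)      # delta rule: correct the prediction
    o = q @ S                                    # read out with the query
    return S, o
```

Because S never grows with sequence length, decoding reads and writes a constant amount of state per token, which is the mechanism behind the reported cache and throughput gains.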
Industry Reality Check: Do We Still Need Full Attention?
MiniMax's pretraining lead argues that, in production, efficient attention still hasn't beaten full attention; compute budgets and broader stack trade-offs determine real-world throughput (TPS), price, and quality [4].
Bottom line: long-context viability without full attention is still a live debate, with attention-free and hybrid paths racing to prove themselves in hardware-constrained setups.
References
[1] Brumby-14B-Base: The Strongest Attention-Free Base Model
    Brumby-14B-Base claims competitive performance with no attention layers, using power retention layers; Transformer-style architecture; released on Hugging Face.
[2] manifestai releases Brumby-14B-Base weights, claims "attention free" and inference "hundreds of time faster" for long context
    Discusses the Brumby-14B-Base attention-free claim, long-context speed, the 'power retention' approach, skepticism, comparisons to attention and non-attention models, and local LLM inference.
[3] moonshotai/Kimi-Linear-48B-A3B-Instruct · Hugging Face
    Presents Kimi Linear, a hybrid linear attention LLM; claims efficiency, long-context superiority, open-sourcing, licensing, and comparisons to full-attention baselines.
[4] MiniMax pre-training lead explains why no linear attention
    The MiniMax-M2 lead explains why production favors full attention over linear attention; discusses compute limits, evaluation, and future efficiency trade-offs.