
Attention vs Non-Attention: Can Long Context Be Achieved Without Full Attention?

1 min read
215 words
Topics: Opinions on LLMs · Attention · Non-Attention

Attention-free designs are sparking a rethink of long-context LLMs. Brumby-14B-Base claims to sidestep the attention bottleneck entirely, replacing traditional attention with power retention layers and promising long-context speedups [1].

Attention-Free vs Linear/Hybrid Approaches

Brumby-14B-Base trades attention for power retention; the claim is "attention-free," with kernels reportedly hundreds of times faster on long contexts [1]. Still, reviewers note that a regular attention path sneaks back in for decoding, fueling skepticism [2].
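
The sources don't spell out the power retention math, but the general appeal of attention-free recurrent layers is easy to illustrate: context gets folded into a fixed-size state matrix instead of an ever-growing KV cache. The sketch below is a generic linear-recurrence layer in NumPy, not Brumby's actual architecture; the `decay` gate and dimensions are illustrative assumptions.

```python
import numpy as np

def retention_decode(queries, keys, values, decay=0.99):
    """queries/keys/values: arrays of shape (seq_len, d); returns (seq_len, d)."""
    d = queries.shape[-1]
    state = np.zeros((d, d))                    # fixed-size recurrent state
    outputs = []
    for q, k, v in zip(queries, keys, values):
        state = decay * state + np.outer(k, v)  # fold the new token into the state
        outputs.append(q @ state / np.sqrt(d))  # read out with the current query
    return np.stack(outputs)

# Memory for the state is O(d^2) no matter how long the context grows.
rng = np.random.default_rng(0)
T, d = 8, 16
out = retention_decode(rng.normal(size=(T, d)),
                       rng.normal(size=(T, d)),
                       rng.normal(size=(T, d)))
print(out.shape)  # (8, 16)
```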

Kimi Linear and KDA: Long Context Through Efficiency

Moonshot AI's Kimi-Linear-48B-A3B-Instruct uses Kimi Delta Attention (KDA), a linear attention kernel that combines a gated DeltaNet-style update with finite-state memory to boost long-context throughput [3]. The approach cuts KV-cache memory by up to 75% and delivers up to 6x faster decoding for contexts up to 1M tokens, with two checkpoints publicly released [3]. The model is open-source under the MIT license, with kernel support provided through the FLA (flash-linear-attention) library [3].
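
For intuition on where a roughly 75% KV-cache saving can come from in a hybrid stack: if three out of every four layers use a fixed-size linear-attention state and only the remaining layers keep a conventional per-token KV cache, the cache shrinks to about a quarter of the all-full-attention baseline. The numbers below (layer count, KV heads, head dim, FP16 cache) are illustrative assumptions for a back-of-envelope estimate, not figures from the Kimi Linear release.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # K and V per layer: 2 tensors of (kv_heads, seq_len, head_dim) elements
    return layers * 2 * kv_heads * head_dim * seq_len * bytes_per_elem

SEQ = 1_000_000  # 1M-token context
full   = kv_cache_bytes(layers=48, kv_heads=8, head_dim=128, seq_len=SEQ)
hybrid = kv_cache_bytes(layers=48 // 4, kv_heads=8, head_dim=128, seq_len=SEQ)  # only 1 in 4 layers caches KV

print(f"all full attention: {full / 2**30:6.1f} GiB")
print(f"3:1 hybrid        : {hybrid / 2**30:6.1f} GiB ({hybrid / full:.0%} of baseline)")
```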

Industry Reality Check: Do We Still Need Full Attention?

MiniMax's pretraining lead explains that, in production, efficient attention still hasn't beaten full attention; compute budgets and broader stack trade-offs matter for real-world TPS, price, and quality [4].
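
One back-of-envelope way to see the compute-budget argument (an illustration of the trade-off, not math from the cited post): per generated token, full attention adds a term that grows with context length on top of the fixed projection and FFN matmuls, so it only dominates per-token compute at long contexts. The `d_model` and FLOP formulas below are rough illustrative assumptions.

```python
def per_token_layer_flops(context_len, d_model):
    projections = 8 * d_model**2              # Q, K, V, O matmuls (~2*d*d each)
    ffn = 16 * d_model**2                     # 4x-expansion MLP, up + down
    attention = 4 * context_len * d_model     # scores over the cache + value mixing
    return projections + ffn + attention

d = 4096
for L in (4_096, 32_768, 262_144):
    total = per_token_layer_flops(L, d)
    attn_share = 4 * L * d / total
    print(f"context {L:>7,} tokens: attention ~{attn_share:.0%} of per-token FLOPs")
```

At short-to-medium contexts the quadratic term is a modest slice of the budget, which is one reason efficient-attention wins don't automatically translate into better production throughput or quality per dollar.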

Bottom line: long-context viability without full attention is still a live debate, with attention-free and hybrid paths racing to prove themselves in hardware-constrained setups.


References

[1] HackerNews. "Brumby-14B-Base: The Strongest Attention-Free Base Model." Claims competitive performance with no attention layers, using power retention layers; Transformer-style architecture; weights released on Hugging Face.

[2] Reddit. "manifestai releases Brumby-14B-Base weights, claims 'attention free' and inference 'hundreds of time faster' for long context." Discusses the Brumby-14B-Base attention-free claim, long-context speed, the 'power retention' approach, skepticism, comparisons to attention and non-attention models, and local LLM inference.

[3] Reddit. "moonshotai/Kimi-Linear-48B-A3B-Instruct · Hugging Face." Presents Kimi Linear, a hybrid linear attention LLM; claims efficiency, long-context superiority, open-sourcing, licensing, and comparisons to full-attention baselines.

[4] Reddit. "Minimax pre-training lead explains why no linear attention." The MiniMax-M2 lead explains why production favors full attention over linear attention; discusses compute limits, evaluation, and future efficiency trade-offs.
