CALM is rewriting how LLMs generate language: no more token-by-token marching. CALM (Continuous Autoregressive Language Models) replaces next-token prediction with next-vector prediction, powered by a high-fidelity autoencoder that compresses a chunk of K tokens into a single continuous vector and reconstructs them with over 99.9% accuracy [3].
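To make the chunking concrete, here is a minimal PyTorch-style sketch of a chunk autoencoder. The class name, layer choices, and dimensions are illustrative assumptions rather than the paper's exact architecture; the point is only the shape contract of encoding K tokens into one vector and decoding them back.

```python
import torch
import torch.nn as nn

class ChunkAutoencoder(nn.Module):
    """Compress a chunk of K tokens into one continuous vector and reconstruct it.
    Hypothetical sketch: layer choices and sizes are assumptions, not CALM's exact design."""

    def __init__(self, vocab_size=32000, k=4, d_model=512, d_latent=128):
        super().__init__()
        self.k = k
        self.vocab_size = vocab_size
        self.embed = nn.Embedding(vocab_size, d_model)
        # Encoder: fuse the K token embeddings into a single latent vector.
        self.encoder = nn.Sequential(
            nn.Linear(k * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_latent),
        )
        # Decoder: expand the latent back into K sets of token logits.
        self.decoder = nn.Sequential(
            nn.Linear(d_latent, d_model),
            nn.GELU(),
            nn.Linear(d_model, k * vocab_size),
        )

    def encode(self, token_ids):                  # (batch, K) -> (batch, d_latent)
        x = self.embed(token_ids)                 # (batch, K, d_model)
        return self.encoder(x.flatten(1))         # concatenate the K embeddings

    def decode(self, latent):                     # (batch, d_latent) -> (batch, K, vocab)
        return self.decoder(latent).view(-1, self.k, self.vocab_size)

    def forward(self, token_ids):
        logits = self.decode(self.encode(token_ids))
        recon = logits.argmax(dim=-1)             # greedy reconstruction of the K tokens
        return logits, recon
```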
Core idea: continuous vectors
Because CALM models continuous vectors rather than a discrete vocabulary, output probabilities can no longer be computed with a softmax, so the researchers built a likelihood-free toolkit for training, evaluation (the BrierLM metric), and sampling [3]. The big win: language becomes a sequence of continuous vectors, cutting the number of generative steps by a factor of K [3].
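For intuition on what "likelihood-free evaluation" means, here is a toy, sample-only Brier estimator in Python. The exact BrierLM construction is defined in the paper [3] and differs in its details; this sketch only shows that a Brier-style score can be estimated from model samples alone, with no access to probabilities.

```python
import random

def brier_estimate(samples, reference):
    """Unbiased sample-only estimate of the Brier score
    sum_x (p(x) - 1[x == reference])^2 for an unknown distribution p,
    using only i.i.d. samples from the model (no likelihoods needed).
    Lower is better. Illustrative sketch, not the exact BrierLM metric."""
    assert len(samples) >= 2
    # The collision rate over distinct sample pairs estimates sum_x p(x)^2.
    pairs = [(a, b) for i, a in enumerate(samples) for b in samples[i + 1:]]
    collision = sum(a == b for a, b in pairs) / len(pairs)
    # The fraction of samples matching the reference estimates p(reference).
    hit = sum(s == reference for s in samples) / len(samples)
    return collision - 2.0 * hit + 1.0

# Toy usage: a "model" that outputs one of three continuations.
model_samples = random.choices(["cat", "dog", "cow"], weights=[0.7, 0.2, 0.1], k=50)
print(brier_estimate(model_samples, reference="cat"))
```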
Industry momentum: Tencent + Tsinghua CALM
Tencent and Tsinghua released the CALM paper, showing how text latents from an autoencoder can drive a purely continuous generation path; community threads discuss pairing those latents with diffusion-style denoising and decoding back to text [4].
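A rough sketch of what such a continuous generation loop could look like follows. The `vector_model` and `chunk_decoder` interfaces are hypothetical placeholders for illustration, not the released implementation; the only claim taken from the source is the loop structure of predicting one vector per step and decoding it back into K tokens.

```python
import torch

@torch.no_grad()
def generate(vector_model, chunk_decoder, prompt_latents, num_chunks, k=4):
    """Autoregressive generation over continuous vectors instead of tokens.

    vector_model : maps a sequence of latent vectors to the next latent vector
                   (hypothetical interface standing in for CALM's generative head).
    chunk_decoder: maps one latent vector back to K token ids
                   (e.g. ChunkAutoencoder.decode followed by argmax).
    """
    latents = list(prompt_latents)                    # each item: (d_latent,) tensor
    token_ids = []
    for _ in range(num_chunks):
        history = torch.stack(latents).unsqueeze(0)   # (1, T, d_latent)
        next_vec = vector_model(history)[0]           # predict ONE vector per step
        latents.append(next_vec)
        # Each predicted vector expands back into K tokens, so the loop runs
        # roughly 1/K as many steps as token-by-token decoding.
        token_ids.extend(chunk_decoder(next_vec.unsqueeze(0))[0].tolist())
    return token_ids
```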
Memory-focused architectures
Discussions highlight architectural strategies for curbing peak memory: leaner attention, tiling, and dynamic loading, along with debate over preserving capacity while cutting memory use. The trend points toward memory-efficient LLMs, not just bigger models [5].
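As a concrete example of the "tiling" idea, here is a hedged sketch of query-tiled attention in PyTorch. It is not taken from any specific library, and production kernels such as FlashAttention tile further and fuse the softmax; the sketch only shows how tiling caps peak activation memory.

```python
import torch

def tiled_attention(q, k, v, tile_size=256):
    """Compute softmax(QK^T / sqrt(d)) V one query tile at a time.

    Instead of materializing the full (seq, seq) attention matrix, only a
    (tile, seq) slice exists at any moment, so peak memory scales with
    tile_size rather than with sequence length squared.
    """
    scale = q.shape[-1] ** -0.5
    out = torch.empty_like(q)
    for start in range(0, q.shape[-2], tile_size):
        q_tile = q[..., start:start + tile_size, :]               # (.., tile, d)
        scores = (q_tile @ k.transpose(-2, -1)) * scale           # (.., tile, seq)
        out[..., start:start + tile_size, :] = scores.softmax(-1) @ v
    return out

# Usage: same result as full attention, but with a smaller peak activation footprint.
q = k = v = torch.randn(1, 8, 4096, 64)   # (batch, heads, seq, head_dim)
y = tiled_attention(q, k, v)
```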
Closing thought: the CALM era is accelerating. Keep an eye on the work from Tencent and Tsinghua and on communities like LocalLLaMA for the next memory-friendly leap [3][4].
References
[3] CALM (Continuous Autoregressive Language Models): instead of predicting one token at a time, CALM predicts continuous vectors that represent multiple tokens at once. Continuous vectors boost efficiency; contrasts open and closed Chinese and Western models; debate on releasing weights.
[4] Tencent + Tsinghua paper "Continuous Autoregressive Language Models (CALM)". Post discusses the CALM paper, continuous autoregressive modeling via latent diffusion, questions about tokenization, OCR-like latent spaces, and related critiques online.
[5] "Which Architectural Strategies Are Set to Reduce Peak Memory Use?" Discusses architectural approaches to lower LLM peak memory: weight reuse, partitioning, dynamic loading, MoEs, efficient attention, and system-wide tradeoffs.