CALM is rewriting how LLMs generate language: no more token-by-token marching. CALM (Continuous Autoregressive Language Models) replaces next-token prediction with next-vector prediction, powered by a high-fidelity autoencoder that compresses a chunk of K tokens into a single continuous vector and reconstructs them with over 99.9% accuracy [3].
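To make the chunking concrete, here is a minimal PyTorch-style sketch of a chunk autoencoder. The class name, layer choices, and dimensions are illustrative assumptions rather than the paper's exact architecture; the point is only the shape contract of encoding K tokens into one vector and decoding them back.

```python
import torch
import torch.nn as nn

class ChunkAutoencoder(nn.Module):
    """Compress a chunk of K tokens into one continuous vector and reconstruct it.
    Hypothetical sketch: layer choices and sizes are assumptions, not CALM's exact design."""

    def __init__(self, vocab_size=32000, k=4, d_model=512, d_latent=128):
        super().__init__()
        self.k = k
        self.vocab_size = vocab_size
        self.embed = nn.Embedding(vocab_size, d_model)
        # Encoder: fuse the K token embeddings into a single latent vector.
        self.encoder = nn.Sequential(
            nn.Linear(k * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_latent),
        )
        # Decoder: expand the latent back into K sets of token logits.
        self.decoder = nn.Sequential(
            nn.Linear(d_latent, d_model),
            nn.GELU(),
            nn.Linear(d_model, k * vocab_size),
        )

    def encode(self, token_ids):                  # (batch, K) -> (batch, d_latent)
        x = self.embed(token_ids)                 # (batch, K, d_model)
        return self.encoder(x.flatten(1))         # concatenate the K embeddings

    def decode(self, latent):                     # (batch, d_latent) -> (batch, K, vocab)
        return self.decoder(latent).view(-1, self.k, self.vocab_size)

    def forward(self, token_ids):
        logits = self.decode(self.encode(token_ids))
        recon = logits.argmax(dim=-1)             # greedy reconstruction of the K tokens
        return logits, recon
```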
Core idea: continuous vectors
Because CALM models continuous vectors rather than a discrete vocabulary, output probabilities can no longer be computed with a softmax, so the researchers built a likelihood-free toolkit for training, evaluation (the BrierLM metric), and sampling [3]. The big win: language becomes a sequence of continuous vectors, cutting the number of generative steps by a factor of K [3].
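For intuition on what "likelihood-free evaluation" means, here is a toy, sample-only Brier estimator in Python. The exact BrierLM construction is defined in the paper [3] and differs in its details; this sketch only shows that a Brier-style score can be estimated from model samples alone, with no access to probabilities.

```python
import random

def brier_estimate(samples, reference):
    """Unbiased sample-only estimate of the Brier score
    sum_x (p(x) - 1[x == reference])^2 for an unknown distribution p,
    using only i.i.d. samples from the model (no likelihoods needed).
    Lower is better. Illustrative sketch, not the exact BrierLM metric."""
    assert len(samples) >= 2
    # The collision rate over distinct sample pairs estimates sum_x p(x)^2.
    pairs = [(a, b) for i, a in enumerate(samples) for b in samples[i + 1:]]
    collision = sum(a == b for a, b in pairs) / len(pairs)
    # The fraction of samples matching the reference estimates p(reference).
    hit = sum(s == reference for s in samples) / len(samples)
    return collision - 2.0 * hit + 1.0

# Toy usage: a "model" that outputs one of three continuations.
model_samples = random.choices(["cat", "dog", "cow"], weights=[0.7, 0.2, 0.1], k=50)
print(brier_estimate(model_samples, reference="cat"))
```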
Industry momentum: Tencent + Tsinghua CALM
Tencent and Tsinghua released the CALM paper, showing how text latents from an autoencoder can drive a purely continuous generation path; community threads discuss pairing those latents with diffusion-style denoising and decoding back to text [4].
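A rough sketch of what such a continuous generation loop could look like follows. The `vector_model` and `chunk_decoder` interfaces are hypothetical placeholders for illustration, not the released implementation; the only claim taken from the source is the loop structure of predicting one vector per step and decoding it back into K tokens.

```python
import torch

@torch.no_grad()
def generate(vector_model, chunk_decoder, prompt_latents, num_chunks, k=4):
    """Autoregressive generation over continuous vectors instead of tokens.

    vector_model : maps a sequence of latent vectors to the next latent vector
                   (hypothetical interface standing in for CALM's generative head).
    chunk_decoder: maps one latent vector back to K token ids
                   (e.g. ChunkAutoencoder.decode followed by argmax).
    """
    latents = list(prompt_latents)                    # each item: (d_latent,) tensor
    token_ids = []
    for _ in range(num_chunks):
        history = torch.stack(latents).unsqueeze(0)   # (1, T, d_latent)
        next_vec = vector_model(history)[0]           # predict ONE vector per step
        latents.append(next_vec)
        # Each predicted vector expands back into K tokens, so the loop runs
        # roughly 1/K as many steps as token-by-token decoding.
        token_ids.extend(chunk_decoder(next_vec.unsqueeze(0))[0].tolist())
    return token_ids
```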
Memory-focused architectures
Discussions highlight architectural strategies for curbing peak memory: leaner attention, tiling, and dynamic loading, along with debate over preserving capacity while cutting memory use. The trend points toward memory-efficient LLMs, not just bigger models [5].
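As a concrete example of the "tiling" idea, here is a hedged sketch of query-tiled attention in PyTorch. It is not taken from any specific library, and production kernels such as FlashAttention tile further and fuse the softmax; the sketch only shows how tiling caps peak activation memory.

```python
import torch

def tiled_attention(q, k, v, tile_size=256):
    """Compute softmax(QK^T / sqrt(d)) V one query tile at a time.

    Instead of materializing the full (seq, seq) attention matrix, only a
    (tile, seq) slice exists at any moment, so peak memory scales with
    tile_size rather than with sequence length squared.
    """
    scale = q.shape[-1] ** -0.5
    out = torch.empty_like(q)
    for start in range(0, q.shape[-2], tile_size):
        q_tile = q[..., start:start + tile_size, :]               # (.., tile, d)
        scores = (q_tile @ k.transpose(-2, -1)) * scale           # (.., tile, seq)
        out[..., start:start + tile_size, :] = scores.softmax(-1) @ v
    return out

# Usage: same result as full attention, but with a smaller peak activation footprint.
q = k = v = torch.randn(1, 8, 4096, 64)   # (batch, heads, seq, head_dim)
y = tiled_attention(q, k, v)
```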
Closing thought: the CALM era is accelerating. Keep an eye on the work from Tencent and Tsinghua and on communities like LocalLLaMA for the next memory-friendly leap [3][4].
References
[3] CALM (Continuous Autoregressive Language Models): instead of predicting one token at a time, CALM predicts continuous vectors that represent multiple tokens at once. Continuous vectors boost efficiency; contrasts open and closed Chinese and Western models; debate on releasing weights.
[4] Tencent + Tsinghua paper "Continuous Autoregressive Language Models (CALM)". Post discusses the CALM paper, continuous autoregressive modeling via latent diffusion, questions about tokenization, OCR-like latent spaces, and related critiques online.
[5] "Which Architectural Strategies Are Set to Reduce Peak Memory Use?" Discusses architectural approaches to lower LLM peak memory: weight reuse, partitioning, dynamic loading, MoEs, efficient attention, and system-wide tradeoffs.