
CALM and the race to memory-efficient LLMs: continuous vectors, not tokens

Opinions on LLMs:

CALM is rewriting how LLMs generate language: no more token-by-token marching. It replaces next-token prediction with next-vector prediction, powered by a high-fidelity autoencoder that compresses K tokens into a single vector and reconstructs them with over 99.9% accuracy [3].
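To make that mechanism concrete, here is a minimal PyTorch sketch of a chunk autoencoder that compresses K tokens into one continuous vector and reconstructs them. All names, layer sizes, and the architecture itself are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only: compress a chunk of K tokens into one continuous
# vector and reconstruct the tokens from it (the CALM-style idea, not CALM's code).
import torch
import torch.nn as nn

class ChunkAutoencoder(nn.Module):
    def __init__(self, vocab_size=32000, k=4, d_token=256, d_latent=512):
        super().__init__()
        self.k = k
        self.vocab_size = vocab_size
        self.embed = nn.Embedding(vocab_size, d_token)
        # Encoder: K token embeddings -> one latent vector
        self.encoder = nn.Sequential(
            nn.Linear(k * d_token, d_latent), nn.GELU(),
            nn.Linear(d_latent, d_latent),
        )
        # Decoder: latent vector -> logits for all K token positions
        self.decoder = nn.Sequential(
            nn.Linear(d_latent, d_latent), nn.GELU(),
            nn.Linear(d_latent, k * vocab_size),
        )

    def encode(self, token_ids):                  # (batch, K) int64
        x = self.embed(token_ids)                 # (batch, K, d_token)
        return self.encoder(x.flatten(1))         # (batch, d_latent)

    def decode(self, z):                          # (batch, d_latent)
        logits = self.decoder(z)                  # (batch, K * vocab)
        return logits.view(-1, self.k, self.vocab_size)

    def forward(self, token_ids):
        logits = self.decode(self.encode(token_ids))
        # Cross-entropy reconstruction loss over all K positions
        return nn.functional.cross_entropy(
            logits.flatten(0, 1), token_ids.flatten()
        )

ae = ChunkAutoencoder()
loss = ae(torch.randint(0, 32000, (8, 4)))  # a batch of 8 chunks, K=4 tokens each
```

Once such an autoencoder reconstructs chunks near-perfectly, the language model can work entirely in the latent space and hand decoding back to the autoencoder.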

Core idea: continuous vectors
With CALM, probabilities over a continuous vector space can't be computed with a softmax over a vocabulary, so the researchers built a likelihood-free toolkit for training, evaluation (the BrierLM metric), and sampling [3]. The big win: language becomes a sequence of continuous vectors, cutting the number of generative steps by a factor of K [3].
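For a flavor of what "likelihood-free evaluation" means, the sketch below estimates a standard Brier score purely from model samples, never touching the model's probabilities. This is the generic statistical trick; the exact BrierLM definition in the paper may differ in scaling and aggregation.

```python
# Sample-only Brier-score estimate. For a categorical predictive distribution q
# and true label y, BS = sum_k (q_k - 1[k == y])^2 = sum_k q_k^2 - 2*q_y + 1,
# and both unknown terms can be estimated from two independent model samples.
import random

def brier_estimate(sample_fn, true_token, n_rounds=10000):
    """sample_fn() draws one token from the model's predictive distribution."""
    total = 0.0
    for _ in range(n_rounds):
        x1, x2 = sample_fn(), sample_fn()        # two i.i.d. samples
        collision = 1.0 if x1 == x2 else 0.0     # unbiased for sum_k q_k^2
        hit = 1.0 if x1 == true_token else 0.0   # unbiased for q_y
        total += collision - 2.0 * hit + 1.0
    return total / n_rounds                      # lower is better

# Toy usage: a "model" that samples from a small fixed distribution.
dist = {"the": 0.6, "a": 0.3, "an": 0.1}
sample_fn = lambda: random.choices(list(dist), weights=dist.values())[0]
print(brier_estimate(sample_fn, true_token="the"))  # converges to ~0.26
```

The point is that evaluation needs only the ability to sample, which is exactly what a model without a softmax can still provide.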

Industry momentum: Tencent + Tsinghua
Tencent and Tsinghua released a CALM paper showing how text latents from an autoencoder can drive a purely continuous generation path; community threads discuss pairing those latents with diffusion-like denoising and decoding back to text [4].
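A hedged sketch of the diffusion-style loop those threads describe: start from noise in latent space, iteratively denoise conditioned on the context latents, then decode back to tokens. The `denoiser` and `autoencoder` objects, the schedule, and the update rule are hypothetical stand-ins, not components from the paper.

```python
# Toy latent-denoising loop for one chunk of text (illustration only).
import torch

@torch.no_grad()
def generate_chunk(denoiser, autoencoder, context_latents, d_latent=512, steps=10):
    z = torch.randn(1, d_latent)                  # start from pure noise
    for t in reversed(range(1, steps + 1)):
        alpha = t / steps                         # toy linear noise level in (0, 1]
        # Predict a cleaner latent from the noisy one, the noise level,
        # and the previously generated context latents.
        z_pred = denoiser(z, alpha, context_latents)
        # Move toward the prediction; a stand-in for a real DDIM / flow-matching update.
        z = z + (z_pred - z) / t
    return autoencoder.decode_tokens(z)           # map the clean latent back to K tokens
```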

Memory-focused architectures
Discussions highlight architectural strategies to curb peak memory: leaner attention, tiling, and dynamic loading, plus debates on keeping capacity while cutting memory use. The trend points toward memory-efficient LLMs, not just bigger models [5].
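As one concrete instance of the tiling idea, the sketch below computes attention in query blocks so the full T x T score matrix is never materialized at once. It is a plain illustration of the principle, not FlashAttention or any specific library kernel.

```python
# Tiled (chunked) attention: peak extra memory is O(block * T) instead of O(T * T).
import torch

def tiled_attention(q, k, v, block=128):
    # q, k, v: (T, d) tensors for a single head.
    T, d = q.shape
    out = torch.empty_like(q)
    scale = d ** -0.5
    for start in range(0, T, block):
        qb = q[start:start + block]                      # (block, d)
        scores = (qb @ k.T) * scale                      # (block, T) -- only one tile live
        out[start:start + block] = torch.softmax(scores, dim=-1) @ v
    return out

# Matches the naive full-matrix result on random inputs.
q, k, v = (torch.randn(1024, 64) for _ in range(3))
full = torch.softmax((q @ k.T) * 64 ** -0.5, dim=-1) @ v
assert torch.allclose(tiled_attention(q, k, v), full, atol=1e-5)
```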

Closing thought: the CALM era is accelerating. Keep an eye on the Tencent and Tsinghua work and on communities like LocalLLaMA for the next memory-friendly leap [4][3].

References

[3] Reddit, "Instead of predicting one token at a time, CALM (Continuous Autoregressive Language Models) predicts continuous vectors that represent multiple tokens at once." Thread on continuous vectors boosting efficiency, contrasts between open/closed Chinese and Western models, and debate on releasing weights.

[4] Reddit, "Tencent + Tsinghua just dropped a paper called Continuous Autoregressive Language Models (CALM)." Post discussing the CALM paper, continuous autoregressive modeling via latent diffusion, questions about tokenization, OCR-like latent spaces, and related critiques.

[5] Reddit, "Which Architectural Strategies are Set to Reduce Peak Memory Use?" Discussion of architectural approaches to lower LLM peak memory: weight reuse, partitioning, dynamic loading, MoEs, efficient attention, and system-wide tradeoffs.
