
Speculative decoding: can drafting tokens before you finish truly cut latency—and at what cost?


Speculative decoding that drafts tokens while you type sounds sexy, but its practicality is hotly debated. The core idea is two-way drafting: speculate on the user’s remaining input tokens and the model’s reply tokens in parallel, then finalize the response once the message is sent. Proponents point to prototypes and related papers, but feasibility hinges on architecture and coordination [1].
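
For orientation, classic one-way speculative decoding works by having a cheap draft model propose a few tokens that the large target model then verifies in a single forward pass, keeping the longest agreeing prefix. Below is a minimal greedy sketch of that loop; draft_model, target_model, and their methods are hypothetical placeholders, not an API from any cited thread.

```python
# Classic (one-way) greedy speculative decoding, sketched with hypothetical
# draft_model / target_model objects: the small model proposes k tokens, the
# large model checks them in one pass, and the longest agreeing prefix is kept.

def speculative_step(prompt_ids, draft_model, target_model, k=4):
    # 1. Draft: the small model autoregressively proposes k candidate tokens.
    ctx = list(prompt_ids)
    draft_ids = []
    for _ in range(k):
        tok = draft_model.next_token(ctx)   # cheap forward pass (assumed API)
        draft_ids.append(tok)
        ctx.append(tok)

    # 2. Verify: the large model greedily scores all k positions in one pass
    #    (target_preds[i] is its choice given prompt_ids + draft_ids[:i]).
    target_preds = target_model.greedy_tokens(prompt_ids, draft_ids)

    # 3. Accept the agreeing prefix; at the first mismatch, keep the target
    #    model's own token so the output matches target-only greedy decoding.
    accepted = []
    for d, t in zip(draft_ids, target_preds):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)
            break
    return accepted
```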

Feasibility snapshot

The concept envisions drafting the user’s next tokens alongside the agent’s drafts in real time. Some prototypes and cited papers hint at potential, but many open questions remain about the framework and how it would integrate with existing chat flows [1].
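
As a rough illustration of the coordination problem, here is a hypothetical orchestration sketch: on each keystroke a small model extrapolates how the user’s message might end, the agent pre-drafts a reply against that guess, and on send the draft is kept only if the guess matched. Every object and method name here is an assumption for illustration, not something described in the sources.

```python
# Hypothetical "two-way" orchestration: speculate on the user's unfinished
# message, pre-draft the agent's reply against that guess, and only reuse the
# draft if the guess turns out to match what was actually sent.

def on_keystroke(partial_input, user_draft_model, agent_model, cache):
    # Guess how the message will end and build a speculative prompt.
    predicted_suffix = user_draft_model.complete(partial_input)
    cache["prompt"] = partial_input + predicted_suffix
    # Pre-draft a reply against the guess while the user is still typing.
    cache["draft_reply"] = agent_model.generate(cache["prompt"], max_tokens=64)

def on_send(final_input, agent_model, cache):
    if cache.get("prompt") == final_input:
        # Speculation hit: the reply is already partially written.
        prefix = cache["draft_reply"]
        return prefix + agent_model.generate(final_input + prefix)
    # Speculation miss: discard the draft and pay full latency.
    return agent_model.generate(final_input)
```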

Latency, throughput, and context

Latency isn’t just “faster generation.” Throughput, GPU VRAM, and the model’s context window set hard limits. Discussions flag context windows and rolling context as design challenges, and the potential gains depend on how well drafts stay in sync as the user types [2]. Offloading KV caches across sessions, e.g. with LMCache, can help scale to many users [2].
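
To see why VRAM and context length set the hard limits, a back-of-the-envelope KV-cache estimate helps; the model dimensions below are illustrative assumptions, not the specs of any model named in the sources.

```python
# Back-of-the-envelope KV-cache sizing: the per-session memory that bounds how
# much rolling context (and how many concurrent users) fits in VRAM. Model
# dimensions are illustrative assumptions only.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, dtype_bytes=2):
    # Two tensors (K and V) per layer, one head_dim vector per KV head per
    # token, stored at dtype_bytes precision (2 bytes for fp16/bf16).
    return 2 * n_layers * n_kv_heads * head_dim * context_len * dtype_bytes

# Assumed mid-size dense model: 48 layers, 8 KV heads (GQA), 128-dim heads.
per_session = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128,
                             context_len=32_768)
print(f"KV cache per 32k-token session: {per_session / 1e9:.1f} GB")
# ~6.4 GB before weights, activations, or speculative drafts, which is why
# offloading idle caches (LMCache-style) matters once sessions multiply.
```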

Deployment economics

The hardware math is mixed. One thread cites 4x RTX 5090 GPUs delivering about 167 t/s per user, versus a single RTX PRO 6000 at roughly 60 t/s [2]. On the cost side, API options such as Claude 4.5 Sonnet, GLM 4.6, Qwen3 Next Instruct, DeepSeek 3.1, and GPT-OSS-120B illustrate the tradeoffs in price and latency [3]. Some propose renting H100s for short bursts when training or adapting a model, then serving from local hardware [3].
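
A toy break-even calculation makes the mixed hardware math concrete; every price and rate below is an assumption meant to be replaced with real quotes, and only the roughly 60 t/s single-GPU throughput echoes the thread.

```python
# Toy break-even between buying a local GPU and paying per token via an API.
# Every figure is an assumption for illustration except the ~60 t/s single-GPU
# throughput echoed from the discussion [2].

gpu_cost_usd = 8_000            # assumed one-time hardware spend
gpu_monthly_power_usd = 60      # assumed electricity + hosting per month
local_tps = 60                  # tokens/sec for one big GPU, per the thread
api_price_per_m_tokens = 3.0    # assumed blended $ per 1M output tokens

# A single GPU at 60 t/s tops out around 155M output tokens per month.
local_capacity = local_tps * 3600 * 24 * 30

def monthly_api_cost(tokens_per_month):
    return tokens_per_month / 1e6 * api_price_per_m_tokens

def months_to_break_even(tokens_per_month):
    saving = monthly_api_cost(tokens_per_month) - gpu_monthly_power_usd
    return float("inf") if saving <= 0 else gpu_cost_usd / saving

workload = 100e6  # assumed 100M output tokens/month, within local capacity
print(f"Local capacity: {local_capacity / 1e6:.1f}M tokens/month")
print(f"API bill:       ${monthly_api_cost(workload):,.0f}/month")
print(f"Break-even:     {months_to_break_even(workload):.1f} months of local serving")
```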

Risks and tradeoffs

The debate centers on whether the latency wins justify the added architectural complexity and potential reliability issues. The idea isn’t dismissed outright, but many think it would require a radically different approach or framework to pull off in production [1].

Closing thought: two-way speculative decoding remains intriguing, but its real-world sweet spot will hinge on tangible latency gains without sacrificing accuracy or simplicity.

References

[1] Reddit: “Would it be theoretically possible to create a two-way speculative decoder to infer the user's next token while they're typing and generate the LLM's draft tokens in real-time before the user finishes then finalize the response once sent?” Discusses speculative decoding to reduce latency by drafting user and agent tokens in real time; considers architecture, sampling, and latency tradeoffs.

[2] Reddit: “Nice LLM calculator.” Discussion of GPU VRAM limits, context windows, and model density affecting LLM throughput; Gemma 3 27B and RTX PRO 6000 comparisons.

[3] Reddit: “LLM recomendation.” Discusses local GPUs vs. API costs, model options, throughput, RAM, and LoRA training for structured-output data extraction and automation.
