Speculative decoding that drafts tokens while you type sounds sexy, but the practicality is hotly debated. The core idea is two-way drafting: predict the user's next input tokens and draft the model's reply in parallel while the message is still being typed, then finalize the response once it is sent. Proponents point to prototypes and related papers, but feasibility hinges on architecture and coordination [1].
Feasibility snapshot
The concept envisions drafting the user's next tokens alongside the agent's drafts in real time. Some prototypes and cited papers hint at potential, but many questions remain about the framework and how it would integrate with existing chat flows [1].
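To make the idea concrete, here is a minimal sketch of the two-way drafting loop. Everything in it is an assumption for illustration: `draft_user_continuation`, `draft_reply`, and `verify_and_continue` are hypothetical stand-ins for a small draft model and the target model, implemented as trivial stubs.

```python
# Minimal sketch of two-way speculative drafting (illustrative only).
# The model-facing functions below are hypothetical stubs, not a real API.

from dataclasses import dataclass, field


def draft_user_continuation(partial_input: str) -> str:
    """Stub: a small draft model would predict how the user's message ends."""
    return partial_input + "?"


def draft_reply(guessed_input: str) -> list[str]:
    """Stub: a draft model would pre-draft the agent's reply tokens."""
    return ["Sure,", "here", "is", "an", "answer."]


def verify_and_continue(final_input: str, drafted: list[str]) -> str:
    """Stub: standard speculative verification would accept or reject drafted tokens."""
    return " ".join(drafted)


def generate_from_scratch(final_input: str) -> str:
    """Stub: ordinary decoding when the user-side guess missed."""
    return "Regular (non-drafted) response."


@dataclass
class DraftState:
    partial_input: str = ""   # what the user has typed so far
    guessed_input: str = ""   # draft model's guess at the full message
    drafted_reply: list[str] = field(default_factory=list)  # pre-drafted reply tokens


def on_keystroke(state: DraftState, new_text: str) -> None:
    """As the user types, re-draft the guessed message and a reply against it."""
    state.partial_input = new_text
    state.guessed_input = draft_user_continuation(new_text)
    state.drafted_reply = draft_reply(state.guessed_input)


def on_send(state: DraftState, final_input: str) -> str:
    """On send, keep the pre-drafted reply only if the user-side guess held up."""
    if final_input == state.guessed_input:
        return verify_and_continue(final_input, state.drafted_reply)
    # Wrong guess: the speculative work is wasted; fall back to normal decoding.
    return generate_from_scratch(final_input)
```

The sketch makes the central tradeoff visible: every keystroke-time draft is wasted compute whenever the user-side guess misses, which is exactly the coordination problem the discussion flags.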
Latency, throughput, and context
Latency isn't just "faster generation." Throughput, GPU VRAM, and the model's context window set hard limits. Discussions flag context windows and rolling context as design challenges, and the potential gains depend on how well drafts stay in sync as the user types [2]. Offloading caches across sessions, e.g., with LMCache, can help scale to many users [2].
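To see why VRAM and context window are hard limits, a back-of-the-envelope KV-cache estimate helps. The layer, head, and batch figures below are illustrative placeholders, not numbers from the cited threads; only the formula (two cached tensors, K and V, per layer per token) is standard.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache size: 2 tensors (K and V) per layer, per token.
    bytes_per_elem=2 assumes fp16/bf16 cache entries."""
    return (2 * num_layers * num_kv_heads * head_dim
            * bytes_per_elem * context_len * batch_size)


# Placeholder config: 32 layers, 8 KV heads (GQA), head_dim 128,
# 32k-token context, 4 concurrent users.
size_gib = kv_cache_bytes(32, 8, 128, 32_768, 4) / 2**30
print(f"~{size_gib:.1f} GiB of KV cache")  # ~16.0 GiB, before weights and activations
```

Even with grouped-query attention, long contexts for a handful of concurrent users consume a large slice of VRAM before the weights are counted, which is why offloading caches across sessions comes up in the discussion.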
Deployment economics
Hardware math is mixed. One thread cites 4x 5090 GPUs delivering about 167 t/s per user, versus 1x RTX PRO 6000 at roughly 60 t/s [2]. On the cost side, API options like Claude 4.5 Sonnet, GLM 4.6, Qwen3 Next Instruct, Deepseek 3.1, and GPT-OSS-120B illustrate tradeoffs in price and latency [3]. Some propose renting H100s for short bursts when training or adapting a model, then serving from local hardware [3].
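One way to sanity-check the local-vs-API tradeoff is cost per million generated tokens. Only the two throughput figures below come from the thread [2]; the hardware prices, power draw, electricity rate, utilization, and API price are placeholder assumptions for illustration.

```python
def local_cost_per_mtok(tokens_per_sec: float, hw_cost_usd: float,
                        amortize_years: float, power_watts: float,
                        usd_per_kwh: float, utilization: float = 0.3) -> float:
    """Amortized hardware plus electricity cost per 1M generated tokens."""
    active_seconds = amortize_years * 365 * 24 * 3600 * utilization
    tokens = tokens_per_sec * active_seconds
    energy_kwh = (power_watts / 1000) * (active_seconds / 3600)
    return (hw_cost_usd + energy_kwh * usd_per_kwh) / tokens * 1_000_000


# Throughput from the thread [2]; all other numbers are placeholder guesses.
rig_4x5090 = local_cost_per_mtok(167, hw_cost_usd=10_000, amortize_years=3,
                                 power_watts=2300, usd_per_kwh=0.15)
rig_pro6000 = local_cost_per_mtok(60, hw_cost_usd=8_000, amortize_years=3,
                                  power_watts=600, usd_per_kwh=0.15)
api_rate = 3.00  # placeholder $/M output tokens for a hosted model

print(f"4x 5090:      ~${rig_4x5090:.2f} / M tokens")
print(f"RTX PRO 6000: ~${rig_pro6000:.2f} / M tokens")
print(f"API:           ${api_rate:.2f} / M tokens")
```

The point is not the exact dollar figures but the shape of the comparison: local hardware only wins when utilization stays high, which is also why the "rent H100s for bursts, serve locally" pattern shows up in the discussion.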
Risks and tradeoffs
The debate centers on whether the latency wins justify the added architectural complexity and potential reliability issues. The idea isn't dismissed, but many think it may require a radically different approach or framework to pull off in production [1].
Closing thought: two-way speculative decoding remains intriguing, but its real-world sweet spot will hinge on delivering tangible latency gains without sacrificing accuracy or simplicity.
References
[1] "Would it be theoretically possible to create a two-way speculative decoder to infer the user's next token while they're typing and generate the LLM's draft tokens in real-time before the user finishes then finalize the response once sent?" Discusses speculative decoding to reduce latency by drafting user and agent tokens in real time; considers architecture, sampling, and latency tradeoffs.
[2] "Nice LLM calculator." Discussion of GPU VRAM limits, context windows, and model density affecting LLM throughput; Gemma 3 27B and RTX PRO 6000 comparisons.
[3] "LLM recomendation." Discusses local GPUs vs API costs, model options, throughput, RAM, and LoRA training for structured-output data extraction and automation.