
Speed, Hardware, and Realistic Expectations for Local Inference


Speed, hardware realities, and context length collide in local LLMs. One post shows a Qwen3-30B-A3B-Thinking run on an RTX 5090 hitting 180-210 tk/s, while others see 110-130 tk/s on the same setup, sparking questions about what actually sets the pace [1].

Speed snapshots
• On the same setup, reports hover around ~173 tk/s at both ~6k and ~10k context, with speeds drifting as the context grows (a quick self-timing sketch follows below) [1].
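A quick way to sanity-check your own tk/s numbers: the sketch below times a single generation with llama-cpp-python (a community binding for llama.cpp). The model filename, prompt, and context size are placeholders, and the figure blends prompt processing with decode, so treat it as a rough probe rather than a benchmark.

```python
# Rough tokens/sec probe, assuming llama-cpp-python and a local GGUF file.
# The model path, prompt, and context size are illustrative placeholders.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-30B-A3B-Thinking-Q4_K_M.gguf",  # hypothetical filename
    n_ctx=10_240,      # context window under test; speeds drift as this grows
    n_gpu_layers=-1,   # offload every layer to the GPU if VRAM allows
    verbose=False,
)

start = time.perf_counter()
out = llm("Explain why KV-cache size grows with context length.", max_tokens=256)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tk/s")
```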

CPU vs GPU offload reality
• Some users dial back GPU offloading or run CPU-only; a common tip is to disable GPU offloading entirely by setting the offloaded layer count to 0, which for some configurations improves latency (see the sketch after this list) [2].
• Intel's IPEX-LLM path comes up for CPU-friendly inference on certain architectures [2].
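To make the offload tip concrete, here is a minimal sketch of the same idea in llama-cpp-python, which builds on the same llama.cpp engine LM Studio wraps; the filenames are hypothetical, and 0 GPU layers is the "disable offloading" setting mentioned above.

```python
# CPU-only vs. partial GPU offload, assuming llama-cpp-python.
# Model filenames are placeholders; tune n_threads to your physical core count.
from llama_cpp import Llama

cpu_only = Llama(
    model_path="qwen3-4b-thinking-Q4_K_M.gguf",  # hypothetical quantized model
    n_gpu_layers=0,    # no layers offloaded: pure CPU inference
    n_threads=8,       # CPU threads used for generation
    n_ctx=8_192,
)

partial = Llama(
    model_path="qwen3-4b-thinking-Q4_K_M.gguf",
    n_gpu_layers=20,   # offload only some layers when VRAM is tight
    n_ctx=8_192,
)
```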

CPU-only or VPS viability
• On CPU-only boxes or VPS-class machines (think 32GB-RAM laptops), Qwen3-30B-A3B may demand about 55GB of RAM at 256k context; GPT-OSS-120B can spike toward ~64GB, while GPT-OSS-20B sits around ~15GB at full context [3].
• If you're chasing a quick, low-cost test, smaller models in CPU mode are a practical entry point on modest VPS setups [3].
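Before downloading anything, you can estimate whether a box will cope by adding quantized-weight size to KV-cache size. The sketch below uses a generic KV-cache formula; the layer, head, and file-size numbers are illustrative assumptions, not the published Qwen3 or GPT-OSS configs.

```python
# Back-of-the-envelope RAM estimate: quantized weights + fp16 KV cache.
# Architecture and file-size numbers are illustrative assumptions only.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_value: int = 2) -> float:
    """Keys + values for every layer at full context, in GiB."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_value / 1024**3

def total_ram_gb(weight_file_gb: float, n_layers: int, n_kv_heads: int,
                 head_dim: int, ctx_len: int) -> float:
    return weight_file_gb + kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len)

# Example: a ~17 GB quantized file, 48 layers, 4 KV heads of dimension 128,
# at 256k context. The KV cache alone adds 20+ GB, which is how a model that
# "fits" on disk can still blow past a 32 GB laptop at long context.
print(f"{total_ram_gb(17.0, 48, 4, 128, 256_000):.1f} GB total")
```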

MoE and long-context advances
• Ling-mini-2.0-MXFP4MOE-GGUF and Ring-mini-2.0-MXFP4MOE-GGUF are MoE-based, tiny-yet-mighty options; MXFP4 quantization can be faster on compatible hardware and keeps long contexts (128K via YaRN) while activating only about 1.6B of the 16B parameters per token [4].
• Ring-mini-sparse-2.0-exp uses MoBA to slash KV-cache overhead (~8K tokens attended per query at 64K context), delivering up to 3x decode speedup over dense equivalents while still aiming for solid reasoning in long chats (128K via YaRN); a rough worked example follows below [5].
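For a feel of why MoE and MoBA-style sparsity help at decode time, here is a tiny worked calculation using only the numbers quoted above (active parameters, context length, and tokens attended per query); everything else is assumption, and real end-to-end speedups land lower because attention is only part of each step.

```python
# Rough arithmetic behind the MoE / sparse-attention claims in the threads.

# MoE: Ling/Ring-mini-2.0 activate ~1.6B of 16B parameters per token.
active_params, total_params = 1.6e9, 16e9
print(f"~{active_params / total_params:.0%} of weights active per decoded token")

# MoBA-style sparse attention: each query reads ~8K of a 64K-token KV cache.
context_len, attended_per_query = 64_000, 8_000
print(f"KV reads per decode step shrink ~{context_len / attended_per_query:.0f}x")

# Attention is only one slice of the per-token cost (expert FLOPs and weight
# loads are unchanged), so the thread reports up to ~3x decode speedup rather
# than the full 8x attention-side reduction.
```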

Closing thought: the real-world answer to speed isn't a single setting; it's a dance among model size, quantization, context length, and your hardware. Stay tuned for more MoE long-context wins and new VPS-friendly paths.

References

[1] Reddit: "Is it normal to reach 180-210 tk/s with 30B local LLM?" Discusses token-per-second speeds for Qwen3-30B, MoE vs dense, context length, and GPU/engine factors.

[2] Reddit: "Very slow response on gwen3-4b-thinking model on LM Studio. I need help" User compares CPU-only vs GPU offloading; recommends smaller, quantized LLMs and non-thinking variants for coding assistance in VSCode on PC.

[3] Reddit: "Small LLM runs on VPS without GPU" Discusses CPU RAM limits for tiny LLMs; compares Qwen3-30B-A3B, gpt-oss variants, and SmallThinker; tasks include JSON output, summarization, and date handling.

[4] Reddit: "Support for Ling and Ring models (1000B/103B/16B) has finally been merged into llama.cpp" Discussion of the Ling/Ring models (1T/103B/16B), MXFP4 quantization vs UD-Q4_K_XL, speeds, benchmarks, availability, and role-playing use.

[5] Reddit: "Ring-mini-sparse-2.0-exp, yet another experimental open source model from inclusionAI that tries to improve performance over long contexts" Discusses Ring-mini-sparse-2.0-exp, MoBA, long-context efficiency, comparisons with Ring-mini-linear-2.0, open weights, no RLHF, benchmarks, and hopes for strong math, coding, and science evals.
