
On-Prem LLMs in 2025: Hardware, Open-Source Options, and Real-World Performance Claims


On-prem LLMs in 2025 are a hardware-choice sport. Discussion kicks off with compact boxes built around a single RTX PRO 6000 and drifts toward beefier, two-box rigs. A typical Mini ITX build pairs a Ryzen 9 9900X, an ASUS ROG STRIX X870-I GAMING WIFI board, and 96GB of RAM, landing around €10,000 in France [1].

A cheaper, more plug-and-play path runs through GMKtec's mini PCs, where the choice is EVO-T1 versus EVO-X2. The T1 pairs an Intel Core Ultra 9 285H CPU with an Arc 140T iGPU, while the X2 ships with 128GB of RAM and sells for around $1,999 [2].
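
For intuition on why threads like [2] keep circling back to memory bandwidth, here is a rough back-of-envelope sketch: single-stream decode is largely memory-bound, so tokens/s is roughly capped by bandwidth divided by the bytes of weights touched per token. The numbers below (bandwidth, quantization width, active parameter counts) are illustrative assumptions, not measured specs for these machines.

```python
# Back-of-envelope decode-speed ceiling: a sketch, not a benchmark.
# Each decode step streams roughly all active weights through memory,
# so tokens/s <= memory bandwidth / active weight bytes.

def est_tokens_per_s(mem_bandwidth_gbps: float, active_params_b: float,
                     bytes_per_param: float = 0.5) -> float:
    """Upper-bound estimate: bandwidth (GB/s) / active weight size (GB)."""
    weight_gb = active_params_b * bytes_per_param  # ~0.5 bytes/param at Q4
    return mem_bandwidth_gbps / weight_gb

# Hypothetical figures: ~120 GB/s for a dual-channel DDR5 mini PC,
# a dense 32B model at Q4, and a 30B-class MoE with ~3B active params.
print(f"dense 32B  : ~{est_tokens_per_s(120, 32):.0f} tok/s ceiling")
print(f"MoE, 3B act: ~{est_tokens_per_s(120, 3):.0f} tok/s ceiling")
```

The gap between those two ceilings is why MoE models keep coming up in mini-PC discussions: the bandwidth stays fixed, so shrinking the active weight set is the main lever.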

Ling and Ring are no longer fringe: support for the models (1000B/103B/16B) has been merged into llama.cpp, expanding what you can run locally [3].
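
As a minimal sketch of what "merged into llama.cpp" buys you in practice, here is a hedged example using the llama-cpp-python bindings. The GGUF filename is hypothetical, and loading a Ling/Ring checkpoint assumes a build recent enough to include the newly merged architectures.

```python
# Minimal local-inference sketch via llama-cpp-python (assumed recent build).
from llama_cpp import Llama

llm = Llama(
    model_path="./Ring-mini-2.0-Q4_K_M.gguf",  # hypothetical filename
    n_ctx=8192,        # context window to allocate
    n_gpu_layers=-1,   # offload all layers to GPU if VRAM allows
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize MoE vs dense trade-offs."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```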

Real-world speeds vary. Some RTX 5090 setups push 180–210 tk/s on a 30B MoE like Qwen3-30B-A3B-Thinking in LM Studio, while others hover around 110–130 tk/s on similar rigs [4]. CPU-only experiments on GPU-less VPS hosts show that small models are workable with ample RAM, but expect much lower throughput unless you trim model size or context [5].
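
If you want to sanity-check your own tk/s numbers, one quick approach is to time a generation against a local OpenAI-compatible server (LM Studio exposes one; the port and model id below are placeholders for illustration) and divide completion tokens by wall-clock time. This includes prompt processing, so it slightly understates pure decode speed.

```python
# Rough tokens-per-second check against a local OpenAI-compatible endpoint.
import time
from openai import OpenAI

# Assumed default LM Studio endpoint; adjust host/port to your setup.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="qwen3-30b-a3b-thinking",  # placeholder; use your loaded model's id
    messages=[{"role": "user", "content": "Write a 200-word note on MoE models."}],
    max_tokens=512,
)
elapsed = time.perf_counter() - start

gen_tokens = resp.usage.completion_tokens
print(f"{gen_tokens} tokens in {elapsed:.1f}s -> {gen_tokens / elapsed:.0f} tok/s")
```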

MoE-powered speedups are live too. Ring-mini-sparse-2.0-exp uses MoBA to cut KV cache overhead by about 87.5%, delivering up to 3x decode speedups over the dense-attention Ring-mini-2.0 at 64K context; it targets 128K via YaRN and packs 16B total parameters, only a small fraction of which are active per token [7].
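
The 87.5% figure is easiest to read as a block-sparsity ratio: if each decode step only reads about 1 in 8 KV blocks, seven-eighths of the KV traffic disappears. The block size and keep ratio below are assumptions chosen to reproduce that quoted figure, not published details of MoBA's configuration.

```python
# Arithmetic sketch of block-sparse attention savings at long context.
# Assumption: each query reads ~1 of every 8 KV blocks, a ratio picked here
# only because it reproduces the ~87.5% figure quoted above.

context_len = 65_536       # ~64K tokens of context
block_size = 4_096         # hypothetical KV block size
keep_ratio = 1 / 8         # fraction of blocks each query actually reads

total_blocks = context_len / block_size
read_blocks = total_blocks * keep_ratio
saving = 1 - keep_ratio

print(f"blocks in cache : {total_blocks:.0f}")   # 16
print(f"blocks read/step: {read_blocks:.0f}")    # 2
print(f"KV reads saved  : {saving:.1%}")         # 87.5%
```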

Community verdicts in the open-weights space favor fast, scalable options like GPT-OSS-120B and GPT-OSS-20B for local inference, underscoring the rapid tooling and model diversity driving 2025's on-prem scene [6].

Bottom line: 2025’s on‑prem LLMs range from tiny, quiet boxes to MoE‑heavy powerhouses—the right pick hinges on your workload and budget.

References

[1] Reddit, "Local AI config : Mini ITX single RTX PRO 6000 Workstation for inference ?" Hardware build for multi-user LLMs; compares GPT-OSS-120B and Llama 3.3 70B; covers cost, speed, and cloud vs. on-prem RAM/GPU trade-offs.

[2] Reddit, "Why would I not get the GMKtec EVO-T1 for running Local LLM inference?" Compares EVO-T1 vs. EVO-X2 for local LLM inference; questions 32B performance; cites memory bandwidth; critiques 15 t/s claims.

[3] Reddit, "Support for Ling and Ring models (1000B/103B/16B) has finally been merged into llama.cpp" Discussion of Ling/Ring models (1T/103B/16B), MXFP4 quant vs. UD-Q4_K_XL, speeds, benchmarks, availability, and role-playing use.

[4] Reddit, "Is it normal to reach 180-210 tk/s with 30B local LLM?" Discusses token-per-second speeds for Qwen3-30B, MoE vs. dense models, context length, and GPU/engine factors.

[5] Reddit, "Small LLM runs on VPS without GPU" Discusses CPU and RAM limits for tiny LLMs; compares Qwen3-30B-A3B, gpt-oss variants, and SmallThinker; tasks include JSON extraction, summarization, and date handling.

[6] Reddit, "Best Local LLMs - October 2025" Monthly thread discussing open-weight local LLMs, setups, tool use, coding, and RP/text models, with diverse performance notes, experiences, and preferences.

[7] Reddit, "Ring-mini-sparse-2.0-exp, yet another experimental open source model from inclusionAI that tries to improve performance over long contexts" Discusses Ring-mini-sparse-2.0-exp, MoBA, long-context efficiency, comparisons with Ring-mini-linear-2.0, open weights, no RLHF, benchmarks, and hoped-for gains on math, coding, and science evals.
