On-prem LLMs in 2025 are a hardware-choice sport. The discussion starts with compact boxes built around a single RTX PRO 6000 and drifts toward beefier, two-box rigs. A typical Mini ITX build pairs a Ryzen 9 9900X, an ASUS ROG STRIX X870-I GAMING WIFI board, and 96GB RAM, landing around €10,000 in France [1].
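A quick way to gauge whether a quantized model even fits on a single workstation GPU is weight memory (parameters times bits per weight, divided by 8) plus KV cache. The sketch below is a back-of-the-envelope estimate with placeholder model dimensions (roughly Llama-70B-class), not a vendor sizing tool.

```python
def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB: parameters * bits per weight / 8."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    """KV cache in GB: 2 (K and V) * layers * kv_heads * head_dim * context * bytes."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1e9


# Placeholder figures: a ~70B dense model at ~4.5 bits/weight (Q4_K_M-class quant)
# with Llama-70B-like attention dimensions and a 32K context.
w = weights_gb(70, 4.5)
kv = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, context=32_768)
print(f"weights ~{w:.0f} GB + KV cache ~{kv:.0f} GB = ~{w + kv:.0f} GB")
```

By this estimate a 4-bit 70B model with a 32K context lands near 50GB, which is why single-card builds like this can host surprisingly large models.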
A cheaper, more plug-and-play path is one of GMKtec's mini-PCs, the EVO-T1 or the EVO-X2. The T1 pairs an Intel Core Ultra 9 285H CPU with an Arc 140T iGPU, while the X2 ships with 128GB of RAM and sells for around $1,999 [2].
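Single-stream decode on boxes like these is largely memory-bandwidth bound, which is why the thread zeroes in on bandwidth: a rough ceiling is bandwidth divided by the bytes streamed per token. The bandwidth figures, model size, and quant level in this sketch are illustrative assumptions, not measured specs for either box.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_b: float,
                         bits_per_weight: float) -> float:
    """Upper bound on single-stream decode speed: every generated token has to
    stream the active weights from memory at least once."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Illustrative bandwidths only (not measured specs for the EVO-T1 or EVO-X2):
# a dense ~32B model at ~4.5 bits/weight on two hypothetical memory systems.
for label, bw in [("~120 GB/s class", 120), ("~256 GB/s class", 256)]:
    print(f"{label}: ceiling ~{decode_ceiling_tok_s(bw, 32, 4.5):.1f} tok/s")
```

A ceiling in the low teens for a dense 32B model is roughly why the thread is skeptical of optimistic 15 t/s claims; MoE models with small active parameter counts change the math considerably.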
Ling and Ring are no longer fringe: support for the 1000B, 103B, and 16B variants has been merged into llama.cpp, expanding what you can run locally [3].
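Trying one of the newly supported checkpoints is a plain llama-cli run. The sketch below drives it from Python with a placeholder GGUF filename and generation settings, and assumes a llama.cpp build recent enough to include the merge.

```python
import subprocess

# Placeholder GGUF filename: substitute whichever Ling/Ring quantization you
# actually download. Assumes a recent llama.cpp build with llama-cli on PATH.
MODEL = "Ring-mini-2.0-Q4_K_M.gguf"

cmd = [
    "llama-cli",
    "-m", MODEL,          # model file
    "-c", "8192",         # context window
    "-ngl", "99",         # offload all layers to the GPU if one is available
    "--temp", "0.7",      # sampling temperature
    "-p", "Summarize the trade-offs of MoE models for local inference.",
]
subprocess.run(cmd, check=True)
```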
Real-world speeds vary. Some RTX 5090 setups push 180–210 tk/s on a 30B MoE like Qwen3-30B-A3B-Thinking in LM Studio, while others hover around 110–130 tk/s on similar rigs [4]. CPU-only experiments on GPU-less VPSes show you can test small models if RAM is plentiful, but expect much slower throughput unless you trim model size or context [5].
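Throughput claims are easy to check yourself: LM Studio exposes an OpenAI-compatible local server, so a timed request gives a rough tokens-per-second figure. The URL, model id, and prompt below are assumptions for a default local setup, and the measurement includes prompt processing, so treat it as a lower bound on pure decode speed.

```python
import time
import requests

# Assumes LM Studio's local server on its default port; adjust the URL and
# model id for llama-server, Ollama, or any other OpenAI-compatible backend.
URL = "http://localhost:1234/v1/chat/completions"
payload = {
    "model": "qwen3-30b-a3b-thinking",  # placeholder id: use whatever is loaded
    "messages": [{"role": "user", "content": "Write about 300 words on GPUs."}],
    "max_tokens": 512,
}

start = time.time()
data = requests.post(URL, json=payload, timeout=600).json()
elapsed = time.time() - start

out_tokens = data.get("usage", {}).get("completion_tokens", 0)
print(f"{out_tokens} tokens in {elapsed:.1f}s, roughly {out_tokens / elapsed:.1f} tok/s")
```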
MoE-powered speedups are live too. Ring-mini-sparse-2.0-exp uses MoBA (Mixture of Block Attention) to cut KV cache overhead by about 87.5%, delivering up to 3x decode speedups over the dense-attention Ring-mini-2.0 at 64K context; it targets 128K via YaRN and packs 16B total parameters with only a small fraction active per token [7].
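The 87.5% figure matches simple block-sparse arithmetic: touching 1 of every 8 KV blocks per query leaves 12.5% of the cache in play. The sketch below only works through that arithmetic as an illustration; it is not the model's actual block-selection rule.

```python
# Worked arithmetic for the quoted ~87.5% KV reduction: if each decode step
# touches only 1 of every 8 KV blocks, 12.5% of the cache is read per token.
kept_fraction = 1 / 8
print(f"KV touched per token: {kept_fraction:.1%}, reduction: {1 - kept_fraction:.1%}")

# At 64K context that is the difference between scanning ~64K cached positions
# per token and ~8K. An 8x sparser KV scan caps the attention-side speedup at
# ~8x, so a ~3x end-to-end gain is plausible once the non-attention (expert/FFN)
# share of decode time is counted.
context = 64_000
print(f"positions scanned per token: {context} dense vs {int(context * kept_fraction)} sparse")
```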
Community verdicts in the open-weights space favor fast, scalable options like GPT-OSS-120B and GPT-OSS-20B for local inference, underscoring the rapid tooling and model diversity driving 2025's on-prem scene [6].
Bottom line: 2025’s on‑prem LLMs range from tiny, quiet boxes to MoE‑heavy powerhouses—the right pick hinges on your workload and budget.
References
[1] Local AI config: Mini ITX single RTX PRO 6000 Workstation for inference?
Hardware build for multi-user LLMs; compares GPT-OSS-120B and Llama 3.3 70B on cost, speed, and cloud vs on-prem RAM/GPU.
[2] Why would I not get the GMKtec EVO-T1 for running Local LLM inference?
Compares the EVO-T1 and EVO-X2 for local LLM inference; questions 32B performance, cites memory bandwidth, and critiques 15 t/s claims.
[3] Support for Ling and Ring models (1000B/103B/16B) has finally been merged into llama.cpp
Discussion of the Ling/Ring models (1T/103B/16B), MXFP4 quantization vs UD-Q4_K_XL, speeds, benchmarks, availability, and role-playing.
[4] Is it normal to reach 180-210 tk/s with 30B local LLM?
Discusses token-per-second speeds for Qwen3-30B, MoE vs dense, context length, and GPU/engine factors.
[5] Small LLM runs on VPS without GPU
Discusses CPU RAM limits for tiny LLMs; compares Qwen3-30B-A3B, gpt-oss variants, and SmallThinker on tasks like JSON extraction, summarization, and date handling.
[6] Best Local LLMs - October 2025
Monthly thread discussing open-weight local LLMs, setups, tool use, coding, and RP/text models, with diverse performance notes, experiences, and preferences.
[7] Ring-mini-sparse-2.0-exp, yet another experimental open source model from inclusionAI that tries to improve performance over long contexts
Discusses Ring-mini-sparse-2.0-exp, MoBA, long-context efficiency, comparisons with Ring-mini-linear-2.0, open weights, no RLHF, and benchmarks across math, coding, and science evals.