Jet-Nemotron, an efficient language model built with post neural architecture search (post-NAS), is shaking up on-prem AI efficiency. Its emergence highlights a practical path toward production-ready LLMs that don't rely on cloud-scale hardware. [1]
Benchmarks show a dual RTX 5090 setup running unquantized Gemma-3-12b on vLLM delivering higher throughput than a single card. At higher concurrency, 2x RTX 5090 reaches about 3749 total tok/s versus around 2543 tok/s on 1x RTX 5090, and the write-up also weighs data-parallel (DP) against tensor-parallel (TP) configurations. [2]
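For context, here is a minimal sketch of how such a throughput comparison could be run with vLLM's offline Python API. The model id, request count, and sampling settings are illustrative assumptions, not the benchmark's exact setup.

```python
# Rough throughput sketch with vLLM on 2 GPUs (assumed model id and batch size).
import time
from vllm import LLM, SamplingParams

prompts = ["Explain tensor parallelism in one paragraph."] * 64  # simulate concurrent requests
sampling = SamplingParams(max_tokens=256, temperature=0.7)

# tensor_parallel_size=2 splits each layer across both GPUs (the "TP" setup);
# for a "DP" comparison you would instead run two independent single-GPU
# engines and divide the request batch between them.
llm = LLM(model="google/gemma-3-12b-it", tensor_parallel_size=2)

start = time.time()
outputs = llm.generate(prompts, sampling)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} generated tok/s across {len(prompts)} requests")
```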
DIY LLM server nerd-out: here’s what folks weigh as they plan on-prem rigs.
• GPU options (RTX 5090 vs 7900XTX): power, price, and PCIe considerations drive the choice; the 5090's power connector and heat are real hurdles, while the 7900XTX is cheaper but brings its own performance and software trade-offs. [3]
• Software choices: Open WebUI, llama.cpp, and Ollama are the common wrappers and runtimes for local inference (see the sketch after this list). [3]
• Linux and drivers: aim for a stable Linux distro that plays well with Nvidia drivers. [3]
• Storage basics: the debate is RAID0 across multiple SSDs versus a single PCIe 5.0 drive for model load times. [3]
• Gemma 27b reality check: expect warnings that Gemma 27b may feel sufficient today but can quickly become a bottleneck. [3]
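If you go the Ollama route, a quick sanity check against its local HTTP API looks roughly like this; the model tag, prompt, and default port are assumptions you would swap for your own setup.

```python
# Smoke test against a local Ollama server (default port 11434).
# Assumes the model tag "gemma3:27b" has already been pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gemma3:27b", "prompt": "Say hello in five words.", "stream": False},
    timeout=120,
)
resp.raise_for_status()
body = resp.json()

print(body["response"])
# eval_count / eval_duration (nanoseconds) give a rough decode tok/s figure.
print(body["eval_count"] / (body["eval_duration"] / 1e9), "tok/s")
```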
Bottom line: the on-prem AI landscape is maturing fast—efficiency-focused models, explicit DP-TP tradeoffs, and pragmatic hardware/software choices are all in play for real deployments. [1][2][3]
References
[1] Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search
Presents Jet-Nemotron, an efficient language model designed with post neural architecture search.
[2] Benchmarked 2x 5090 with vLLM and Gemma-3-12b unquantized
Dual-5090 benchmarks with vLLM and unquantized Gemma-3-12b; compares DP vs TP, throughput, VRAM, and multi-user viability in production scenarios.
[3] Need some advice on building a dedicated LLM server
Discusses a local LLM server build: GPU choice, CPU/mobo, storage, Linux, Open WebUI, llama.cpp, Ollama, RAID, cloud alternatives, and the model-size debate.