
On-device LLMs in 2025: from 12GB VRAM struggles to 96GB heat maps—is local inference finally viable?

Opinions on LLMs On-device

Local LLMs are finally delivering a credible on-device story. In 2025 threads, enthusiasts push 12GB-VRAM cards like an overclocked RTX 4070 and chase usable MoE models for coding and daily tasks [1].

Running on 12GB VRAM
On 12GB VRAM, the conversation centers on squeezing speed out of MoE models while context sizes creep upward. The takeaway: you can get decent work done, but you’re trading latency and scale [1].
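
For the 12GB tier, the usual pattern is partial GPU offload of a quantized MoE GGUF. Below is a minimal llama-cpp-python sketch; the model file, layer count, and context size are illustrative assumptions, not settings taken from the thread, and you would tune n_gpu_layers until the weights plus KV cache fit your card.

from llama_cpp import Llama

# Hypothetical quantized MoE GGUF on local disk (path is an assumption).
llm = Llama(
    model_path="models/qwen3-30b-a3b-instruct-q4_k_m.gguf",
    n_gpu_layers=28,   # partial offload; -1 would try to put every layer on the GPU
    n_ctx=16384,       # larger contexts grow the KV cache and eat into the 12 GB budget
)

out = llm("Write a Python function that merges two sorted lists.", max_tokens=256)
print(out["choices"][0]["text"])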

Freeing KV-cache memory
The kvcached library lets local inference engines like SGLang and vLLM release idle KV caches instead of hogging VRAM. It’s pip-installable, with early support for Ollama and LM Studio in progress [2].
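
The post doesn’t spell out kvcached’s API here, so the toy class below only illustrates the underlying idea: track when each request’s KV block was last touched, drop blocks that sit idle past a threshold, and hand the memory back to the allocator. This is a conceptual sketch, not the library’s interface.

import time
import torch

class IdleKVCache:
    """Toy illustration of freeing idle KV cache. Not kvcached's actual API."""

    def __init__(self, idle_seconds: float = 30.0):
        self.idle_seconds = idle_seconds
        self.blocks = {}  # request_id -> (kv_tensor, last_used_timestamp)

    def put(self, request_id: str, kv_tensor: torch.Tensor) -> None:
        self.blocks[request_id] = (kv_tensor, time.monotonic())

    def get(self, request_id: str) -> torch.Tensor:
        kv, _ = self.blocks[request_id]
        self.blocks[request_id] = (kv, time.monotonic())  # refresh last-used time
        return kv

    def reclaim_idle(self) -> int:
        """Drop blocks idle longer than the threshold; return how many were freed."""
        now = time.monotonic()
        stale = [rid for rid, (_, t) in self.blocks.items() if now - t > self.idle_seconds]
        for rid in stale:
            del self.blocks[rid]
        if stale and torch.cuda.is_available():
            torch.cuda.empty_cache()  # return the freed blocks to the CUDA allocator
        return len(stale)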

Offline resilience & hardware options
If the cloud disappears tomorrow, the discussion turns to offline rigs. People point to DGX Spark and DGX Station as offline powerhouses, alongside pricing chatter around NVIDIA’s personal AI hardware [3].

Memory upgrades & deals
Deal news: an AMD Ryzen AI Max+ 395 system with 128GB RAM is now roughly €1,581 in Europe via the Bosgame M5 line, making memory-heavy setups a touch more affordable [4].

96GB paths for MoE/offload on Pro GPUs
On a 96GB RTX Pro 6000 Blackwell, the field leans toward GLM-4.5-Air-GGUF and Qwen3-235B-A22B-GGUF, with gpt-oss-120b also in the mix. Either vLLM or TensorRT-LLM handles them well; Blackwell adds NVFP4 alongside FP8 quantization, and contexts can be pushed large (some MoE builds claim up to 128k) [5].
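
For a rough sense of the 96GB path, here is a minimal vLLM sketch serving gpt-oss-120b, whose MXFP4 weights come in around 60 GB and leave headroom for a long context on a single card. The context length and memory fraction below are assumptions for illustration, not a tested recipe from the thread.

from vllm import LLM, SamplingParams

# Assumed settings: a 128k context (as claimed in the thread) and a memory
# fraction that leaves room for activations and the KV cache.
llm = LLM(
    model="openai/gpt-oss-120b",
    max_model_len=131072,
    gpu_memory_utilization=0.92,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Summarize the trade-offs of MoE expert offloading."], params)
print(outputs[0].outputs[0].text)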

Bottom line: local inference is finally viable for power users who chase MoE/offload paths and keep an eye on hardware deals.

References

[1] Reddit, “What are your favorite models to run on 12gb vram (4070 oc)”. Discusses ideal local LLMs for 12GB VRAM, hardware upgrades, MoE models, and the feasibility of large models (e.g., gpt-oss-120b) today.

[2] Reddit, “Free GPU memory during local LLM inference without KV cache hogging VRAM”. Announces kvcached for freeing KV cache in local LLMs; supports SGLang and vLLM, with llama.cpp and multi-agent use discussed and feedback invited.

[3] Reddit, “Assume LLM becomes illegal to offer to consumers tomorrow, What will YOU personally be able to run?”. Debates running local LLMs without the cloud, with users sharing hardware specs, model sizes, MoE tricks, and regulatory and cost concerns.

[4] Reddit, “Deal on Ryzen 395 w/ 128GB, now 1581€ in Europe”. Discusses the Bosgame M5 (Ryzen 395, 128GB RAM) for local LLMs: reviews, ROCm/Vulkan support, power draw, taxes, pricing, and user experiences.

[5] Reddit, “Best LLM for 96G RTX Pro 6000 Blackwell?”. Examines the best models that fit the VRAM of a 96GB RTX Pro 6000 Blackwell; notes MoEs, offloading, speed, and tooling.
