On-device LLMs aren't just plausible anymore; they're practical. The chatter highlights a pure Go project delivering hardware-accelerated local inference on vision-language models (VLMs) via llama.cpp [1]. If you want price-per-token on your desk, that local stack is real.
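As a taste of what running against a local stack looks like, here is a minimal Go sketch that talks to a locally running llama.cpp llama-server through its OpenAI-compatible chat endpoint. The port, model name, and prompt are placeholder assumptions, and this is not the API of the linked Go project, just a generic local-inference client.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Minimal request/response shapes for the OpenAI-compatible
// /v1/chat/completions endpoint that llama-server exposes.
type chatRequest struct {
	Model    string    `json:"model"`
	Messages []message `json:"messages"`
}

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatResponse struct {
	Choices []struct {
		Message message `json:"message"`
	} `json:"choices"`
}

func main() {
	// Assumes llama-server is already running locally, e.g.:
	//   llama-server -m model.gguf --port 8080
	body, _ := json.Marshal(chatRequest{
		Model: "local-model", // placeholder name for the loaded model
		Messages: []message{
			{Role: "user", Content: "Summarize what a VLM is in one sentence."},
		},
	})

	resp, err := http.Post("http://localhost:8080/v1/chat/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out chatResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	if len(out.Choices) > 0 {
		fmt.Println(out.Choices[0].Message.Content)
	}
}
```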
A dual RTX 3060 setup pushes ~30 tokens/second on 8B models, with a 145W per-card power limit keeping clocks steady and free of thermal throttling [2]. The thread also stresses budget-conscious optimizations like heavy Ollama tuning.
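The power-limiting step itself is just nvidia-smi. Below is a small Go sketch that applies the thread's 145W cap to both cards; the GPU indices are an assumption about a two-card box, and the command normally needs root privileges.

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Cap each RTX 3060 at 145W, the limit used in the referenced build.
	// nvidia-smi -i <index> -pl <watts> sets a per-GPU power limit;
	// run with sufficient privileges (e.g. via sudo).
	for _, gpu := range []string{"0", "1"} {
		cmd := exec.Command("nvidia-smi", "-i", gpu, "-pl", "145")
		out, err := cmd.CombinedOutput()
		if err != nil {
			fmt.Printf("GPU %s: %v\n%s", gpu, err, out)
			continue
		}
		fmt.Printf("GPU %s: %s", gpu, out)
	}
}
```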
The Qwen3 family shows how memory budgets play out: the 8B model in roughly 6GB of VRAM clocks about 56 tokens/second, while the 30B (likely the sparse 30B-A3B mixture-of-experts variant, which activates only a few billion parameters per token) in roughly 18GB reaches ~78 tokens/second [2]. Splitting the models across the two GPUs underlines how VRAM capacity and threading tradeoffs drive throughput.
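Those VRAM figures line up with a simple rule of thumb: quantized weight size is roughly parameter count times bits-per-weight, plus an allowance for KV cache and runtime overhead. The sketch below is back-of-envelope estimation, not numbers from the thread; the bits-per-weight and overhead values are assumptions.

```go
package main

import "fmt"

// estimateVRAMGB gives a rough VRAM footprint for a quantized model:
// weights (params * bits / 8) plus a flat allowance for KV cache,
// activations, and runtime overhead. Purely a rule of thumb.
func estimateVRAMGB(paramsBillions, bitsPerWeight, overheadGB float64) float64 {
	weightsGB := paramsBillions * bitsPerWeight / 8.0
	return weightsGB + overheadGB
}

func main() {
	// ~8B model at ~4.5 bits/weight (Q4-class quant) with ~1.5GB overhead
	// lands near the ~6GB figure reported for Qwen3 8B.
	fmt.Printf("8B estimate:  %.1f GB\n", estimateVRAMGB(8, 4.5, 1.5))

	// ~30B model at ~4.5 bits/weight with ~2GB overhead lands near ~19GB,
	// close to the ~18GB reported for Qwen3 30B.
	fmt.Printf("30B estimate: %.1f GB\n", estimateVRAMGB(30, 4.5, 2.0))
}
```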
Modal demonstrates 1-second voice-to-voice latency using entirely open models, a sign that the stack for real-time audio apps is maturing [3].
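Hitting a 1-second round trip means the whole ASR, LLM, and TTS pipeline has to fit in the budget, and what users perceive is time to first audio out, not full generation. The stage numbers in this sketch are illustrative assumptions, not Modal's measurements.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical per-stage budget for one voice-to-voice turn.
	// The LLM and TTS entries are time-to-first-token / first audio chunk,
	// since streaming hides the rest of the generation time.
	budget := []struct {
		stage string
		cost  time.Duration
	}{
		{"streaming ASR finalization", 200 * time.Millisecond},
		{"LLM time to first token", 300 * time.Millisecond},
		{"TTS time to first audio chunk", 250 * time.Millisecond},
		{"network + buffering overhead", 150 * time.Millisecond},
	}

	var total time.Duration
	for _, b := range budget {
		total += b.cost
		fmt.Printf("%-32s %v\n", b.stage, b.cost)
	}
	fmt.Printf("%-32s %v (target: 1s)\n", "total", total)
}
```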
GPT-OSS 120B runs on 4x RTX 3090 with vLLM and MXFP4 via a working Dockerfile example, showing that even very large models can be served on accessible, self-hosted hardware [4].
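The reason four 24GB cards are enough comes down to MXFP4's footprint: 4-bit elements plus a shared scale per 32-element block work out to a bit over 4 bits per weight, so ~120B parameters fit in 96GB with room left for KV cache once tensor parallelism splits the weights. The sketch below is back-of-envelope arithmetic under those assumptions, not output from the linked Dockerfile.

```go
package main

import "fmt"

func main() {
	const (
		paramsBillions = 120.0 // gpt-oss-120b, rounded
		bitsPerWeight  = 4.25  // MXFP4: 4-bit values + shared scale per 32-element block
		numGPUs        = 4.0
		gbPerGPU       = 24.0 // RTX 3090
	)

	weightsGB := paramsBillions * bitsPerWeight / 8.0
	totalGB := numGPUs * gbPerGPU
	perGPUWeights := weightsGB / numGPUs // tensor parallelism splits weights across GPUs

	fmt.Printf("quantized weights: ~%.0f GB total, ~%.1f GB per GPU\n", weightsGB, perGPUWeights)
	fmt.Printf("headroom for KV cache + activations: ~%.0f GB across %.0f GPUs\n",
		totalGB-weightsGB, numGPUs)
}
```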
Offline use cases live beyond the lab: practical edge scenarios on Android devices and projects like ToolNeuron point to robust, offline-first AI experiences [5].
Bottom line: open stacks—from llama.cpp to GPT-OSS 120B—are delivering real latency and cost wins, with edge deployments moving from curiosity to everyday reality.
References
[1] Pure Go hardware accelerated local inference on VLMs using llama.cpp. Go-based, hardware-accelerated local inference on VLMs built on llama.cpp; project link.
[2] Follow-up to my dual RTX 3060 build (originally posted on r/Ollama): now hitting 30 t/s on 8B models using 145W power limiting. Compares 8B vs 30B Qwen3 models on a dual RTX 3060 setup, covering the 145W power limit, VRAM usage, and throughput benchmarks.
[3] 1 second voice-to-voice latency with all open models. Discusses achieving 1s voice-to-voice latency with entirely open models, focused on open LLMs and real-time voice applications.
[4] Working Dockerfile for gpt-oss-120b on 4x RTX 3090 (vLLM + MXFP4). Thread on running GPT-OSS 120B on 4x RTX 3090, covering vLLM versions, tensor parallelism, image choices, and Triton kernel compatibility.
[5] Any suggestions for running AI models completely offline. Discussion of offline LLM use on edge devices, model sizes from 0.5B to 12B, apps and frameworks, pros and cons, and privacy concerns.