On-device LLMs aren't just plausible anymore; they're practical. The chatter highlights a pure Go project delivering hardware-accelerated local inference on vision-language models (VLMs) via llama.cpp [1]. If you want price-per-token on your desk, that local stack is real.
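As a taste of what running against a local stack looks like, here is a minimal Go sketch that talks to a locally running llama.cpp llama-server through its OpenAI-compatible chat endpoint. The port, model name, and prompt are placeholder assumptions, and this is not the API of the linked Go project, just a generic local-inference client.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Minimal request/response shapes for the OpenAI-compatible
// /v1/chat/completions endpoint that llama-server exposes.
type chatRequest struct {
	Model    string    `json:"model"`
	Messages []message `json:"messages"`
}

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatResponse struct {
	Choices []struct {
		Message message `json:"message"`
	} `json:"choices"`
}

func main() {
	// Assumes llama-server is already running locally, e.g.:
	//   llama-server -m model.gguf --port 8080
	body, _ := json.Marshal(chatRequest{
		Model: "local-model", // placeholder name for the loaded model
		Messages: []message{
			{Role: "user", Content: "Summarize what a VLM is in one sentence."},
		},
	})

	resp, err := http.Post("http://localhost:8080/v1/chat/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out chatResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	if len(out.Choices) > 0 {
		fmt.Println(out.Choices[0].Message.Content)
	}
}
```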
A dual RTX 3060 setup pushes ~30 tokens/second on 8B models, with a 145W per-card power limit keeping clocks steady and free of thermal throttling [2]. The thread also stresses budget-conscious optimizations like heavy Ollama tuning.
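The power-limiting step itself is just nvidia-smi. Below is a small Go sketch that applies the thread's 145W cap to both cards; the GPU indices are an assumption about a two-card box, and the command normally needs root privileges.

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Cap each RTX 3060 at 145W, the limit used in the referenced build.
	// nvidia-smi -i <index> -pl <watts> sets a per-GPU power limit;
	// run with sufficient privileges (e.g. via sudo).
	for _, gpu := range []string{"0", "1"} {
		cmd := exec.Command("nvidia-smi", "-i", gpu, "-pl", "145")
		out, err := cmd.CombinedOutput()
		if err != nil {
			fmt.Printf("GPU %s: %v\n%s", gpu, err, out)
			continue
		}
		fmt.Printf("GPU %s: %s", gpu, out)
	}
}
```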
The Qwen3 family shows how memory budgets play out: the 8B model in roughly 6GB of VRAM clocks about 56 tokens/second, while the 30B (likely the sparse 30B-A3B mixture-of-experts variant, which activates only a few billion parameters per token) in roughly 18GB reaches ~78 tokens/second [2]. Splitting the models across the two GPUs underlines how VRAM capacity and threading tradeoffs drive throughput.
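Those VRAM figures line up with a simple rule of thumb: quantized weight size is roughly parameter count times bits-per-weight, plus an allowance for KV cache and runtime overhead. The sketch below is back-of-envelope estimation, not numbers from the thread; the bits-per-weight and overhead values are assumptions.

```go
package main

import "fmt"

// estimateVRAMGB gives a rough VRAM footprint for a quantized model:
// weights (params * bits / 8) plus a flat allowance for KV cache,
// activations, and runtime overhead. Purely a rule of thumb.
func estimateVRAMGB(paramsBillions, bitsPerWeight, overheadGB float64) float64 {
	weightsGB := paramsBillions * bitsPerWeight / 8.0
	return weightsGB + overheadGB
}

func main() {
	// ~8B model at ~4.5 bits/weight (Q4-class quant) with ~1.5GB overhead
	// lands near the ~6GB figure reported for Qwen3 8B.
	fmt.Printf("8B estimate:  %.1f GB\n", estimateVRAMGB(8, 4.5, 1.5))

	// ~30B model at ~4.5 bits/weight with ~2GB overhead lands near ~19GB,
	// close to the ~18GB reported for Qwen3 30B.
	fmt.Printf("30B estimate: %.1f GB\n", estimateVRAMGB(30, 4.5, 2.0))
}
```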
Modal demonstrates 1-second voice-to-voice latency using entirely open models, a sign that the stack for real-time audio apps is maturing [3].
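Hitting a 1-second round trip means the whole ASR, LLM, and TTS pipeline has to fit in the budget, and what users perceive is time to first audio out, not full generation. The stage numbers in this sketch are illustrative assumptions, not Modal's measurements.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical per-stage budget for one voice-to-voice turn.
	// The LLM and TTS entries are time-to-first-token / first audio chunk,
	// since streaming hides the rest of the generation time.
	budget := []struct {
		stage string
		cost  time.Duration
	}{
		{"streaming ASR finalization", 200 * time.Millisecond},
		{"LLM time to first token", 300 * time.Millisecond},
		{"TTS time to first audio chunk", 250 * time.Millisecond},
		{"network + buffering overhead", 150 * time.Millisecond},
	}

	var total time.Duration
	for _, b := range budget {
		total += b.cost
		fmt.Printf("%-32s %v\n", b.stage, b.cost)
	}
	fmt.Printf("%-32s %v (target: 1s)\n", "total", total)
}
```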
GPT-OSS 120B runs on 4x RTX 3090 with vLLM and MXFP4 via a working Dockerfile example, showing that even very large models can be served on accessible, self-hosted hardware [4].
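The reason four 24GB cards are enough comes down to MXFP4's footprint: 4-bit elements plus a shared scale per 32-element block work out to a bit over 4 bits per weight, so ~120B parameters fit in 96GB with room left for KV cache once tensor parallelism splits the weights. The sketch below is back-of-envelope arithmetic under those assumptions, not output from the linked Dockerfile.

```go
package main

import "fmt"

func main() {
	const (
		paramsBillions = 120.0 // gpt-oss-120b, rounded
		bitsPerWeight  = 4.25  // MXFP4: 4-bit values + shared scale per 32-element block
		numGPUs        = 4.0
		gbPerGPU       = 24.0 // RTX 3090
	)

	weightsGB := paramsBillions * bitsPerWeight / 8.0
	totalGB := numGPUs * gbPerGPU
	perGPUWeights := weightsGB / numGPUs // tensor parallelism splits weights across GPUs

	fmt.Printf("quantized weights: ~%.0f GB total, ~%.1f GB per GPU\n", weightsGB, perGPUWeights)
	fmt.Printf("headroom for KV cache + activations: ~%.0f GB across %.0f GPUs\n",
		totalGB-weightsGB, numGPUs)
}
```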
Offline use cases live beyond the lab: practical edge scenarios on Android devices and projects like ToolNeuron point to robust, offline-first AI experiences [5].
Bottom line: open stacks—from llama.cpp to GPT-OSS 120B—are delivering real latency and cost wins, with edge deployments moving from curiosity to everyday reality.
References
[1] Pure Go hardware accelerated local inference on VLMs using llama.cpp. Go-based, hardware-accelerated local inference on VLMs built on llama.cpp; project link.
[2] Follow-up to my dual RTX 3060 build (originally posted on r/Ollama): now hitting 30 t/s on 8B models using 145W power limiting. Compares 8B vs 30B Qwen3 models on a dual RTX 3060 setup, covering the 145W power limit, VRAM usage, and throughput benchmarks.
[3] 1 second voice-to-voice latency with all open models. Discusses achieving 1s voice-to-voice latency with entirely open models, focused on open LLMs and real-time voice applications.
[4] Working Dockerfile for gpt-oss-120b on 4x RTX 3090 (vLLM + MXFP4). Thread on running GPT-OSS 120B on 4x RTX 3090, covering vLLM versions, tensor parallelism, image choices, and Triton kernel compatibility.
[5] Any suggestions for running AI models completely offline. Discussion of offline LLM use on edge devices, model sizes from 0.5B to 12B, apps and frameworks, pros and cons, and privacy concerns.