
Edge AI in the Real World: From a $60k GPU-Packed Rig to 16GB Phones and Self-Hosted Speech

1 min read
238 words

Edge AI is no longer a distant promise. Real-world setups show local LLMs and on-device inference maturing, from a GPU-packed rig claimed at roughly $60k to 16GB Android phones, along with self-hosted speech stacks that keep data out of the cloud entirely.

Edge Rig for Local LLMs

A local-LLM build pairs a Mac Studio M3 Ultra (512GB RAM, 4TB storage) with a 96-core Threadripper machine carrying four RTX Pro 6000 Max-Q GPUs, all claimed to come in under roughly $60k. Power is a real consideration too: a 110V/30A circuit and dual PSUs keep the node humming [1].
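As a rough illustration of how a four-GPU node like this is typically put to work, here is a minimal tensor-parallel serving sketch using vLLM; the model id and settings are placeholder assumptions, not details from the build in [1].

```python
# Minimal sketch: serving one model across four local GPUs with vLLM's
# tensor parallelism. Model id and settings are illustrative assumptions,
# not taken from the rig described in [1].
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model id
    tensor_parallel_size=4,       # shard the weights across the four GPUs
    gpu_memory_utilization=0.90,  # fraction of each GPU's VRAM vLLM may claim
)

params = SamplingParams(max_tokens=128, temperature=0.7)
out = llm.generate(["Why run LLMs locally?"], params)
print(out[0].outputs[0].text)
```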

On-Device Android AI

On the mobile front, Gemma 3 12B (QAT, q4_0) runs on a OnePlus 12 (5548 FP16 GFLOPS, 76.8 GB/s memory bandwidth) via MNN-LLM and ChatterUI, delivering around 11 tokens/s for prompt processing and 9-10 tokens/s for generation in practice [2]. Heat and battery wear are concerns; some users discuss passthrough charging so the device draws power from the charger rather than cycling the battery [2].
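To put those tokens-per-second figures in context, here is a small sketch that times a single generation against a local OpenAI-compatible completion endpoint (as llama.cpp-style servers expose); the URL and model id are assumptions, not part of the setup in [2].

```python
# Sketch: measuring generation throughput (tokens/s) against a local
# OpenAI-compatible completion endpoint. The URL and model id are
# hypothetical placeholders, not details from [2].
import time
import requests

URL = "http://127.0.0.1:8080/v1/completions"  # assumed local server

payload = {
    "model": "gemma-3-12b-it-qat-q4_0",  # illustrative model id
    "prompt": "Explain passthrough charging in two sentences.",
    "max_tokens": 128,
}

start = time.time()
resp = requests.post(URL, json=payload, timeout=300).json()
elapsed = time.time() - start

tokens = resp["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")
```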

Local Speech-to-Speech & Self-Hosted Stacks

For fully local, self-hosted speech-to-speech, a growing ecosystem includes (a bare-bones cascading loop is sketched below the list):

- Unmute.sh: Linux, cascading pipeline
- Ultravox (Fixie): Windows/Linux, hybrid UIs
- RealtimeVoiceChat: Linux-friendly, pluggable LLM
- Vocalis: macOS/Windows/Linux, tool calling via a backend LLM
- LFM2: Windows/Linux, built-in LLM plus tool calling
- Mini-omni2: cross-platform
- Pipecat: Windows/macOS/Linux/iOS/Android, explicit tool calling [3]

Vocalis also relies on Kokoro-FastAPI for TTS, and KoboldCpp can bridge to Kokoro-based flows [3].
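For readers comparing the cascading entries above, a bare-bones version of that loop looks roughly like the skeleton below; all three helpers are hypothetical placeholders for whatever local STT, LLM, and TTS components (for example Kokoro on the TTS side) a given stack wires in.

```python
# Skeleton of a cascading local speech-to-speech loop (STT -> LLM -> TTS),
# the pattern several projects in the list follow. The helper functions are
# hypothetical placeholders, not APIs from any specific project above.
from dataclasses import dataclass

@dataclass
class Turn:
    user_text: str
    reply_text: str

def transcribe(audio: bytes) -> str:
    """Placeholder: local speech-to-text (e.g. a whisper.cpp binding)."""
    raise NotImplementedError

def generate_reply(history: list[Turn], user_text: str) -> str:
    """Placeholder: query a local LLM with the conversation so far."""
    raise NotImplementedError

def synthesize(text: str) -> bytes:
    """Placeholder: local TTS (e.g. a Kokoro-FastAPI call)."""
    raise NotImplementedError

def speech_to_speech(audio_in: bytes, history: list[Turn]) -> bytes:
    user_text = transcribe(audio_in)            # 1. speech -> text
    reply = generate_reply(history, user_text)  # 2. text -> reply
    history.append(Turn(user_text, reply))
    return synthesize(reply)                    # 3. reply -> speech
```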

The takeaway: edge setups are shifting from curiosity to privacy-preserving, low-latency reality—and the options keep multiplying.

References

[1] Reddit, "New Build for local LLM": user shares a ~$60k local LLM rig with multiple GPUs; discusses models, quantization, latency, power, and comparisons to a Mac Studio.

[2] Reddit, "Anyone running llm on their 16GB android phone?": discusses running Gemma-3-12B and Qwen-3 on 16GB phones, covering speeds, hardware limits, swap, battery, cooling, and app performance benchmarks.

[3] Reddit, "Awesome Local LLM Speech-to-Speech Models & Frameworks": curates local speech-to-speech LLMs; compares cascading vs end-to-end designs, tool calling, and self-hosted setups.
