Local-first LLMs are finally landing on real devices, flipping the script on privacy, latency, and cost. 2025 deployments span ai-core on Android [1], fully local clinician scribes on macOS [2], and compact on-device clusters [4]. ai-core offers a single Kotlin/Java interface to load any GGUF or ONNX model and run it offline on Android, with multimodal vision and speech support [1].
Offline-First Android: ai-core on-device LLMs - ai-core enables offline inference with no cloud connection and no GPU required, a privacy-friendly setup for mobile apps [1]. All compute stays on-device, with zero-config setup and built-in tooling for telemetry and debugging [1].
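To make the "single interface, any model" idea concrete, here is a minimal Kotlin sketch of what such an offline loader can look like. The names (OfflineLlm, GgufBackend, loadModel) are hypothetical stand-ins for illustration only, not ai-core's actual API; consult the library's documentation for the real entry points [1].

```kotlin
// Hypothetical sketch of a unified offline-LLM interface on Android.
// Names (OfflineLlm, GgufBackend, loadModel) are illustrative only and
// do NOT correspond to ai-core's real API.

import java.io.File

// One interface regardless of whether the weights are GGUF or ONNX.
interface OfflineLlm : AutoCloseable {
    fun generate(prompt: String, maxTokens: Int = 256): String
}

// Stub backend: a real one would call into a native runtime
// (e.g. a llama.cpp or ONNX Runtime binding) entirely on-device.
class GgufBackend(private val modelFile: File) : OfflineLlm {
    override fun generate(prompt: String, maxTokens: Int): String {
        // A real backend would mmap the weights and decode tokens natively;
        // no network call is ever made.
        return "[${modelFile.name}] completion for: $prompt"
    }
    override fun close() { /* release native resources here */ }
}

fun loadModel(path: String): OfflineLlm = GgufBackend(File(path))

fun main() {
    loadModel("/data/local/tmp/phi-3-mini.Q4_K_M.gguf").use { llm ->
        println(llm.generate("Summarise today's notes in two sentences."))
    }
}
```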
Private Clinician Scribe: on-device, on macOS - On macOS, a fully local on-device AI scribe runs without the cloud, anchoring every sentence to the transcript for traceable evidence [2]. The system leans on the Foundation Models framework to load adapters and trim model size, keeping the footprint small and private data on-device [2].
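The "every sentence anchored to the transcript" idea is, at its core, attaching a supporting transcript span to each generated claim so a reviewer can check it. Below is a generic Kotlin sketch of that pattern using simple word overlap as the matching score; it illustrates the general technique, not the scribe's actual implementation [2].

```kotlin
// Minimal sketch of evidence anchoring: link each generated sentence to the
// transcript segment it most plausibly came from. Generic illustration only.

data class Anchor(val sentence: String, val evidence: String, val score: Double)

private fun tokens(s: String): Set<String> =
    s.lowercase().split(Regex("\\W+")).filter { it.isNotBlank() }.toSet()

// Jaccard word overlap as a crude relevance score.
private fun overlap(a: Set<String>, b: Set<String>): Double =
    if (a.isEmpty() || b.isEmpty()) 0.0
    else (a intersect b).size.toDouble() / (a union b).size

fun anchorSentences(summary: List<String>, transcript: List<String>): List<Anchor> =
    summary.map { sentence ->
        val words = tokens(sentence)
        val best = transcript.maxByOrNull { overlap(words, tokens(it)) } ?: ""
        Anchor(sentence, best, overlap(words, tokens(best)))
    }

fun main() {
    val transcript = listOf(
        "Patient reports a dry cough for three days.",
        "No fever, denies chest pain.",
    )
    val summary = listOf("Dry cough x3 days.", "Afebrile, no chest pain.")
    anchorSentences(summary, transcript).forEach {
        println("${it.sentence}  <-  \"${it.evidence}\" (%.2f)".format(it.score))
    }
}
```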
Three-node Jetson Orin Nano cluster: llama.cpp in the wild - A three-node cluster of Jetson Orin Nano boards runs llama.cpp via llama-server and rpc-server; throughput lands around 7 tokens/sec when spread across the trio, versus ~22 tokens/sec on a single node, with a Raspberry Pi 5 serving as the load balancer [4].
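Because llama-server speaks plain HTTP, clients stay agnostic about whether a single Orin or the whole cluster sits behind the endpoint. A minimal Kotlin client against llama-server's /completion endpoint is sketched below; the address is an assumption standing in for whichever node or load balancer fronts the cluster [4].

```kotlin
// Minimal client for llama.cpp's llama-server HTTP API (JVM, JDK 11+).
// The URL below is an assumption: point it at the node (or the Raspberry Pi 5
// load balancer) fronting the cluster described in [4].

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

fun main() {
    val body = """{"prompt": "Explain RPC-based model sharding in one sentence.", "n_predict": 64}"""

    val request = HttpRequest.newBuilder()
        .uri(URI.create("http://192.168.1.50:8080/completion")) // assumed address
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()

    val response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString())

    // llama-server returns a JSON object whose "content" field holds the text.
    println(response.body())
}
```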
AMD-local LLMs: ROCm, Vulkan paths, and speedups - AMD hardware is joining the local-LLM scene via ROCm, Vulkan, and GPU/NPU acceleration routes; tools like LM Studio and Lemonade handle local inference, with users reporting 20+ tokens/sec on modest setups [5].
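LM Studio's built-in local server is OpenAI-compatible (by default at http://localhost:1234/v1), so the same style of HTTP client works on an AMD box once a model is loaded. A brief Kotlin sketch; the model name is a placeholder for whatever model is loaded locally.

```kotlin
// Kotlin client for an OpenAI-compatible local server such as LM Studio's
// (default http://localhost:1234/v1). "local-model" is a placeholder name.

import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

fun main() {
    val body = """
        {"model": "local-model",
         "messages": [{"role": "user",
                       "content": "Summarise the trade-offs of running LLMs locally."}]}
    """.trimIndent()

    val request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:1234/v1/chat/completions"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build()

    println(HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString()).body())
}
```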
Trade-offs: offline privacy vs cloud power - The offline-vs-cloud debate endures: some argue that 1T+ parameter models still demand cloud access or hefty hardware, while local rigs offer privacy and zero cloud costs [3].
Bottom line: local-first is no longer fringe—privacy, latency, and cost considerations are reshaping 2025 deployments.
References
[1] 😎 Unified Offline LLM, Vision & Speech on Android – ai-core 0.1 Stable
Discusses the ai-core library, which offers a unified API to load GGUF/ONNX models for offline multimodal LLM, vision, and speech on Android.
[2] Built a fully local, on-device AI Scribe for clinicians — finally real, finally private
Local on-device clinician scribe; no cloud; evidence anchored to the transcript; lower hallucination than GPT-5; free beta signups for macOS devices.
[3] You can turn off the cloud, this + solar panel will suffice
Discusses offline vs cloud, quantized models, MoE speeds, and coding tools; compares Devstral, Magistral, Mistral, Qwen, and others.
[4] First attempt at building a local LLM setup in my mini rack
User builds a three-node Jetson Orin Nano cluster for llama.cpp, compares tokens/sec, explores load balancing, and weighs options for larger models.
[5] AMD Local LLM?
Discusses running local LLMs on AMD hardware; mentions LM Studio, Lemonade, Koboldcpp, and ROCm; notes GPU/NPU acceleration and setup challenges.