
The Local-First LLMs Movement: How Enthusiasts Are Pushing Privacy and Low Latency with On-Device and Self-Hosted Solutions

Opinions on the Local-First LLMs Movement:

The local-first LLMs movement is heating up, with Russet running chat entirely on-device on Apple's foundation model. Everything is private and offline: no internet connection needed, no account, no tracking, and conversations stay on your device [1].

On-device privacy wins:
- Russet demonstrates privacy-first on-device chat; all processing happens locally [1].
- A self-hosted, Apache-2.0 platform runs third-party AI agents on your own hardware via Ollama, with controlled egress and audit logging to avoid sharing data with third parties [2] (see the sketch after this list).
- Qwen 1.5B runs fully on-device on the Jetson Orin Nano with no cloud, delivering roughly 30 tokens/sec at under 10 W [4].
- Emu3.5 (an open-source "world learner") matches Gemini 2.5 Flash performance while operating entirely on local hardware [3].
- Qwen2.5-VL delivers real-time surgical video understanding on-device [3].
- Distil-expenses shows two Llama 3.2 models you can run locally via Ollama for personal-finance tasks [5].
- Granite-nano and related tools highlighted in IBM's local-copilot workflow underscore privacy-by-default on edge setups [6].
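
Several of the setups above ([2], [5], [6]) sit on top of Ollama, which serves models behind a local HTTP API, so prompts and responses never leave the machine. As a rough illustration only, here is a minimal Python sketch of a single chat turn against that local API; it assumes the Ollama daemon is running on its default port (11434) and that a model such as llama3.2 has already been pulled, and the prompt is a placeholder.

```python
import json
import urllib.request

# Local Ollama endpoint; requests stay on this machine.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def chat_locally(prompt: str, model: str = "llama3.2") -> str:
    """Send one chat turn to the local Ollama server and return the reply text."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for a single JSON response instead of a token stream
    }).encode("utf-8")

    req = urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read().decode("utf-8"))

    # Non-streaming /api/chat responses carry the assistant reply under "message".
    return body["message"]["content"]


if __name__ == "__main__":
    # Placeholder prompt in the spirit of the personal-expenses demo in [5].
    print(chat_locally("Summarize my grocery spending for October in one sentence."))
```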

Local-edge setups aren't just sci-fi; people are testing them on consumer hardware and dedicated devices alike. The chatter spans weekly roundups like Last week in Multimodal AI - Local Edition, with multimodal models and edge-friendly designs leading the way [3].

Closing thought: the draw of purely local AI is strong. Privacy, latency, and control come first, even as cloud-backed options still push scale. Expect more hands-on, on-device experiments to ripple into mainstream toolchains [3][4].

References

[1] HackerNews
Show HN: Russet uses Apple's on-device foundation model for private, offline AI chat; runs locally with privacy and no tracking.

[2] Reddit
Self-hosted platform for running third-party AI agents with Ollama support (Apache-2.0). Open-source platform to run third-party AI agents locally with Ollama, focusing on privacy, controlled egress, and auditability.

[3] Reddit
Last week in Multimodal AI - Local Edition. Highlights edge-friendly multimodal models (Emu3.5, Qwen2.5-VL, ChronoEdit, Wan2GP, LongCat-Flash-Omni, Ming-flash-omni) with local/offline use.

[4] Reddit
Running Qwen 1.5B Fully On-Device on Jetson Orin Nano - No Cloud, Under 10W Power. User shares a local on-device Qwen 1.5B setup on the Jetson Orin Nano, no cloud, low power; discusses speed, tasks, and comparisons.

[5] Reddit
We trained SLM-powered assistants for personal expenses summaries that you can run locally via Ollama. Compares locally run Llama 3.2 SLMs to GPT-OSS, showing distillation gains and tool-calling limits in a personal-expenses demo.

[6] Reddit
IBM Developer - Setting up a local co-pilot using Ollama with VS Code (or VSCodium for a no-telemetry, air-gapped setup) with the Continue extension. Discusses offline/local coding assistants (Granite, Qwen) with private deployment, comparing performance, hardware requirements, privacy, maintenance, and overall cost against cloud providers.
