On-device LLMs are increasingly competitive with cloud AI on speed and privacy, and the surrounding ecosystem is maturing fast. Real-world tests span a 270M local phishing detector, offline terminal tooling, browser-only RAG, and lightweight models on modest hardware. [1][3]
Charlemagne Labs built Agent Charley, a 270M-parameter model that detects phishing URLs entirely on-device, with no cloud calls needed. [1]
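For a sense of how a small classifier like this gets wired up on-device, here is a minimal sketch using the Transformers.js text-classification pipeline. Agent Charley itself isn't published under a known hub id, so the model name below is a placeholder, not the project's actual checkpoint.

```ts
// Minimal sketch of on-device URL classification with Transformers.js.
// The model id is a PLACEHOLDER assumption; swap in any small
// text-classification checkpoint that runs locally.
import { pipeline } from "@huggingface/transformers";

const classify = await pipeline(
  "text-classification",
  "your-org/phishing-url-270m" // hypothetical id for illustration
);

// Score a suspicious-looking URL entirely in-process; nothing leaves the device.
const result = await classify("http://secure-login.example-bank.top/verify");
console.log(result); // e.g. [{ label: "phishing", score: 0.97 }]
```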
ShellAI (Show HN: Local Terminal Assistance with SLM) brings an offline, small-model-powered assistant to the terminal for hands-on, privacy-preserving command-line workflows. [2]
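ShellAI's internals aren't shown in the post, but a terminal assistant of this kind can be sketched in a few lines against a locally served model. The snippet below is a generic sketch that assumes Ollama's default local API on port 11434 and a small model tag of your choosing; it is not ShellAI's actual code.

```ts
// Hypothetical ShellAI-style loop: read a task, ask a local model for a shell command.
// Assumes Ollama is running locally and the model tag below is already pulled.
import * as readline from "node:readline/promises";
import { stdin, stdout } from "node:process";

async function suggestCommand(task: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "qwen2.5:3b", // assumed small local model, not ShellAI's choice
      prompt: `Reply with a single POSIX shell command that: ${task}`,
      stream: false,
    }),
  });
  const data = (await res.json()) as { response: string };
  return data.response.trim();
}

const rl = readline.createInterface({ input: stdin, output: stdout });
const task = await rl.question("What do you want to do? ");
console.log(await suggestCommand(task));
rl.close();
```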
WebPizza demonstrates a browser-only RAG pipeline with no backend. Documents never leave your device. Performance numbers are eye-opening: Phi-3 Mini 3‑6 tokens/sec (WebLLM) and 12‑20 tokens/sec (WeInfer); Llama 3.2 1B hits 8‑12 tokens/sec. [3]
Stack highlights (all on-device in the demo): WebLLM, WeInfer, Transformers.js, IndexedDB, PDF.js. [3]
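As an illustration of how those pieces fit together, here is a stripped-down, in-browser RAG sketch in the spirit of the demo: Transformers.js for embeddings, WebLLM for generation, with a plain in-memory store standing in for the IndexedDB persistence and PDF.js parsing the real demo uses. Model ids are examples, not necessarily the demo's exact choices.

```ts
// Browser-only RAG sketch (requires WebGPU). Embed chunks, retrieve by cosine
// similarity, then answer with a locally loaded WebLLM model.
import { pipeline } from "@huggingface/transformers";
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const embedder = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC"); // example prebuilt id

async function embed(text: string): Promise<number[]> {
  const out = await embedder(text, { pooling: "mean", normalize: true });
  return Array.from(out.data as Float32Array);
}

// Vectors are normalized, so the dot product is the cosine similarity.
const cosine = (a: number[], b: number[]) => a.reduce((s, x, i) => s + x * b[i], 0);

async function answer(question: string, chunks: string[]): Promise<string> {
  const qVec = await embed(question);
  const scored = await Promise.all(
    chunks.map(async (c) => ({ c, score: cosine(qVec, await embed(c)) }))
  );
  const context = scored.sort((x, y) => y.score - x.score)[0].c; // top-1 chunk only
  const reply = await engine.chat.completions.create({
    messages: [
      { role: "system", content: `Answer using only this context:\n${context}` },
      { role: "user", content: question },
    ],
  });
  return reply.choices[0].message.content ?? "";
}
```

A real pipeline would embed chunks once and persist them (the demo uses IndexedDB for this) rather than re-embedding per question.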
Gemma 3:4b on low-end devices remains a mixed bag: on an 8GB MacBook Air it can be serviceable, but vision is underwhelming, and Gemma 3:12b is noticeably heavier; Qwen3-VL-Instruct 4b fails outright on the same Mac, while the 2b variant runs but slowly. [4]
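To reproduce that kind of vision test locally, Ollama's generate endpoint accepts base64-encoded images for multimodal models; the model tag below assumes Gemma 3 4B is already pulled, and the file path is an example.

```ts
// Send an image to a local multimodal model via Ollama's /api/generate.
import { readFile } from "node:fs/promises";

const image = (await readFile("photo.jpg")).toString("base64"); // example path
const res = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "gemma3:4b", // the tag tested in the thread
    prompt: "Describe this image in one sentence.",
    images: [image],
    stream: false,
  }),
});
console.log(((await res.json()) as { response: string }).response);
```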
Llama.cpp vs Ollama experiences show notable consistency gaps. Ollama running GPT-OSS:20B with an MCP web-search pipeline via n8n works about 90-95% of the time; Llama.cpp with the same model, parameters, and system prompts can be wildly inconsistent, with chat templates and quantization the likely culprits. [5]
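One way to keep such comparisons apples-to-apples: both llama.cpp's llama-server (default port 8080) and Ollama (port 11434) expose OpenAI-compatible chat endpoints, so the identical request body can be sent to each. The model name and sampling values below are illustrative assumptions, not the thread author's exact settings.

```ts
// Fire the same OpenAI-style request at llama-server and Ollama and compare outputs.
const body = JSON.stringify({
  model: "gpt-oss:20b", // llama-server ignores this and uses its loaded model
  messages: [{ role: "user", content: "Summarize today's HN front page." }],
  temperature: 0.2,
  top_p: 0.9,
});

for (const base of ["http://localhost:8080/v1", "http://localhost:11434/v1"]) {
  const res = await fetch(`${base}/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body,
  });
  const data = (await res.json()) as { choices: { message: { content: string } }[] };
  console.log(base, "->", data.choices[0].message.content.slice(0, 120));
}
```

Even with identical sampling, differences in chat templates and quantization between the two runtimes can still explain divergent behavior, which is what the thread points to. [5]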
Bottom line: local inference is gaining ground, but cloud still wins on reliability and features. Watch how the ecosystem tightens around models and tooling like GPT-OSS 20B, Llama.cpp, and browser-native RAG as hardware improves.
References
[1] We built a 270M local model to detect phishing URLs. Prototype of a 270M-parameter on-device model for phishing-URL detection, highlighting local NLP capability and efficiency advantages over the cloud.
[2] Show HN: ShellAI - Local Terminal Assistance with SLM. A local LLM-powered terminal assistant enabling offline, privacy-preserving command support for developers and ops.
[3] WebPizza (browser-only RAG, no backend). Proof of concept running RAG fully in-browser with WebGPU; supports Phi-3, Llama 3, and Mistral 7B; local, no backend; author requests feedback.
[4] Local AI with image input for low end devices? Gemma 3:4b attempted with weak vision; Gemma 3:12b slow; Qwen3-VL-Instruct 4b fails on Mac, 2b works but slowly; hardware is the limiting factor.
[5] Llama.cpp vs Ollama - Same model, parameters and system prompts but VASTLY different experiences. Direct comparison using GPT-OSS 20B; tuning, chat templates, and quantization significantly affect results across hardware and settings.