On-device LLMs are increasingly competitive with cloud AI on speed and privacy, and the surrounding ecosystem is maturing fast. Real-world tests span a 270M local phishing detector, offline terminal tooling, browser-only RAG, and lightweight models on modest hardware. [1][3]
Charlemagne Labs built Agent Charley, a 270M-parameter model that detects phishing URLs entirely on-device, with no cloud calls needed. [1]
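For a sense of how a small classifier like this gets wired up on-device, here is a minimal sketch using the Transformers.js text-classification pipeline. Agent Charley itself isn't published under a known hub id, so the model name below is a placeholder, not the project's actual checkpoint.

```ts
// Minimal sketch of on-device URL classification with Transformers.js.
// The model id is a PLACEHOLDER assumption; swap in any small
// text-classification checkpoint that runs locally.
import { pipeline } from "@huggingface/transformers";

const classify = await pipeline(
  "text-classification",
  "your-org/phishing-url-270m" // hypothetical id for illustration
);

// Score a suspicious-looking URL entirely in-process; nothing leaves the device.
const result = await classify("http://secure-login.example-bank.top/verify");
console.log(result); // e.g. [{ label: "phishing", score: 0.97 }]
```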
ShellAI (Show HN: Local Terminal Assistance with SLM) brings an offline, small-model-powered assistant to the terminal for hands-on, privacy-preserving command-line workflows. [2]
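ShellAI's internals aren't shown in the post, but a terminal assistant of this kind can be sketched in a few lines against a locally served model. The snippet below is a generic sketch that assumes Ollama's default local API on port 11434 and a small model tag of your choosing; it is not ShellAI's actual code.

```ts
// Hypothetical ShellAI-style loop: read a task, ask a local model for a shell command.
// Assumes Ollama is running locally and the model tag below is already pulled.
import * as readline from "node:readline/promises";
import { stdin, stdout } from "node:process";

async function suggestCommand(task: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "qwen2.5:3b", // assumed small local model, not ShellAI's choice
      prompt: `Reply with a single POSIX shell command that: ${task}`,
      stream: false,
    }),
  });
  const data = (await res.json()) as { response: string };
  return data.response.trim();
}

const rl = readline.createInterface({ input: stdin, output: stdout });
const task = await rl.question("What do you want to do? ");
console.log(await suggestCommand(task));
rl.close();
```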
WebPizza demonstrates a browser-only RAG pipeline with no backend. Documents never leave your device. Performance numbers are eye-opening: Phi-3 Mini 3‑6 tokens/sec (WebLLM) and 12‑20 tokens/sec (WeInfer); Llama 3.2 1B hits 8‑12 tokens/sec. [3]
Stack highlights (all on-device in the demo): WebLLM, WeInfer, Transformers.js, IndexedDB, PDF.js. [3]
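As an illustration of how those pieces fit together, here is a stripped-down, in-browser RAG sketch in the spirit of the demo: Transformers.js for embeddings, WebLLM for generation, with a plain in-memory store standing in for the IndexedDB persistence and PDF.js parsing the real demo uses. Model ids are examples, not necessarily the demo's exact choices.

```ts
// Browser-only RAG sketch (requires WebGPU). Embed chunks, retrieve by cosine
// similarity, then answer with a locally loaded WebLLM model.
import { pipeline } from "@huggingface/transformers";
import { CreateMLCEngine } from "@mlc-ai/web-llm";

const embedder = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f16_1-MLC"); // example prebuilt id

async function embed(text: string): Promise<number[]> {
  const out = await embedder(text, { pooling: "mean", normalize: true });
  return Array.from(out.data as Float32Array);
}

// Vectors are normalized, so the dot product is the cosine similarity.
const cosine = (a: number[], b: number[]) => a.reduce((s, x, i) => s + x * b[i], 0);

async function answer(question: string, chunks: string[]): Promise<string> {
  const qVec = await embed(question);
  const scored = await Promise.all(
    chunks.map(async (c) => ({ c, score: cosine(qVec, await embed(c)) }))
  );
  const context = scored.sort((x, y) => y.score - x.score)[0].c; // top-1 chunk only
  const reply = await engine.chat.completions.create({
    messages: [
      { role: "system", content: `Answer using only this context:\n${context}` },
      { role: "user", content: question },
    ],
  });
  return reply.choices[0].message.content ?? "";
}
```

A real pipeline would embed chunks once and persist them (the demo uses IndexedDB for this) rather than re-embedding per question.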
Gemma 3:4b on low-end devices remains a mixed bag: on an 8GB MacBook Air it can be serviceable, but vision is underwhelming, and Gemma 3:12b is noticeably heavier; Qwen3-VL-Instruct 4b fails outright on the same Mac, while the 2b variant runs but slowly. [4]
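To reproduce that kind of vision test locally, Ollama's generate endpoint accepts base64-encoded images for multimodal models; the model tag below assumes Gemma 3 4B is already pulled, and the file path is an example.

```ts
// Send an image to a local multimodal model via Ollama's /api/generate.
import { readFile } from "node:fs/promises";

const image = (await readFile("photo.jpg")).toString("base64"); // example path
const res = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "gemma3:4b", // the tag tested in the thread
    prompt: "Describe this image in one sentence.",
    images: [image],
    stream: false,
  }),
});
console.log(((await res.json()) as { response: string }).response);
```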
Llama.cpp vs Ollama experiences show notable consistency gaps. Ollama running GPT-OSS:20B with an MCP web-search pipeline via n8n works about 90-95% of the time; Llama.cpp with the same model, parameters, and system prompts can be wildly inconsistent, with chat templates and quantization the likely culprits. [5]
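One way to keep such comparisons apples-to-apples: both llama.cpp's llama-server (default port 8080) and Ollama (port 11434) expose OpenAI-compatible chat endpoints, so the identical request body can be sent to each. The model name and sampling values below are illustrative assumptions, not the thread author's exact settings.

```ts
// Fire the same OpenAI-style request at llama-server and Ollama and compare outputs.
const body = JSON.stringify({
  model: "gpt-oss:20b", // llama-server ignores this and uses its loaded model
  messages: [{ role: "user", content: "Summarize today's HN front page." }],
  temperature: 0.2,
  top_p: 0.9,
});

for (const base of ["http://localhost:8080/v1", "http://localhost:11434/v1"]) {
  const res = await fetch(`${base}/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body,
  });
  const data = (await res.json()) as { choices: { message: { content: string } }[] };
  console.log(base, "->", data.choices[0].message.content.slice(0, 120));
}
```

Even with identical sampling, differences in chat templates and quantization between the two runtimes can still explain divergent behavior, which is what the thread points to. [5]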
Bottom line: local inference is gaining ground, but cloud still wins on reliability and features. Watch how the ecosystem tightens around models and tooling like GPT-OSS 20B, Llama.cpp, and browser-native RAG as hardware improves.
References
[1] We built a 270M local model to detect phishing URLs. Prototype of a 270M-parameter on-device model for phishing-URL detection, highlighting local NLP capability and efficiency advantages over the cloud.
[2] Show HN: ShellAI - Local Terminal Assistance with SLM. A local LLM-powered terminal assistant enabling offline, privacy-preserving command support for developers and ops.
[3] WebPizza (browser-only RAG, no backend). Proof of concept running RAG fully in-browser with WebGPU; supports Phi-3, Llama 3, and Mistral 7B; local, no backend; author requests feedback.
[4] Local AI with image input for low end devices? Gemma 3:4b attempted with weak vision; Gemma 3:12b slow; Qwen3-VL-Instruct 4b fails on Mac, 2b works but slowly; hardware is the limiting factor.
[5] Llama.cpp vs Ollama - Same model, parameters and system prompts but VASTLY different experiences. Direct comparison using GPT-OSS 20B; tuning, chat templates, and quantization significantly affect results across hardware and settings.