On-device LLMs are moving from novelty to practicality. A Reddit thread shows someone trying to run unsloth/Phi-3-mini-4k-instruct-bnb-4bit locally to read a dataframe column, generate a clean summary, and parse fields for each row [1]. The target hardware is modest: a Ryzen 5 5500U with 8 GB of RAM and integrated graphics, with the plan of fine-tuning first in Google Colab using Unsloth and then deploying locally.
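A minimal sketch of the Colab side of that workflow, assuming a GPU runtime: load the 4-bit Phi-3 mini checkpoint with Unsloth, then summarize each dataframe row. The column name "text", the prompt, and the generation settings are illustrative assumptions, not details from the post.

```python
import pandas as pd
from unsloth import FastLanguageModel

# Load the 4-bit quantized checkpoint (bitsandbytes weights, fits a small GPU).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Phi-3-mini-4k-instruct-bnb-4bit",
    max_seq_length=4096,   # the model's 4k context window
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch Unsloth to its inference path

def summarize(row_text: str) -> str:
    # Prompt wording is a placeholder; adapt to the actual summarization task.
    prompt = f"Summarize the following record in one sentence:\n{row_text}\nSummary:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    # Strip the prompt tokens, keep only the generated continuation.
    return tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

df = pd.DataFrame({"text": ["example row 1", "example row 2"]})
df["summary"] = df["text"].apply(summarize)
```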
Reality on modest hardware: the goal is clear, tidy row-by-row processing without shipping data to the cloud. The 4-bit quantized model helps the weights squeeze into limited memory, but latency and dynamic shapes still loom large on a machine like this.
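To put a rough number on the memory pressure, here is a back-of-envelope estimate (my own arithmetic, not a figure from the thread) for Phi-3 mini's roughly 3.8B parameters at 4 bits per weight:

```python
# ~3.8B parameters at 4 bits (0.5 bytes) each; KV cache, activations,
# and OS overhead come on top of this figure.
params = 3.8e9
weight_gb = params * 0.5 / 1e9
print(f"~{weight_gb:.1f} GB of quantized weights")  # ≈ 1.9 GB, tight within 8 GB of RAM
```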
Paths to on-device viability:
- Low-effort path: convert the model to a JAX-compatible checkpoint and run it with the JAX/TPU tooling; static graphs tend to perform better there, making local inference more tractable [2].
- High-effort path: stay with PyTorch but work around torch-xla/XLA quirks; use a manual decode loop with a per-token step, avoid dynamic recompiles, and if you must use generate(), chunk the generation and call xm.mark_step() between chunks (see the sketch after this list) [2].
- There is also talk of server-based routes and alternative stacks, though the quick win leans heavily toward the JAX route [2].
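Here is a minimal sketch of the "manual decode loop" idea from the high-effort path, assuming a model and tokenizer already loaded and moved to the XLA device (both names are placeholders). xm.mark_step() cuts the lazy XLA graph after every token so each step compiles as a small, repeatable unit instead of one giant generate() trace; a fully static-shape version would also pre-pad the sequence to a fixed buffer, which is omitted here to keep the sketch short.

```python
import torch
import torch_xla.core.xla_model as xm

def greedy_decode(model, tokenizer, prompt, max_new_tokens=64):
    device = xm.xla_device()
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    eos = tokenizer.eos_token_id
    for _ in range(max_new_tokens):
        with torch.no_grad():
            # Full forward pass per step (no KV cache) keeps the loop simple.
            logits = model(input_ids=input_ids).logits
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        xm.mark_step()  # flush the per-token XLA graph here
        if next_id.item() == eos:
            break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```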
Design levers for local deployment: favor static shapes, a fixed max-token budget, and careful memory budgeting (a short illustration follows below). The takeaway: local, low-resource LLM inference is possible, but the choice of stack matters as much as the choice of model [1][2].
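A hedged illustration of the "static shapes, fixed max tokens" lever: pad and truncate every row's prompt to the same length so a compiled graph can be reused across rows. SEQ_LEN, MAX_NEW_TOKENS, and `prompts` are placeholder assumptions, and `model`/`tokenizer` are assumed already loaded.

```python
SEQ_LEN = 512
MAX_NEW_TOKENS = 64

batch = tokenizer(
    prompts,
    padding="max_length",  # always pad to exactly SEQ_LEN
    truncation=True,       # never exceed SEQ_LEN
    max_length=SEQ_LEN,
    return_tensors="pt",
)
outputs = model.generate(
    **batch.to(model.device),
    max_new_tokens=MAX_NEW_TOKENS,  # fixed output budget keeps shapes bounded
    do_sample=False,
)
```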
Closing thought: the on-device path is real, but the right toolkit makes all the difference. Watch for new quantization tricks and smarter runtimes in the next few updates.
References
[1] [P] Model needs to be deployed. Fine-tuning Unsloth Phi-3 mini; seeks local deployment to process dataframe rows and output summaries and parsed fields on low resources.
[2] [D] LLM Inference on TPUs. Discusses LLM inference on TPUs, comparing PyTorch XLA and JAX paths; suggests JAX for simplicity, or PyTorch with manual-loop workarounds.