The breakthrough comes from Nvidia-led research showing NVFP4 can reach FP8-level accuracy in 4-bit pretraining of a 12B-parameter hybrid Mamba-Transformer model trained on 10T tokens. The result: memory and compute costs drop without a big accuracy tax. [1]
Mechanics matter: NVFP4 uses a shared FP8 (E4M3) scaling factor for each 16-value micro-block, plus a tensor-wide FP32 scaling factor. The two-level scheme is what handles dynamic range: a single micro-block can't faithfully hold values as far apart as 2000 and 0.001, but the per-block and per-tensor scales together keep both outliers and small weights representable. Training loss stays within 1% of FP8 for most of the run, widening to about 1.5% late in learning-rate decay. Downstream scores stay near parity: MMLU Pro 62.58% vs 62.62%; MBPP+ 55.91% vs 59.11%. [1]
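To make the two-level scaling concrete, here is a minimal NumPy sketch of the idea, not the actual NVFP4 kernel: values are snapped to the FP4 (E2M1) grid after dividing by a per-16-value block scale and a tensor-wide scale. The block scale is only emulated in FP32 here (a real implementation rounds it to E4M3), and all function names are illustrative.

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 format used by NVFP4.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_MAX = 6.0      # largest E2M1 magnitude
E4M3_MAX = 448.0   # largest finite FP8 E4M3 value

def round_to_fp4(x):
    """Round each value to the nearest point on the signed E2M1 grid."""
    idx = np.abs(np.abs(x)[:, None] - FP4_GRID).argmin(axis=1)
    return np.sign(x) * FP4_GRID[idx]

def quantize_nvfp4_like(tensor, block=16):
    """Two-level scaled FP4 quantization sketch: one tensor-wide FP32 scale
    plus one per-16-value block scale (E4M3 rounding emulated in FP32)."""
    flat = tensor.astype(np.float32).ravel()
    # Tensor-wide scale maps the global max into the range reachable by
    # (block scale up to E4M3_MAX) * (FP4 grid up to FP4_MAX).
    global_scale = np.abs(flat).max() / (E4M3_MAX * FP4_MAX)
    out = np.empty_like(flat)
    for i in range(0, flat.size, block):
        blk = flat[i:i + block]
        # Per-block scale; would be stored in FP8 E4M3 in the real format.
        block_scale = max(np.abs(blk).max() / FP4_MAX / global_scale, 1e-12)
        q = round_to_fp4(blk / (global_scale * block_scale))
        # Dequantize for inspection; real kernels keep q in 4 bits.
        out[i:i + block] = q * global_scale * block_scale
    return out.reshape(tensor.shape)

# Rows with very different magnitudes: per-block scales absorb the spread.
x = np.random.randn(4, 16).astype(np.float32) * np.logspace(-2, 2, 4)[:, None]
print("max abs quantization error:", np.abs(x - quantize_nvfp4_like(x)).max())
```

The point of the sketch: within one block the FP4 grid only spans about a 12x range, so the E4M3 block scale does the heavy lifting across blocks while the FP32 tensor scale handles global magnitude.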
Deployment implications are real: quantization choices swing what fits in cloud budgets versus local rigs. For local inference, docker/model-runner provides a backend-agnostic way to download and run local LLMs, built on llama.cpp and able to ship models via OCI registries. The project now adds Vulkan and AMD support, a monorepo to ease contributions, an open-source Apache 2.0 license, and day-0 support for DGX Spark. [2]
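For a sense of what local-first usage looks like, here is a hedged sketch of querying a locally served model over an OpenAI-compatible chat endpoint, which llama.cpp-backed runners typically expose. The URL, port, path, and model tag below are placeholders and assumptions, not confirmed model-runner defaults; adjust them to your install.

```python
import json
import urllib.request

# Assumptions: local OpenAI-compatible endpoint and an already-pulled model tag.
BASE_URL = "http://localhost:12434/engines/v1"  # placeholder, check your setup
MODEL = "ai/smollm2"                            # placeholder model tag

payload = {
    "model": MODEL,
    "messages": [
        {"role": "user", "content": "Summarize NVFP4 in one sentence."}
    ],
}
req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```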
Bottom line: 4-bit training is closing the gap with FP8 fast, and pragmatic toolchains are catching up to local-first workflows. Watch how this blend fuels more on-device options and tighter cloud-local parity.
References
[1] Nvidia breakthrough gives 4-bit pretraining technique the accuracy of FP8. Discussion contrasts FP8 and 4-bit NVFP4 training for LLMs; surveys accuracy, efficiency, scaling, quantization, and hardware implications.
[2] Show HN: docker/model-runner – an open-source tool for local LLMs. Docker model-runner offers backend-agnostic local LLMs, Vulkan support, a monorepo, and OCI model transport; seeking community feedback and improvements from contributors worldwide.