Hardware choices are shaping LLM speed and deployment more than ever. Clustering an Nvidia DGX Spark with an M3 Ultra Mac Studio is reported to speed up inference by roughly 4x, turning sluggish generation into near-real-time output on some models [1].
- TabbyAPI on 4×3090 sets the baseline at around 12 tokens/s, and the same 4×3090 setup can be pushed to about 17.9 tokens/s [2].
- SGLang on 4×3090 hits ~32 tokens/s, +167% over the baseline [2].
- With NVLink, 4×3090 climbs to 36–37 tokens/s, +200% [2].
- Adding Torch.compile on top of NVLink brings 37.1 tokens/s, +209% [2].
- SGLang with AWQ + Torch.compile on 4×3090 nails 61.5 tokens/s (8-bit KAT-Dev 32B), +66% vs Mistral [2].
- On a single RTX 2000 Ada, vLLM runs Gemma 12B at 20–21 tokens/s; SGLang AWQ + Torch.compile climbs to 23.4–23.8 tokens/s (+15–18%) [2].
These tweaks (NVLink, Torch.compile, and quantization) show how multi-GPU setups can dwarf single-card results, even within the same model family; a quick way to measure throughput on your own stack is sketched below [2].
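To reproduce a comparison like this on your own hardware, the simplest probe is to request a completion from whatever OpenAI-compatible endpoint the backend exposes (TabbyAPI, vLLM, and SGLang all offer one) and divide generated tokens by wall-clock time. Below is a minimal sketch assuming a server is already running on localhost:8000; the port and model id are placeholders, and for the SGLang configurations above, AWQ and Torch.compile are enabled at server launch (flag names vary by release).

```python
# Rough tokens/s probe against an OpenAI-compatible local server
# (TabbyAPI, vLLM, and SGLang all expose this style of endpoint).
# Assumptions: server already launched at localhost:8000; model id is a placeholder.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.time()
resp = client.chat.completions.create(
    model="kat-dev-32b-awq",  # placeholder: use whatever model id the server reports
    messages=[{"role": "user", "content": "Explain NVLink in two paragraphs."}],
    max_tokens=512,
)
elapsed = time.time() - start

generated = resp.usage.completion_tokens
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/s")
```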
In OCR tests, PaddleOCR-VL outperforms proprietary models when fed 1080p images, while 4K inputs often miss text, a reminder that input resolution and VL capabilities both matter [3].
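One low-effort way to check the resolution effect yourself is to downscale 4K scans to roughly 1080p before sending them to the model. Here is a minimal sketch using Pillow; the 1920 px target and resampling filter are illustrative assumptions, not anything prescribed by PaddleOCR-VL.

```python
# Shrink an image so its long side is ~1920 px before OCR, mirroring the
# observation that ~1080p inputs fared better than 4K ones.
# Assumption: the 1920 px target and LANCZOS filter are illustrative choices.
from PIL import Image

def downscale_for_ocr(src: str, dst: str, long_side: int = 1920) -> None:
    img = Image.open(src)
    scale = long_side / max(img.size)
    if scale < 1.0:  # only shrink, never upscale
        img = img.resize((round(img.width * scale), round(img.height * scale)),
                         Image.LANCZOS)
    img.save(dst)

downscale_for_ocr("scan_4k.png", "scan_1080p.png")
```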
On coding models, Ollama focuses on simple local serving, while llama.cpp and GGML cover lower-level hosting and backend options; ongoing attribution debates highlight how messy and fast-moving the tooling landscape is [4].
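For a sense of what that local serving looks like in practice, Ollama exposes a REST API on port 11434 by default. A minimal sketch follows; the `qwen3-coder` model tag is an assumption, standing in for whichever coding model you have pulled.

```python
# Ask a locally running Ollama server for a code completion.
# Assumptions: Ollama is serving on its default port and the "qwen3-coder"
# tag has already been pulled; swap in whatever model you actually run.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3-coder",
        "prompt": "Write a Python function that reverses a linked list.",
        "stream": False,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```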
A separate piece digs into training an LLM from scratch, weighing the upfront effort and costs as you scale hardware and clusters [5].
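At its core, such a from-scratch run comes down to a next-token cross-entropy training loop. Here is a minimal PyTorch sketch under the usual assumptions: `model` is a small decoder-style network returning per-token logits, `loader` yields batches of token ids, and the details are illustrative rather than the series' actual code.

```python
# Skeleton of next-token-prediction training in PyTorch.
# `model` and `loader` are placeholders for whatever architecture and
# tokenized dataset a from-scratch build ends up with.
import torch
import torch.nn.functional as F

def train_epoch(model, loader, optimizer, device="cuda"):
    model.train()
    for input_ids in loader:                    # (batch, seq_len) token ids
        input_ids = input_ids.to(device)
        inputs, targets = input_ids[:, :-1], input_ids[:, 1:]
        logits = model(inputs)                  # (batch, seq_len-1, vocab_size)
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()
```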
Bottom line: speed, cost, and deployment shape your path—from local multi-GPU rigs to cloud-ready OCR and coding-tooling stacks. Watch how NVLink, quantization, and on-device options evolve next.
References
[1] Clustering Nvidia DGX Spark and M3 Ultra Mac Studio for 4x Faster LLM Inference. Post discusses clustering Nvidia DGX Spark and M3 Ultra Mac Studio to achieve about four times faster LLM inference.
[2] SGLang vs TabbyAPI & vLLM Benchmark Increases (Multi-GPU + Single-GPU). Benchmarks compare SGLang with TabbyAPI and vLLM across multi-GPU systems, highlighting speedups, quantization, NVLink, virtualization, and power tradeoffs in practice.
[3] PaddleOCR-VL is better than private models. PaddleOCR praised; peers compare Qwen3-VL, Gemini, and Mistral; discusses vertical text, handwriting, testing, GPU issues, and model strengths versus others.
[4] New coding models and integrations. Discussion covers llama.cpp, Ollama, GGML/GGUF, GLM-4.6, Claude, Qwen3-Coder, and local vs cloud setups, with price, attribution, and performance notes throughout.
[5] Writing an LLM from scratch, part 22 – training our LLM. A series on building an LLM from scratch; discusses intuition, code access, cost tradeoffs, and references Karpathy and the Manning book.