
Hardware and Deployment Realities: Inference Clusters, GPUs, and Multi-GPU Benchmarks

1 min read
261 words
Opinions on LLM Hardware and Deployment

Hardware choices are shaping LLM speed and deployment more than ever. Clustering an Nvidia DGX Spark with an M3 Ultra Mac Studio is reported to speed up inference by roughly 4x, bringing some models much closer to real-time responsiveness [1].

  • TabbyAPI on 4×3090 serves as the baseline at around 12 tokens/s, bumping to about 17.9 tokens/s in the same 4×3090 configuration [2].
  • SGLang on 4×3090 hits ~32 tokens/s, +167% over the baseline [2].
  • With NVLink, 4×3090 climbs to 36–37 tokens/s, +200% [2].
  • Adding Torch.compile on top of NVLink reaches 37.1 tokens/s, +209% [2].
  • SGLang with AWQ + Torch.compile on 4×3090 reaches 61.5 tokens/s (8-bit KAT-Dev 32B), +66% vs Mistral [2].
  • On a single RTX 2000 Ada, vLLM runs Gemma 12B at 20–21 tokens/s, while SGLang with AWQ + Torch.compile climbs to 23.4–23.8 tokens/s (+15–18%) [2]; a quick arithmetic check of these gains follows this list.
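
The percentage figures above follow directly from the reported tokens/s numbers. Here is a minimal Python check against the ~12 tokens/s TabbyAPI baseline; 36.5 is simply the midpoint of the reported 36–37 range.

    # Derive the relative gains quoted above from the cited tokens/s figures [2].
    baseline = 12.0  # TabbyAPI on 4x3090 (tokens/s)
    results = {
        "SGLang, 4x3090": 32.0,
        "SGLang + NVLink": 36.5,                   # midpoint of the reported 36-37 tokens/s
        "SGLang + NVLink + Torch.compile": 37.1,
    }
    for name, tps in results.items():
        gain_pct = (tps / baseline - 1) * 100
        print(f"{name}: {tps:.1f} tokens/s (+{gain_pct:.0f}% vs baseline)")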

These tweaks, NVLink, Torch.compile, and quantization, show how multi-GPU setups can far outpace single-card results, even within the same model family [2].
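
To reproduce numbers like these on your own hardware, a rough approach is to time a single non-streaming completion against the server's OpenAI-compatible endpoint, which SGLang, vLLM, and TabbyAPI all expose. The sketch below is a minimal example; the URL, port, and model name are placeholders for your deployment, and the elapsed time includes prefill, so it slightly understates pure decode speed.

    import time
    import requests

    URL = "http://localhost:30000/v1/completions"  # placeholder endpoint; adjust to your server
    payload = {
        "model": "my-local-model",                 # placeholder model name
        "prompt": "Explain NVLink in one paragraph.",
        "max_tokens": 256,
        "stream": False,
    }

    start = time.time()
    resp = requests.post(URL, json=payload, timeout=300).json()
    elapsed = time.time() - start

    generated = resp["usage"]["completion_tokens"]  # usage is reported for non-streaming requests
    print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/s")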

  • In OCR tests, PaddleOCR-VL outperforms private models when fed 1080p images, while 4K inputs often lead to missed text, a reminder that input resolution and VL preprocessing matter (see the resizing sketch after these bullets) [3].

  • On coding models, Ollama covers running models off a local server, while llama.cpp and GGML supply hosting and backend options; attribution debates highlight the messy, fast-moving nature of the tooling [4].

  • A separate piece digs into training an LLM from scratch, weighing the upfront effort and cost as you scale hardware and clusters (a back-of-envelope compute estimate follows these bullets) [5].
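
On the resolution point from the OCR bullet above, one common workaround is to downscale oversized inputs before handing them to a VL OCR model. The sketch below uses Pillow to cap a scan at roughly 1080p; the file paths are placeholders, and the actual PaddleOCR-VL call is left as a comment since its exact API is not covered by the source.

    from PIL import Image

    SRC = "scan_4k.png"          # placeholder path to a 4K page scan
    DST = "scan_1080p.png"

    img = Image.open(SRC)
    img.thumbnail((1920, 1080))  # downscale in place, preserving aspect ratio
    img.save(DST)

    # feed DST to your OCR pipeline here, e.g. PaddleOCR-VL (API not shown; see its docs)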
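
For the from-scratch training piece, a standard back-of-envelope estimate puts training compute at roughly 6 FLOPs per parameter per token. This is a general rule of thumb, not a figure from the linked series, and every number below is an illustrative assumption.

    # Rough training-compute estimate: ~6 * parameters * tokens FLOPs (rule of thumb).
    params = 1e9                      # 1B-parameter model (assumption)
    tokens = 20e9                     # ~20 training tokens per parameter (assumption)
    total_flops = 6 * params * tokens

    sustained_flops = 150e12          # assumed ~150 TFLOP/s sustained per GPU
    gpu_hours = total_flops / sustained_flops / 3600
    print(f"~{total_flops:.1e} FLOPs, roughly {gpu_hours:,.0f} GPU-hours on the assumed hardware")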

Bottom line: speed, cost, and deployment shape your path—from local multi-GPU rigs to cloud-ready OCR and coding-tooling stacks. Watch how NVLink, quantization, and on-device options evolve next.

References

[1] HackerNews: "Clustering Nvidia DGX Spark and M3 Ultra Mac Studio for 4x Faster LLM Inference." Post discusses clustering an Nvidia DGX Spark and an M3 Ultra Mac Studio to achieve roughly four times faster LLM inference.

[2] Reddit: "SGLang vs TabbyAPI & vLLM Benchmark Increases (Multi-GPU + Single-GPU)." Benchmarks compare SGLang with TabbyAPI and vLLM across multi-GPU and single-GPU systems, highlighting speedups from quantization, NVLink, and Torch.compile, plus virtualization and power tradeoffs in practice.

[3] Reddit: "PaddleOCR-VL, is better than private models." PaddleOCR-VL praised; commenters compare Qwen3-VL, Gemini, and Mistral, and discuss vertical text, handwriting, testing, GPU issues, and relative model strengths.

[4] HackerNews: "New coding models and integrations." Discussion covers llama.cpp, Ollama, GGML/GGUF, GLM-4.6, Claude, Qwen3-Coder, and local vs cloud setups, with notes on price, attribution, and performance.

[5] HackerNews: "Writing an LLM from scratch, part 22 – training our LLM." A series on building an LLM from scratch; discusses intuition, code access, and cost tradeoffs, referencing Karpathy and the Manning book.
