The big move in 2025 is cheaper LLM inference via 4-bit models on commodity GPUs and smarter attention patterns. In one thread, throughput benchmarks on the RTX 4090, RTX 5090, and RTX PRO 6000 show how 4-bit quantization shifts the cost-efficiency and scaling picture, with PCIe bottlenecks creeping in once you go multi-GPU. Throughput scales nicely when a single 4090 handles a 4-bit model that fits in its memory, while multi-GPU setups wrestle with cross-GPU communication [1].
Bold path: 4-bit LLMs on RTX GPUs
- Qwen3-Coder-30B-A3B-Instruct-AWQ fits in 24 GB and scales linearly as you add GPUs via vLLM serving and vllm bench serve (a serving sketch follows this list) [1].
- Meta-Llama-3.3-70B-Instruct-AWQ-INT4 fits in 48 GB, but the gains shrink somewhat from PCIe overhead in multi-GPU configs [1].
- GLM-4.5-Air-AWQ-4bit needs 96 GB, pushing you toward all four RTX 4090 cards, where PCIe chatter becomes the bottleneck and the RTX PRO 6000 can edge ahead [1].
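As a rough illustration of the serving side, here is a minimal sketch using vLLM's offline Python API to load a 4-bit AWQ model and measure generation throughput. The model name comes from the thread, but the Hugging Face repo id, parallelism, and prompt settings below are illustrative assumptions, not the benchmark's exact setup.

```python
# Minimal sketch: serve a 4-bit AWQ model with vLLM and time raw generation.
# Repo id, tensor_parallel_size, and prompt counts are assumptions for illustration.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-Coder-30B-A3B-Instruct-AWQ",  # assumed repo id for the quantized weights
    quantization="awq",           # 4-bit AWQ weights
    tensor_parallel_size=1,       # raise to 2 or 4 to shard across GPUs (PCIe traffic grows)
    gpu_memory_utilization=0.90,
)

prompts = ["Write a Python function that parses a CSV file."] * 64
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} generated tokens/s across {len(prompts)} requests")
```

The thread's numbers come from the HTTP serving path (vllm serve plus vllm bench serve); the offline API above is just the shortest way to see the same throughput-versus-GPU-count behavior from Python.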
The price tag matters too: around $0.39/hour for the 4090, $0.65/hour for the 5090, and about $1/hour for Pro-class hardware, influencing the math of deployment [1].
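To make those hourly rates concrete, here is a back-of-the-envelope cost-per-token calculation. Only the hourly prices come from the thread; the tokens/s figures plugged in are placeholders, not measured results.

```python
# Back-of-the-envelope $/1M-token math. Hourly rates are from the thread [1];
# the tokens-per-second figures are placeholder assumptions.
def cost_per_million_tokens(usd_per_hour: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return usd_per_hour / tokens_per_hour * 1_000_000

scenarios = {
    "RTX 4090     @ $0.39/h": (0.39, 2500.0),  # assumed throughput
    "RTX 5090     @ $0.65/h": (0.65, 3500.0),  # assumed throughput
    "RTX PRO 6000 @ $1.00/h": (1.00, 5000.0),  # assumed throughput
}

for name, (price, tps) in scenarios.items():
    print(f"{name}: ${cost_per_million_tokens(price, tps):.3f} per 1M tokens")
```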
Algorithmic path: DeepSeek 3.2 sparse attention
- DeepSeek 3.2 uses dynamic sparsity with a lightning indexer to decide what to attend to, reducing compute when long contexts don't need all tokens (a toy sketch of the idea follows this list) [2].
- The thread mentions FlashMLA kernels and surveys open-source work, including Triton-based community efforts, Together AI experiments, and sparse-routing ideas from MoE frameworks [2].
- Multi-Head Latent Attention (MLA) is highlighted in related papers and discussions as a faster, more expressive alternative [2].
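For intuition only, here is a toy PyTorch sketch of the general dynamic-sparsity idea: score past tokens, then let each query attend to only its top-k keys. This is not DeepSeek's lightning indexer or the FlashMLA kernels; a real implementation scores with a cheap low-dimensional indexer and never materializes the full score matrix, which is where the compute saving comes from.

```python
# Toy top-k sparse attention (PyTorch). Each query attends only to its k_keep
# highest-scoring keys. Full scores are computed here only to keep the sketch
# short; a real sparse-attention design uses a cheap indexer to avoid this.
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, k_keep=64):
    # q, k, v: (batch, seq_len, dim)
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d**0.5        # (B, Tq, Tk) attention scores
    k_keep = min(k_keep, scores.size(-1))
    topk = scores.topk(k_keep, dim=-1)               # indices of the k_keep best keys per query
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, topk.indices, 0.0)             # 0 where kept, -inf elsewhere
    attn = F.softmax(scores + mask, dim=-1)          # softmax over the selected keys only
    return attn @ v

# Smoke test on random data
B, T, D = 2, 1024, 64
q, k, v = (torch.randn(B, T, D) for _ in range(3))
out = topk_sparse_attention(q, k, v, k_keep=64)
print(out.shape)  # torch.Size([2, 1024, 64])
```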
Bottom line: you’re choosing between raw hardware throughput and smarter, sparser algorithms—both paths tighten the leash on cost for scalable LLMs [1][2].
References
[1] Benchmarking LLM Inference on RTX 4090 / RTX 5090 / RTX PRO 6000 #2
Benchmarks LLM inference throughput on RTX GPUs; compares three 4-bit models; notes PCIe bottlenecks and cost-efficiency across multiple GPU configurations.
[2] [R] DeepSeek 3.2's sparse attention mechanism
Discusses DeepSeek sparse attention, dynamic token selection, efficiency; compares to MLA, MoE, LongFormer; seeks open-source implementations and papers.