
MoE and Pruning: Is One-Shot Pruning the Future of Scale?

Opinions on LLM Pruning: One-Shot MoE Compression

MoE pruning is getting real: Cerebras's one-shot approach, "REAP the Experts," claims to beat classic expert merging on practical benchmarks, with everything done in FP8. For 120B+ models, the implications could tilt cost-to-performance in a big way [1].

One-shot MoE pruning vs merging — Cerebras pruned Qwen3-Coder-480B to 363B (25%) and 246B (50%), using a saliency criterion that measures each expert's routed contribution, all in FP8. They report minimal accuracy degradation at 25% across a suite of benchmarks, with checkpoints available for the pruned models [1].
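The general recipe behind one-shot expert pruning is simple: score each expert's routed contribution on a small calibration set, then drop the lowest scorers without retraining. The sketch below illustrates that pattern in generic PyTorch; the exact saliency formula, the `moe_layer` and `calib_batches` interfaces, and the router-shrinking helper are assumptions for illustration, not Cerebras's published REAP implementation.

```python
import torch

@torch.no_grad()
def expert_saliency(moe_layer, calib_batches):
    """Accumulate a per-expert score: gate-weighted norm of each expert's output.

    `moe_layer` is assumed to expose `.router` (tokens -> logits over experts)
    and `.experts` (an nn.ModuleList); both are hypothetical interfaces.
    """
    num_experts = len(moe_layer.experts)
    scores = torch.zeros(num_experts)
    for x in calib_batches:                          # x: [tokens, hidden]
        gates = moe_layer.router(x).softmax(dim=-1)  # [tokens, num_experts]
        for e, expert in enumerate(moe_layer.experts):
            # Weight each expert's output norm by its routing probability
            # (approximated over all tokens rather than only routed ones).
            scores[e] += (gates[:, e, None] * expert(x)).norm(dim=-1).sum()
    return scores

def prune_experts(moe_layer, scores, keep_ratio=0.75):
    """Keep the top experts by saliency and drop the rest, one-shot, no retraining."""
    k = max(1, int(len(moe_layer.experts) * keep_ratio))
    keep = torch.topk(scores, k).indices.sort().values
    moe_layer.experts = torch.nn.ModuleList(
        moe_layer.experts[int(i)] for i in keep
    )
    # Hypothetical helper: the router's output dimension must shrink to match.
    moe_layer.router.prune_output_dims(keep)
    return keep
```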

Impact on cost-to-performance with FP8 — Both the pruning and the released checkpoints are in FP8, which points to real memory and compute savings while keeping performance competitive. The setup underscores how pruning choices and precision formats reshape how we judge MoE scaling, beyond raw perplexity scores [1]. Notes in the discussion suggest even deeper pruning may be feasible for some model families, though real-world results vary by task [1].
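To see why FP8 matters for cost, a weight-only memory estimate already tells the story: roughly 1 byte per parameter in FP8 versus 2 in BF16. The arithmetic below uses the parameter counts quoted above and deliberately ignores KV cache and activation overhead, so treat it as a rough floor rather than a deployment sizing guide.

```python
# Back-of-the-envelope weight-memory estimate: FP8 (~1 B/param) vs BF16 (~2 B/param).
GIB = 1024 ** 3

def weight_gib(params_billion: float, bytes_per_param: int) -> float:
    return params_billion * 1e9 * bytes_per_param / GIB

for name, params_b in [("Qwen3-Coder-480B", 480), ("REAP 25% (363B)", 363), ("REAP 50% (246B)", 246)]:
    print(f"{name:>18}: FP8 ~{weight_gib(params_b, 1):.0f} GiB, "
          f"BF16 ~{weight_gib(params_b, 2):.0f} GiB")
```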

RTX Pro 6000 Blackwell benchmarks — The RTX Pro 6000 Blackwell setup on vLLM hits a sweet spot for local 120B+ models (quick per-user arithmetic follows the list):

- Peak of 1051 tok/s at 20 users with 1K context; 300-476 tok/s across context lengths; TTFT of 200-400 ms at low concurrency; average latency grows from 2.6 s (1 user) to 30.2 s (20 users) at 128K context [2].
- Extended run: ~1016 tok/s sustained for 1000-2000-token generations; latency scales with context length; 128K context shows no swapping thanks to 96 GB of VRAM headroom [2].
- Power draw of 300-600 W; throughput scales well up to 5 users, with excellent multi-user behavior up to 10 users [2].
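As a rough feel for what the peak figure means per user, here is the arithmetic behind splitting the reported 1051 tok/s aggregate across 20 concurrent users; the 500-token reply length is a hypothetical choice for illustration, not a number from the benchmark.

```python
# Split the reported aggregate throughput evenly across concurrent users.
peak_tok_s, users = 1051, 20
per_user_tok_s = peak_tok_s / users        # ~52.6 tok/s per concurrent user
reply_tokens = 500                         # hypothetical reply length
print(f"~{per_user_tok_s:.1f} tok/s per user -> "
      f"~{reply_tokens / per_user_tok_s:.1f} s for a {reply_tokens}-token reply at full load")
```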

Together, these signals point to a path where FP8-driven pruning plus high-density GPUs shape scaling decisions for 120B+ models without exploding costs. The caveat? Benchmark results vary by task, especially in multi-turn or multiple-choice settings [1].

Closing thought: pruning and precision choices are moving the goalposts on how far locally run models can scale. Watch Cerebras's next releases and Blackwell-enabled tests for clarity on real-world cost-to-performance.

References

[1] Reddit. "New from Cerebras: REAP the Experts: Why Pruning Prevails for One-Shot MoE compression." One-shot MoE pruning beats merging; runs in FP8 and preserves accuracy; Hugging Face checkpoints available; multiple models pruned.

[2] Reddit. "RTX Pro 6000 Blackwell vLLM Benchmark: 120B Model Performance Analysis." Benchmarks of gpt-oss-120b on an RTX Pro 6000 Blackwell with vLLM, covering throughput, latency, multi-user scaling, MoE benefits, and pricing debates.
