Benchmark battles are real: vLLM paired with Qwen-3-VL-30B-A3B is posting eye-popping speeds. In the same discussion, one commenter reports 137 tokens/s of generation from llama.cpp with the model split across a 3090 and a 4090. That kind of cross-GPU splitting is exactly what enthusiasts chase as the space gets more crowded [1].
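As a rough illustration of that kind of split, here is a minimal sketch using the llama-cpp-python bindings; the GGUF filename and the 45/55 split ratio are placeholder assumptions, not values from the thread.

```python
from llama_cpp import Llama

# Hypothetical GGUF path; the thread does not name an exact quantized file.
MODEL_PATH = "qwen3-vl-30b-a3b-q4_k_m.gguf"

llm = Llama(
    model_path=MODEL_PATH,
    n_gpu_layers=-1,            # offload every layer to GPU
    tensor_split=[0.45, 0.55],  # rough 3090/4090 split; tune to each card's VRAM
    n_ctx=8192,                 # context window; larger values grow the KV cache
    verbose=False,
)

out = llm("Describe this benchmark setup in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

The split ratio simply tells llama.cpp what fraction of the weights to place on each device, so the card with more free VRAM gets the larger share.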
vLLM performance with Qwen-3-VL-30B-A3B on H100 PCIe
On an H100 PCIe setup, the combination averages 549.0 tokens/s of prompt throughput and 357.8 tokens/s of generation throughput, with 7 requests running and 1 waiting. GPU KV cache usage sits at 0.2% and the prefix cache hit rate is 49.5% [1]. The thread also covers multi-GPU configurations and practical constraints: caching and distribution work aren't trivial, even when the hardware is top-tier. The model used is the quantized AWQ variant [1].
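For readers who want to reproduce the flavor of this setup, here is a minimal sketch using vLLM's offline Python API; the exact model ID, memory fraction, and sampling settings are assumptions rather than details confirmed in the thread.

```python
from vllm import LLM, SamplingParams

# Hypothetical AWQ checkpoint name; the thread only says "the quantized AWQ variant".
MODEL_ID = "Qwen/Qwen3-VL-30B-A3B-Instruct-AWQ"

llm = LLM(
    model=MODEL_ID,
    quantization="awq",            # load AWQ-quantized weights
    gpu_memory_utilization=0.90,   # leave headroom for activations
    max_model_len=8192,            # cap context length to keep the KV cache small
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize what a prefix cache hit rate of 49.5% means."], params)
print(outputs[0].outputs[0].text)
```

While running, vLLM periodically logs average prompt/generation throughput, running and waiting request counts, GPU KV cache usage, and prefix cache hit rate, which is where numbers like those quoted above come from.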
GLM-4.6 on 8x H200 NVL: 44 tok/s baseline
Running GLM-4.6 on 8x H200 NVL with --tensor-parallel-size 8 yields about 44 tokens/second (fully dense, no quantization). Enabling expert parallelism (--enable-expert-parallel) helps spread the load across experts, though the setup can be finicky [2]. Some testers report higher TPS with lower tensor-parallel and pipeline-parallel (PP) settings, though those reports often involve AWQ or FP8 variants [2].
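A minimal sketch of that configuration through vLLM's Python API follows; the Hugging Face model ID is an assumption, and whether your vLLM build exposes enable_expert_parallel as a Python keyword depends on the version (the CLI flags are the ones discussed in the thread).

```python
from vllm import LLM, SamplingParams

# Equivalent CLI, per the flags discussed in the thread:
#   vllm serve zai-org/GLM-4.6 --tensor-parallel-size 8 --enable-expert-parallel
# The model ID below is an assumption; check the exact repository name before use.
llm = LLM(
    model="zai-org/GLM-4.6",
    tensor_parallel_size=8,       # shard weights across all 8 H200s
    enable_expert_parallel=True,  # spread MoE experts across GPUs (recent vLLM versions)
    max_model_len=32768,          # cap context; long contexts inflate the KV cache
)

params = SamplingParams(temperature=0.6, max_tokens=512)
print(llm.generate(["Explain expert parallelism in one paragraph."], params)[0].outputs[0].text)
```

Expert parallelism places different MoE experts on different GPUs instead of slicing every weight matrix, which is why it interacts with tensor-parallel and pipeline-parallel settings in the ways the thread describes.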
Takeaway: fast isn’t one number; it’s a function of workload, caching, and how you shard/offload across GPUs. In practice, vLLM‑driven multi‑GPU stacks push hundreds of tokens per second in some setups, while GLM-4.6 on massive rigs gives a more conservative baseline that can improve with config tweaks and cache strategy [1][2].
References
[1] vLLM + Qwen-3-VL-30B-A3B is so fast
User reports vLLM speeds with Qwen-3-VL-30B-A3B across GPUs; discusses throughput, caching, offloading, hardware, and comparisons of Ollama versus llama.cpp inference.
[2] vLLM - GLM-4.6 Benchmark on 8xH200 NVL: 44 token/second
Discusses vLLM performance with GLM-4.6 across GPUs, including throughput, parallelism, network considerations, and comparisons to llama.cpp, SGLang, and AWQ/MTP variants.