Cost vs Speed: Real-World Trade-offs in Inference for Open-Source LLMs

Cost vs speed is the hot metric for open-source LLMs. Posts spotlight practical levers, such as running DeepSeek OCR on vLLM, with claims of up to 10x cheaper cloud GPU runs [1].

Quantization tug of war

On quantization, the thread swings from GPTQ to AWQ. Latency is roughly on par, and GPTQ can still edge ahead in some setups. Tooling is shifting toward llm-compressor, where the old GPTQ path is now described as w4a16, with w8a8 variants also mentioned [2].
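
As a rough illustration of the AWQ (w4a16) path, here is a minimal vLLM sketch. The checkpoint name and sampling values are illustrative assumptions, not settings from the thread; vLLM can usually infer the quantization scheme from the checkpoint config, but the flag is passed explicitly for clarity.

```python
# Minimal sketch: loading an AWQ (w4a16) checkpoint in vLLM.
# The model name is illustrative, not one discussed in the thread.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ", quantization="awq")
outputs = llm.generate(
    ["Explain the difference between w4a16 and w8a8 in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```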

A middleware layer to cut LLM costs

One proposal is a middleware layer that caches prompts, trims and summarizes context, and routes calls to cheaper models. Early tests report around a 30% token-cost reduction with similar output quality [3], with OpenRouter cited as a practical example of routing.
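
A minimal sketch of that idea, assuming an OpenAI-compatible endpoint such as OpenRouter; the model names, cache, and routing threshold below are illustrative, not details from the post.

```python
# Sketch of a cost-cutting middleware: cache identical prompts and route
# short requests to a cheaper model. Names and thresholds are assumptions.
import hashlib
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-...")

CHEAP_MODEL = "meta-llama/llama-3.1-8b-instruct"    # assumed cheap tier
STRONG_MODEL = "meta-llama/llama-3.1-70b-instruct"  # assumed pricier tier

_cache: dict[str, str] = {}

def complete(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in _cache:                  # cache hit costs zero tokens
        return _cache[key]
    model = CHEAP_MODEL if len(prompt) < 2000 else STRONG_MODEL
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    answer = resp.choices[0].message.content
    _cache[key] = answer
    return answer
```

Real savings hinge on the cache hit rate and how aggressively the router prefers the cheap tier; the ~30% figure above depends on both.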

Batching on local GPUs

People are batch-inferencing on a single card like the RTX 4080. In one thread, Ollama runs Gemma-3 12b on a 4080, but vLLM's GGUF support is limited, nudging users toward unsloth/gemma-3-12b-it-bnb-4bit or gaunernst/gemma-3-12b-it-int4-awq checkpoints served with vLLM directly [4].
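
For the batching itself, a rough sketch with vLLM's offline API using one of the checkpoints named above; the memory and context settings are guesses intended to fit a 16 GB card, not values from the thread.

```python
# Sketch: offline batched generation on a single GPU with vLLM.
# gpu_memory_utilization and max_model_len are assumed, not reported values.
from vllm import LLM, SamplingParams

llm = LLM(
    model="gaunernst/gemma-3-12b-it-int4-awq",
    max_model_len=8192,           # cap context to keep the KV cache small
    gpu_memory_utilization=0.90,  # leave a little headroom for the desktop
)

prompts = [f"Summarize document {i} in two sentences." for i in range(32)]
params = SamplingParams(temperature=0.2, max_tokens=64)

# vLLM applies continuous batching across the whole prompt list internally.
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```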

Hardware bottlenecks in llama.cpp

Meanwhile, llama.cpp users report long-context RAM bottlenecks: throughput can roughly halve once the prompt grows past tens of thousands of tokens, with memory bandwidth and NUMA layout the usual suspects [5].
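
The slowdown is easy to probe from Python via llama-cpp-python (bindings for llama.cpp); the model path, context size, and prompt padding below are placeholders, not the poster's setup.

```python
# Sketch: timing one long-prompt completion with llama-cpp-python.
# Model path, n_ctx, and thread count are assumed placeholders.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./gemma-3-12b-it-Q4_K_M.gguf",  # assumed local GGUF file
    n_ctx=32768,     # long contexts make prompt processing memory-bound
    n_threads=16,
)

long_prompt = "memory bandwidth " * 4000  # crude filler, roughly 10k+ tokens
start = time.time()
out = llm(long_prompt + "\nSummarize the text above.", max_tokens=64)
print(f"generated in {time.time() - start:.1f}s")
print(out["choices"][0]["text"])
```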

Bottom line: price and speed trade-offs hinge on model choice, context length, and hardware realities—RAM first, price second.

References

[1] HackerNews: "DeepSeek OCR with Vllm – 10x cheaper on cloud GPU". Video claims vLLM enables cheaper LLM-powered OCR on cloud GPUs, highlighting cost reductions and performance.

[2] Reddit: "Fall of GPTQ and Rise of AWQ. Why exactly?". Compares GPTQ, AWQ, and w4a16; notes latency/performance, calibration risks, and the shift to w8a8/FP8-INT8 with vLLM, with the old GPTQ path now labeled w4a16.

[3] Reddit: "API middle layer to automatically cut LLM costs". Proposes an LLM middleware to cache prompts, summarize context, and route to cheap models; ~30% cost cut; seeks feedback.

[4] Reddit: "Batch inference locally on 4080". Discusses local batch inference on a 4080 with Gemma-3 12b, memory/quantization challenges, suggested model options, and bypassing wrappers by using vLLM serve directly.

[5] Reddit: "Llama.cpp New Ram halves inference speed at a higher context". User reports llama-server slowing with long context on new RAM; discusses NUMA, memory bandwidth, BIOS settings, and troubleshooting tips.
