Cross-GPU KV caches are turning heads. KV Marketplace is a project that demonstrates sharing LLM attention caches across GPUs via peer-to-peer transfers, aiming to cut distributed inference latency.
What it is - It’s a small research prototype by neelsomani to test cross-GPU reuse of transformer attention states [1]. Each process exports completed prefix KV tensors (key/value attention states) into a registry keyed by a hash of the input tokens and model version [1]. Other processes with the same prefix can then import those tensors directly from a peer GPU, bypassing host memory and avoiding redundant prefill compute [1]. The design draws on ideas from vLLM’s local KV cache and LMCache’s multi-tier approach [1].
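To make the registry idea concrete, here is a minimal sketch in PyTorch of a prefix-KV registry keyed by a hash of the input token IDs and model version. This is not the project's actual API; the names prefix_key, KVRegistry, export_prefix, and import_prefix are hypothetical.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

import torch


def prefix_key(token_ids: List[int], model_version: str) -> str:
    """Hash the prompt prefix plus the model version into a registry key."""
    payload = model_version + "|" + ",".join(map(str, token_ids))
    return hashlib.sha256(payload.encode()).hexdigest()


class KVRegistry:
    """Maps prefix keys to (key, value) attention tensors resident on some GPU."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[torch.Tensor, torch.Tensor]] = {}

    def export_prefix(self, key: str, k: torch.Tensor, v: torch.Tensor) -> None:
        # The tensors stay on the exporting process's GPU; the registry only
        # records a handle to them.
        self._entries[key] = (k, v)

    def import_prefix(
        self, key: str, device: torch.device
    ) -> Optional[Tuple[torch.Tensor, torch.Tensor]]:
        entry = self._entries.get(key)
        if entry is None:
            return None  # cache miss: caller falls back to normal prefill
        k, v = entry
        # Device-to-device copy; with peer access enabled this bypasses host memory.
        return k.to(device, non_blocking=True), v.to(device, non_blocking=True)
```

A real cross-process version would need to share the registry itself (for example, metadata in a shared store plus CUDA IPC handles to the tensors) rather than a plain in-process dict.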
How it works - KV Marketplace focuses narrowly on the GPU-to-GPU fast path: peer-to-peer prefix reuse over RDMA or NVLink [1]. The code is intentionally minimal, with no distributed registry, eviction policy, or CPU/disk tiers yet [1]. Under optimistic conditions (perfect prefix importing), the prototype shows roughly a 15% latency reduction along with throughput gains [1].
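Below is a rough sketch of that fast path for two GPUs on one node, assuming NVLink or PCIe peer access and using plain PyTorch device-to-device copies; a cross-node RDMA path would need NCCL or a similar transport instead. The tensor shapes are illustrative, not taken from the project.

```python
import torch

src, dst = torch.device("cuda:0"), torch.device("cuda:1")

# Check whether the two GPUs can address each other directly (NVLink or PCIe P2P).
can_p2p = torch.cuda.can_device_access_peer(src, dst)

# Stand-in for an exported prefix KV pair living on the source GPU
# (shape [num_layers, num_heads, prefix_len, head_dim] chosen for illustration).
k = torch.randn(32, 8, 128, 64, dtype=torch.float16, device=src)
v = torch.randn_like(k)

if can_p2p:
    # .to() between two CUDA devices issues a direct device-to-device copy,
    # skipping host memory entirely.
    k_local = k.to(dst, non_blocking=True)
    v_local = v.to(dst, non_blocking=True)
else:
    # Slow path: stage through host memory, which is exactly what the
    # prototype's fast path is meant to avoid.
    k_local = k.cpu().to(dst)
    v_local = v.cpu().to(dst)

torch.cuda.synchronize(dst)
```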
Why it matters - It’s a minimal proof-of-concept aimed at teams exploring distributed LLM inference, caching, or RDMA transports [1]. If the latency reductions hold up outside of those optimistic conditions, the approach could influence how multi-GPU inference runtimes are designed and tuned.
Closing thought - The GitHub repo kv-marketplace offers a concrete starting point for experimenting with cross-GPU KV sharing and future cache-coherence ideas. Check the repo to follow updates and future enhancements [1].
References
[1] Show HN: KV Marketplace – share LLM attention caches across GPUs like memcached. Prototype sharing transformer prefix KV across GPUs via peer-to-peer cache; aims to reduce latency in distributed LLM inference.