Cross-GPU KV caches are turning heads. KV Marketplace is a project that demonstrates sharing LLM attention caches across GPUs via peer-to-peer transfers, aiming to cut distributed inference latency.
What it is - It’s a small research prototype by neelsomani to test cross-GPU reuse of transformer attention states [1]. Each process exports completed prefix KV tensors (key/value attention states) into a registry keyed by a hash of the input tokens and model version [1]. Other processes with the same prefix can then import those tensors directly from a peer GPU, bypassing host memory and avoiding redundant prefill compute [1]. The design draws on ideas from vLLM’s local KV cache and LMCache’s multi-tier approach [1].
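To make the registry idea concrete, here is a minimal sketch in PyTorch of a prefix-KV registry keyed by a hash of the input token IDs and model version. This is not the project's actual API; the names prefix_key, KVRegistry, export_prefix, and import_prefix are hypothetical.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

import torch


def prefix_key(token_ids: List[int], model_version: str) -> str:
    """Hash the prompt prefix plus the model version into a registry key."""
    payload = model_version + "|" + ",".join(map(str, token_ids))
    return hashlib.sha256(payload.encode()).hexdigest()


class KVRegistry:
    """Maps prefix keys to (key, value) attention tensors resident on some GPU."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[torch.Tensor, torch.Tensor]] = {}

    def export_prefix(self, key: str, k: torch.Tensor, v: torch.Tensor) -> None:
        # The tensors stay on the exporting process's GPU; the registry only
        # records a handle to them.
        self._entries[key] = (k, v)

    def import_prefix(
        self, key: str, device: torch.device
    ) -> Optional[Tuple[torch.Tensor, torch.Tensor]]:
        entry = self._entries.get(key)
        if entry is None:
            return None  # cache miss: caller falls back to normal prefill
        k, v = entry
        # Device-to-device copy; with peer access enabled this bypasses host memory.
        return k.to(device, non_blocking=True), v.to(device, non_blocking=True)
```

A real cross-process version would need to share the registry itself (for example, metadata in a shared store plus CUDA IPC handles to the tensors) rather than a plain in-process dict.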
How it works - KV Marketplace focuses narrowly on the GPU-to-GPU fast path: peer-to-peer prefix reuse over RDMA or NVLink [1]. The code is intentionally minimal, with no distributed registry, eviction policy, or CPU/disk tiers yet [1]. Under optimistic conditions (perfect prefix importing), the prototype shows roughly a 15% latency reduction along with throughput gains [1].
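Below is a rough sketch of that fast path for two GPUs on one node, assuming NVLink or PCIe peer access and using plain PyTorch device-to-device copies; a cross-node RDMA path would need NCCL or a similar transport instead. The tensor shapes are illustrative, not taken from the project.

```python
import torch

src, dst = torch.device("cuda:0"), torch.device("cuda:1")

# Check whether the two GPUs can address each other directly (NVLink or PCIe P2P).
can_p2p = torch.cuda.can_device_access_peer(src, dst)

# Stand-in for an exported prefix KV pair living on the source GPU
# (shape [num_layers, num_heads, prefix_len, head_dim] chosen for illustration).
k = torch.randn(32, 8, 128, 64, dtype=torch.float16, device=src)
v = torch.randn_like(k)

if can_p2p:
    # .to() between two CUDA devices issues a direct device-to-device copy,
    # skipping host memory entirely.
    k_local = k.to(dst, non_blocking=True)
    v_local = v.to(dst, non_blocking=True)
else:
    # Slow path: stage through host memory, which is exactly what the
    # prototype's fast path is meant to avoid.
    k_local = k.cpu().to(dst)
    v_local = v.cpu().to(dst)

torch.cuda.synchronize(dst)
```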
Why it matters - It’s a minimal proof-of-concept aimed at teams exploring distributed LLM inference, caching, or RDMA transports [1]. If the latency reductions hold up outside of those optimistic conditions, the approach could influence how multi-GPU inference runtimes are designed and tuned.
Closing thought - The GitHub repo kv-marketplace offers a concrete starting point for experimenting with cross-GPU KV sharing and future cache-coherence ideas. Check the repo to follow updates and future enhancements [1].
References
[1] Show HN: KV Marketplace – share LLM attention caches across GPUs like memcached. Prototype sharing transformer prefix KV across GPUs via peer-to-peer cache; aims to reduce latency in distributed LLM inference.