LLM inference on TPU hardware is sparking a toolkit debate. A Reddit thread notes that plain model.generate() stalls on TPUs, and coaxing PyTorch XLA into action is fiddly. The takeaway: the JAX route often feels smoother, especially with stacks like MaxText and JetStream; and some users mention optimum-tpu as a server option with caveats [1].
- Low-effort path: convert the model to a JAX-compatible checkpoint and serve it with MaxText/JetStream. TPUs tend to behave better with static graphs (a minimal JAX illustration of that point follows this list) [1].
- High-effort path: stay with PyTorch, but learn the torch-xla/XLA quirks and refactor the decoding code [1].
- Manual decode loop + forced execution per token: this is the workaround that unblocks PyTorch decoding on TPUs, at the cost of extra complexity; a hedged sketch appears after this list [1].
- Avoid dynamic branching: start with greedy decoding and a fixed max_tokens budget, and keep tensor shapes static to reduce recompiles [1].
- TPU runtime knobs: consider the PJRT runtime, enable evaluation/inference modes, and use XLA_USE_BF16=1 (or FP16 on newer hardware) with version-matched torch/torch-xla (see the runtime snippet below) [1].
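To make the static-graph point concrete, here is a minimal, generic JAX sketch (not MaxText or JetStream code): `jax.jit` compiles one XLA program per input shape and dtype, so a decode step that always sees the same shapes compiles once and is replayed afterwards. The batch, sequence length, and vocabulary size below are arbitrary placeholders.

```python
import jax
import jax.numpy as jnp

@jax.jit
def greedy_step(logits):
    # Greedy next-token pick from the last position; no data-dependent branching.
    return jnp.argmax(logits[:, -1, :], axis=-1)

# Fixed (batch, seq_len, vocab) shape: the first call traces and compiles,
# every later call with the same shape reuses the cached XLA executable.
logits = jnp.zeros((1, 64, 32_000), dtype=jnp.bfloat16)
next_token = greedy_step(logits)
print(next_token.shape)  # (1,)
```

For the PyTorch path, a hedged sketch of the manual decode loop: a pre-allocated, fixed-length token buffer keeps input shapes constant across steps, and `xm.mark_step()` forces torch-xla to execute the pending graph for each token. The model name, prompt length, and token budget are placeholders, and refinements mentioned in the thread (KV caching, sampling, beam search) are deliberately left out.

```python
import torch
import torch_xla.core.xla_model as xm
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"      # placeholder; substitute your checkpoint
PROMPT_LEN = 64          # fixed prompt window (left-padded)
MAX_NEW_TOKENS = 32      # fixed decode budget, no early exit

device = xm.xla_device()
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"   # last real prompt token sits at index PROMPT_LEN - 1
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device).eval()

@torch.no_grad()
def greedy_decode(prompt: str) -> str:
    enc = tokenizer(prompt, return_tensors="pt", padding="max_length",
                    truncation=True, max_length=PROMPT_LEN)
    total_len = PROMPT_LEN + MAX_NEW_TOKENS

    # Pre-allocate fixed-size buffers so the model always sees the same shapes.
    input_ids = torch.full((1, total_len), tokenizer.pad_token_id, dtype=torch.long)
    attention_mask = torch.zeros((1, total_len), dtype=torch.long)
    input_ids[:, :PROMPT_LEN] = enc.input_ids
    attention_mask[:, :PROMPT_LEN] = enc.attention_mask
    input_ids = input_ids.to(device)
    attention_mask = attention_mask.to(device)

    for step in range(MAX_NEW_TOKENS):
        pos = PROMPT_LEN + step                             # slot to fill this step
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        next_token = logits[:, pos - 1, :].argmax(dim=-1)   # greedy, no sampling
        input_ids[:, pos] = next_token
        attention_mask[:, pos] = 1
        xm.mark_step()   # force XLA to execute the graph accumulated for this token

    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

print(greedy_decode("Explain why TPUs prefer static shapes."))
```

Input shapes stay fixed here, but the Python-level read index still changes per step, so the first request may trigger several compilations before the caches warm up; later requests reuse them. Avoiding that entirely usually means keeping indices on-device or reaching for a dedicated serving stack like JetStream.

Finally, a short runtime-setup sketch for the knobs in the last bullet, assuming matching torch/torch-xla versions. The environment variables must be set before torch-xla initializes the TPU runtime, and the matmul is only a sanity check, not part of any serving stack.

```python
import os

# Set before importing torch_xla so the runtime picks these up.
os.environ.setdefault("PJRT_DEVICE", "TPU")   # use the PJRT runtime
os.environ.setdefault("XLA_USE_BF16", "1")    # map float32 ops to bfloat16 on TPU

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()

with torch.no_grad():                 # inference mode: no autograd graph
    # For a real model you would also call model.eval() to disable dropout.
    x = torch.randn(128, 128, device=device)
    y = x @ x.T
    xm.mark_step()                    # execute the pending XLA graph
    print(y.abs().mean().item())      # pulls the result back to the host
```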
TL;DR: the quickest win is the JAX route. If you stay on PyTorch, use a manual decode loop with careful token handling and static shapes, then layer features like temperature and beam search back in later [1].
The broader takeaway from the discussion: teams pick the path that fits their existing stack and deployment timeline, trading off speed, stability, and developer friction when serving LLMs at scale [1].
References
[1] "[D] LLM Inference on TPUs". Reddit discussion comparing the PyTorch XLA and JAX paths for LLM inference on TPUs; suggests JAX for simplicity, with manual-loop workarounds for PyTorch.