LLM inference on TPU hardware is sparking a toolkit debate. A Reddit thread notes that plain model.generate() stalls on TPUs, and coaxing PyTorch XLA into action is fiddly. The takeaway: the JAX route often feels smoother, especially with stacks like MaxText and JetStream; and some users mention optimum-tpu as a server option with caveats [1].
- Low-effort path: convert the model to a JAX-compatible checkpoint and serve it with MaxText/JetStream. TPUs tend to behave better with static graphs (a minimal JAX illustration of that point follows this list) [1].
- High-effort path: stay with PyTorch, but learn the torch-xla/XLA quirks and refactor the decoding code [1].
- Manual decode loop + forced execution per token: this is the workaround that unblocks PyTorch decoding on TPUs, at the cost of extra complexity; a hedged sketch appears after this list [1].
- Avoid dynamic branching: start with greedy decoding and a fixed max_tokens budget, and keep tensor shapes static to reduce recompiles [1].
- TPU runtime knobs: consider the PJRT runtime, enable evaluation/inference modes, and use XLA_USE_BF16=1 (or FP16 on newer hardware) with version-matched torch/torch-xla (see the runtime snippet below) [1].
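To make the static-graph point concrete, here is a minimal, generic JAX sketch (not MaxText or JetStream code): `jax.jit` compiles one XLA program per input shape and dtype, so a decode step that always sees the same shapes compiles once and is replayed afterwards. The batch, sequence length, and vocabulary size below are arbitrary placeholders.

```python
import jax
import jax.numpy as jnp

@jax.jit
def greedy_step(logits):
    # Greedy next-token pick from the last position; no data-dependent branching.
    return jnp.argmax(logits[:, -1, :], axis=-1)

# Fixed (batch, seq_len, vocab) shape: the first call traces and compiles,
# every later call with the same shape reuses the cached XLA executable.
logits = jnp.zeros((1, 64, 32_000), dtype=jnp.bfloat16)
next_token = greedy_step(logits)
print(next_token.shape)  # (1,)
```

For the PyTorch path, a hedged sketch of the manual decode loop: a pre-allocated, fixed-length token buffer keeps input shapes constant across steps, and `xm.mark_step()` forces torch-xla to execute the pending graph for each token. The model name, prompt length, and token budget are placeholders, and refinements mentioned in the thread (KV caching, sampling, beam search) are deliberately left out.

```python
import torch
import torch_xla.core.xla_model as xm
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"      # placeholder; substitute your checkpoint
PROMPT_LEN = 64          # fixed prompt window (left-padded)
MAX_NEW_TOKENS = 32      # fixed decode budget, no early exit

device = xm.xla_device()
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"   # last real prompt token sits at index PROMPT_LEN - 1
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device).eval()

@torch.no_grad()
def greedy_decode(prompt: str) -> str:
    enc = tokenizer(prompt, return_tensors="pt", padding="max_length",
                    truncation=True, max_length=PROMPT_LEN)
    total_len = PROMPT_LEN + MAX_NEW_TOKENS

    # Pre-allocate fixed-size buffers so the model always sees the same shapes.
    input_ids = torch.full((1, total_len), tokenizer.pad_token_id, dtype=torch.long)
    attention_mask = torch.zeros((1, total_len), dtype=torch.long)
    input_ids[:, :PROMPT_LEN] = enc.input_ids
    attention_mask[:, :PROMPT_LEN] = enc.attention_mask
    input_ids = input_ids.to(device)
    attention_mask = attention_mask.to(device)

    for step in range(MAX_NEW_TOKENS):
        pos = PROMPT_LEN + step                             # slot to fill this step
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        next_token = logits[:, pos - 1, :].argmax(dim=-1)   # greedy, no sampling
        input_ids[:, pos] = next_token
        attention_mask[:, pos] = 1
        xm.mark_step()   # force XLA to execute the graph accumulated for this token

    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

print(greedy_decode("Explain why TPUs prefer static shapes."))
```

Input shapes stay fixed here, but the Python-level read index still changes per step, so the first request may trigger several compilations before the caches warm up; later requests reuse them. Avoiding that entirely usually means keeping indices on-device or reaching for a dedicated serving stack like JetStream.

Finally, a short runtime-setup sketch for the knobs in the last bullet, assuming matching torch/torch-xla versions. The environment variables must be set before torch-xla initializes the TPU runtime, and the matmul is only a sanity check, not part of any serving stack.

```python
import os

# Set before importing torch_xla so the runtime picks these up.
os.environ.setdefault("PJRT_DEVICE", "TPU")   # use the PJRT runtime
os.environ.setdefault("XLA_USE_BF16", "1")    # map float32 ops to bfloat16 on TPU

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()

with torch.no_grad():                 # inference mode: no autograd graph
    # For a real model you would also call model.eval() to disable dropout.
    x = torch.randn(128, 128, device=device)
    y = x @ x.T
    xm.mark_step()                    # execute the pending XLA graph
    print(y.abs().mean().item())      # pulls the result back to the host
```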
TL;DR: the quickest win is the JAX route. If you stay on PyTorch, use a manual decode loop with careful token handling and static shapes, then layer features like temperature and beam search back in later [1].
The broader takeaway from the discussion: teams pick the path that fits their existing stack and deployment timeline, trading off speed, stability, and developer friction when serving LLMs at scale [1].
References
[1] "[D] LLM Inference on TPUs". Reddit discussion comparing the PyTorch XLA and JAX paths for LLM inference on TPUs; suggests JAX for simplicity, with manual-loop workarounds for PyTorch.