TPU 8t & 8i at Google Cloud Next ’26: Training, Inference, and the Agentic Stack
In April 2026, Google Cloud published detailed infrastructure updates timed to Google Cloud Next ’26. This article draws its factual claims from Google’s official posts: "AI infrastructure at Next ’26" (April 22, 2026) and the companion recap on Google’s blog (April 24, 2026).
Why Google frames “agentic” infrastructure differently
Google describes the agentic era as one where a single user intent triggers multi-step, multi-agent workflows with tool calls, state, and tight latency budgets. That pattern stresses four layers at once: CPUs for orchestration, accelerators for models, network fabric for scale-out, and storage that can feed GPUs/TPUs without becoming the bottleneck.
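To make the latency-budget idea concrete, here is a minimal sketch that adds up the steps of one agentic request and groups the time by the resource each step leans on. The step names, the resource labels, and every millisecond figure are invented for illustration. None of them come from Google's posts, and the sketch assumes the steps run strictly one after another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    resource: str      # "cpu", "accelerator", "network", or "storage"
    latency_ms: float  # hypothetical figure, not a measurement


def total_latency_ms(steps):
    """End-to-end latency, assuming the steps run sequentially."""
    return sum(step.latency_ms for step in steps)


def latency_by_resource(steps):
    """Sum latency per resource so the dominant layer is visible."""
    totals = {}
    for step in steps:
        totals[step.resource] = totals.get(step.resource, 0.0) + step.latency_ms
    return totals


def slowest_resource(steps):
    """Name of the resource that accounts for the most latency."""
    totals = latency_by_resource(steps)
    return max(totals, key=totals.get)


# Hypothetical workflow: plan, orchestrate, retrieve, call a tool, answer.
EXAMPLE_WORKFLOW = [
    Step("plan with LLM", "accelerator", 300.0),
    Step("orchestrate agents", "cpu", 40.0),
    Step("retrieve documents", "storage", 180.0),
    Step("external tool call", "network", 650.0),
    Step("final answer with LLM", "accelerator", 500.0),
]

if __name__ == "__main__":
    budget_ms = 2000.0  # assumed budget for the whole request
    total = total_latency_ms(EXAMPLE_WORKFLOW)
    print(f"total: {total:.0f} ms, within budget: {total <= budget_ms}")
    print(f"by resource: {latency_by_resource(EXAMPLE_WORKFLOW)}")
    print(f"largest share: {slowest_resource(EXAMPLE_WORKFLOW)}")
```

With these made-up numbers, more than half of the request time is spent outside the accelerator, which is the kind of split the four-layer framing is meant to surface.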
TPU 8t (training) and TPU 8i (inference / RL)
- TPU 8t: positioned as a training system—Google states roughly 3× higher compute than prior generations, with a cited configuration of 9,600 chips in one superpod delivering 121 exaflops and two petabytes of shared memory over high-speed ICI interconnects.
- TPU 8i: optimized for inference and reinforcement learning (RL); Google cites tripled on-chip SRAM (384 MB), 288 GB of HBM, doubled ICI bandwidth (19.2 Tb/s), and up to 80% better inference performance per dollar than the prior generation, by Google's own accounting.
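The headline figures above are easier to compare once divided down. The following back-of-envelope sketch derives per-chip numbers from the quoted TPU 8t superpod totals and the prior-generation values implied by "tripled" and "doubled" for TPU 8i. It assumes decimal units (1 exaflop = 1,000 petaflops, 1 PB = 1,000,000 GB) and an even split across chips. The numeric precision behind the 121-exaflop figure is not stated here, so the per-chip result is only a rough guide.

```python
# Figures quoted in the bullets above (attributed to Google).
SUPERPOD_CHIPS = 9_600
SUPERPOD_EXAFLOPS = 121.0
SUPERPOD_SHARED_MEMORY_PB = 2.0
TPU_8I_SRAM_MB = 384.0
TPU_8I_ICI_TBPS = 19.2


def per_chip_petaflops(exaflops, chips):
    """Even split of superpod compute; 1 exaflop = 1,000 petaflops."""
    return exaflops * 1_000 / chips


def per_chip_memory_gb(petabytes, chips):
    """Even split of shared memory; 1 PB = 1,000,000 GB (decimal units)."""
    return petabytes * 1_000_000 / chips


def implied_previous(value, factor):
    """Prior-generation value implied by an 'N times larger' claim."""
    return value / factor


if __name__ == "__main__":
    pf = per_chip_petaflops(SUPERPOD_EXAFLOPS, SUPERPOD_CHIPS)
    gb = per_chip_memory_gb(SUPERPOD_SHARED_MEMORY_PB, SUPERPOD_CHIPS)
    print(f"TPU 8t: about {pf:.1f} petaflops and {gb:.0f} GB per chip")
    print(f"Implied prior SRAM: {implied_previous(TPU_8I_SRAM_MB, 3):.0f} MB")
    print(f"Implied prior ICI: {implied_previous(TPU_8I_ICI_TBPS, 2):.1f} Tb/s")
```

Under those assumptions the superpod works out to about 12.6 petaflops and about 208 GB of shared memory per chip, and the TPU 8i claims imply a prior generation with 128 MB of SRAM and 9.6 Tb/s of ICI bandwidth. These are derived values, not figures Google published.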
Architecture details: see Google’s technical deep dive linked from the main Next ’26 compute article.
Networking, storage, and Kubernetes for agents
Google highlights Virgo Network as a high-bandwidth data-center fabric for AI scale-out, Managed Lustre with large aggregate bandwidth, and GKE improvements (faster node/pod startup, model loading, and Inference Gateway routing). Native PyTorch on TPU (TorchTPU) appears as part of the open-software story alongside JAX and vLLM on TPU.
What students should take away
If you deploy RAG or agents, your bottleneck may not be the LLM—it may be retrieval latency, tool RTT, KV cache memory, or batching. Reading vendor-neutral guides (plus Google’s own numbers) helps you ask better questions when you move from notebooks to production.
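KV cache memory is the easiest of these bottlenecks to estimate on paper. The sketch below uses the standard sizing rule for transformer attention (one key and one value vector per layer, per KV head, per token) to estimate how many full-length sequences fit next to the model weights. The model shape, the 140 GB weight footprint, and the 8,192-token context are hypothetical. Only the 288 GB HBM figure comes from the TPU 8i description above, and real serving stacks reserve extra memory for activations and fragmentation that this ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # assumes bf16/fp16 cache entries


def kv_cache_bytes_per_token(model):
    """Bytes of KV cache one token occupies across all layers.

    The factor 2 covers the key vector and the value vector.
    """
    return (2 * model.num_layers * model.num_kv_heads
            * model.head_dim * model.bytes_per_value)


def max_concurrent_sequences(model, hbm_gb, weights_gb, context_tokens):
    """Full-length sequences that fit in the HBM left over after weights."""
    free_bytes = max(0.0, (hbm_gb - weights_gb) * 1e9)
    per_sequence = kv_cache_bytes_per_token(model) * context_tokens
    return int(free_bytes // per_sequence)


# Hypothetical 70B-class shape with grouped-query attention.
EXAMPLE_MODEL = ModelShape(num_layers=80, num_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    per_token = kv_cache_bytes_per_token(EXAMPLE_MODEL)
    fits = max_concurrent_sequences(
        EXAMPLE_MODEL, hbm_gb=288, weights_gb=140, context_tokens=8192
    )
    print(f"KV cache per token: {per_token:,} bytes")
    print(f"8,192-token sequences that fit: {fits}")
```

For this hypothetical shape each token costs 327,680 bytes of cache, so a single 8,192-token sequence needs roughly 2.7 GB and about 55 such sequences fit in the remaining memory. Doubling the context length halves that count, which is why long-context agents can hit memory limits well before they hit compute limits.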