> PyTorch/Nvidia GPUs easily overtaking TensorFlow/Google TPUs.
TF lost to PyTorch, and that's Google's fault - TF's APIs are both insane and badly documented.
But nothing comes close to the performance of Google's exaflop TPU mega-clusters. Nvidia is not even in the same ballpark.
An Nvidia A100 DGX SuperPOD is equivalent to an exaflop TPU pod. No?
Perhaps Nvidia is close now. It's a bit hard to say without specific hardware info.
Google's pods were already available 5-6 years ago, and the current versions are probably even faster. They have super-fast optical interconnects in a torus or hyper-torus configuration that allow synchronous weight updates across 1k+ TPUs. That gives dramatically lower training times and less gradient noise, which leads to better-performing models. I.e., you can't even train a model to the same level on traditional GPUs.
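For anyone unfamiliar with what "synchronous weight updates" means here: each worker computes a gradient on its own data shard, an all-reduce averages those gradients, and every worker applies the identical update. A minimal sketch, simulated in NumPy (the `all_reduce_mean` helper is a stand-in for the hardware collective the fast interconnect accelerates, not a real TPU API):

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])  # target weights of a toy linear model

def local_gradient(w, x_shard, y_shard):
    # Gradient of mean squared error for y = x @ w on one worker's shard.
    err = x_shard @ w - y_shard
    return 2.0 * x_shard.T @ err / len(x_shard)

def all_reduce_mean(grads):
    # Stand-in for the all-reduce collective that a torus interconnect
    # makes fast: average the per-worker gradients.
    return np.mean(grads, axis=0)

n_workers, lr = 8, 0.1
w = np.zeros(3)

for step in range(200):
    grads = []
    for _ in range(n_workers):
        x = rng.normal(size=(32, 3))   # each worker's mini-batch shard
        y = x @ w_true
        grads.append(local_gradient(w, x, y))
    # Synchronous step: every worker applies the same averaged gradient.
    w -= lr * all_reduce_mean(grads)

print(np.round(w, 3))
```

The point about noise: averaging 8 shards of 32 examples gives an effective batch of 256 per step, so the update direction is far less noisy than any single worker's gradient - and a fast interconnect is what makes doing this every step across 1k+ chips cheap.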
Once they started to get deployed, models that had trained for 3 weeks on 30 GPUs were trained in 30 minutes on a 1k-TPU cluster.
All this reiterates the main point of the article - Google had a tremendous lead and wasted it through lack of vision and product-execution ability.