The philosophy here seems to be “if we build it, they’ll buy it.” But suppose you wanted to train a GPT model with this specialized hardware. That means you’re looking at two months of R&D, minimum, to get everything rewritten, running, tested, and trained, plus an inference pipeline to generate samples.
And that’s just for GPT — you lose all the other libraries people have written. This matters even more for GAN training: normally you can grab someone else’s FID implementation and drop it in without much hassle, but with this specialized chip you’d have to write it from scratch.
We had a similar situation in gamedev circa 2003-2009. Practically every year there was a new GPU, each boasting its own architectural improvements. But, for all its flaws, GL made those improvements “drop-in”: just opt in to the new extension and keep writing your GL code as you have been.
Ditto for Direct3D, except they took the attitude of “limit to a specific API, not arbitrary extensions.” (Pixel Shader 2.0 was an awesome upgrade from 1.1.)
AI has no such standards, and it hurts. The M1 GPU in my new Air is supposedly ready to do AI training. Imagine my surprise when I loaded up TensorFlow and saw that it doesn’t list any GPU devices whatsoever. They seem to transparently rewrite the CPU ops to run on the GPU automatically, which isn’t the behavior you’d expect.
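For what it’s worth, below is the ordinary way to check which devices TensorFlow exposes — a minimal sketch, nothing Apple-specific about it. On that build the GPU list comes back empty, even though work apparently still lands on the GPU behind the scenes.

    import tensorflow as tf

    # Standard device discovery: a CUDA build lists one entry per GPU.
    # On the M1 build described above this comes back empty, even though
    # ops still seem to get rewritten to run on the GPU behind the scenes.
    print(tf.config.list_physical_devices("GPU"))  # [] on that build

    # The usual way to pin work to an accelerator assumes the GPU shows up here:
    if tf.config.list_physical_devices("GPU"):
        with tf.device("/GPU:0"):
            x = tf.random.normal([1024, 1024])
            y = tf.matmul(x, x)
    else:
        print("No GPU device exposed; falling back to default placement.")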
So I dig into Apple’s actual API for doing training, and holy cow, that looks miserable to write in Swift. I like how much control it gives you over allocation patterns, but I can’t imagine trying to do serious work in it on a daily basis.
What we need is a unified API that can easily support multiple backends: something like “PyTorch, but just enough PyTorch to trick everybody,” since supporting the full API seems to be beyond hardware vendors’ capabilities at the moment. (Lookin’ at you, Google. Love ya though.)
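To be concrete about what “just enough PyTorch” might mean, here’s a purely hypothetical sketch: a tiny op surface that user code calls and that each vendor implements once. The Backend class and the matmul/relu names are invented for illustration; they are not any real framework or vendor API.

    from abc import ABC, abstractmethod
    import numpy as np

    class Backend(ABC):
        """The minimal op set a hardware vendor would have to implement."""
        @abstractmethod
        def matmul(self, a, b): ...
        @abstractmethod
        def relu(self, x): ...

    class CPUBackend(Backend):
        def matmul(self, a, b):
            return np.matmul(a, b)
        def relu(self, x):
            return np.maximum(x, 0)

    # A vendor would subclass Backend for their accelerator and swap it in here;
    # user code keeps calling the same handful of functions either way.
    _active = CPUBackend()

    def matmul(a, b):
        return _active.matmul(a, b)

    def relu(x):
        return _active.relu(x)

    x = np.random.randn(4, 8)
    w = np.random.randn(8, 2)
    print(relu(matmul(x, w)).shape)  # (4, 2) on any backend

The point is that the surface area a vendor has to port is a few dozen ops, not the entire PyTorch API.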
XLA has already proven its value by allowing PyTorch to run on TPUs (shittily, but that appears to be more of a VM/GCP infra problem than an XLA problem). The work done for TPUs (and, to a lesser extent, for GPU optimization) has started to expose the major issues, so work can begin on addressing them: the cost of recompiling XLA graphs whenever tensor shapes change, and the way lots of important code assumes accelerator-to-CPU communication isn’t too expensive. That assumption becomes a huge problem when you try to compile the graph into machine-specific code with XLA or something similar, because every trip back to the CPU forces you to compile only small subgraphs.
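To make those two issues concrete, here’s a small PyTorch/XLA sketch, assuming torch_xla is installed and a TPU (or other XLA device) is attached; the shapes and sizes are arbitrary. Each new batch size triggers a fresh compilation, and pulling a scalar back to the host cuts the traced graph short.

    import torch
    import torch_xla.core.xla_model as xm

    device = xm.xla_device()              # the attached TPU / XLA device
    w = torch.randn(512, 512, device=device)

    # Issue 1: dynamic shapes. XLA compiles one executable per distinct input
    # shape, so every new batch size here means another compilation.
    for batch in (32, 48, 64):            # three shapes -> three compiles
        x = torch.randn(batch, 512, device=device)
        y = torch.relu(x @ w)
        xm.mark_step()                    # flush and execute the pending graph

    # Issue 2: accelerator-to-CPU traffic. Calling .item() (or printing a
    # tensor) forces a device-to-host sync, which ends the subgraph being
    # traced; code peppered with these can only ever compile small graphs.
    loss = (y ** 2).mean()
    print(loss.item())                    # host round trip = graph break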
It's early, but the rise of a really effective IR in XLA, combined with the huge amount of resources that Google/NVIDIA can pour into it, makes me very bullish on purpose-built hardware for AI training. It will take a while, I admit.
I agree with you, but I think we differ on our timetables. I am bearish for the next two years, at which point I’ll awaken from my slumber and become a flaming bull. (It helps to remember that “we overestimate the impact of years, but underestimate the impact of decades.” I try to plan accordingly.)
In other words, if you’re bullish that two years from now we’ll start seeing portability implemented in the field across various HPC chips, then we fully agree. But that’s also a glacial pace; GPT-2 changed the world almost two years ago now, and DALL-E seems to be the next frontier for doing interesting generative work. So, we’ll split the difference and say that the bears and bulls will meet in two years for a deep learning hackathon. As a bonus, the pandemic will be over by then, so it can be an in-person meetup.
Updated my profile. I've been working on DL training platforms and distributed training benchmarking for a bit, so I've gotten a nice view into the GPU/TPU battle.
Shameless plug: you should check out the open-source training platform we are building, Determined[1]. One of the goals is to take our hard-earned expertise in training infrastructure and build a tool where people don't need that infrastructure expertise themselves. We don't support TPUs, partly because of a lack of demand/TPU availability, and partly because our PyTorch TPU experiments were so unimpressive.
[1] GH: https://github.com/determined-ai/determined, Slack: https://join.slack.com/t/determined-community/shared_invite/...