I’m bearish on new hardware for AI training. The most important thing is the software stack, and thus far every vendor has failed to support PyTorch in a drop-in way.

The philosophy here seems to be “if we build it, they’ll buy it.” But suppose you wanted to train a GPT model on this specialized hardware. You’re looking at two months of R&D minimum to get everything rewritten, running, tested, and trained, plus an inference pipeline to generate samples.

And that’s just for GPT; you also lose all the other libraries people have written. This matters even more in GAN training: on a standard stack you can grab someone else’s FID implementation and drop it in without too much hassle, but on this specialized chip you’d have to write it from scratch.
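For concreteness, here’s roughly what “drop it in” looks like on a mainstream CUDA/PyTorch stack. I’m using the pytorch-fid package as the example; treat the exact argument names as approximate, since they vary by version:

    # Hedged sketch: scoring GAN samples with an off-the-shelf FID
    # implementation (pytorch-fid). Argument names may differ by version.
    import torch
    from pytorch_fid import fid_score

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Directories of real images and generated samples (placeholder paths).
    fid = fid_score.calculate_fid_given_paths(
        ["data/real_images", "samples/generated"],
        batch_size=50,
        device=device,
        dims=2048,  # standard InceptionV3 pool3 features
    )
    print(f"FID: {fid:.2f}")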

We had a similar situation in gamedev circa 2003–2009. Practically every year there was a new GPU boasting similar architectural improvements. But, for all its flaws, GL made those improvements “drop-in”: just opt in to the new extension and keep writing your GL code as you have been.

Ditto for Direct3D, except they took the attitude of “limit yourself to a specific API, not arbitrary extensions.” (Pixel Shader 2.0 was an awesome upgrade from 1.1.)

AI has no such standards, and it hurts. The M1 GPU in my new Air is supposedly ready for AI training. Imagine my surprise when I loaded up TensorFlow and saw that it doesn’t expose any GPU devices whatsoever. Apple’s build seems to transparently rewrite the CPU ops to run on the GPU automatically, which isn’t the behavior you’d expect.
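For reference, this is the check I mean; on a stock CUDA build of TensorFlow you’d expect at least one GPU in that list:

    # Minimal check for visible accelerators. tf.config.list_physical_devices
    # is standard TensorFlow; the surprise is what it returns on the M1 build.
    import tensorflow as tf

    print(tf.config.list_physical_devices("GPU"))  # [] on my M1 Air
    print(tf.config.list_physical_devices("CPU"))  # the CPU is all you get

    # On a CUDA box you could then place ops explicitly:
    # with tf.device("/GPU:0"):
    #     ...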

So I dig into Apple’s actual API for doing training, and holy cow, that looks miserable to write in Swift. I like how much control it gives you over allocation patterns, but I can’t imagine trying to do serious work in it on a daily basis.

What we need is a unified API that can easily support multiple backends: something like “PyTorch, but just enough PyTorch to trick everybody,” since supporting the full API seems to be beyond hardware vendors’ capabilities at the moment. (Lookin’ at you, Google. Love ya though.)
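To make that concrete, here’s a hypothetical sketch of the user-side experience I’m picturing. The “newchip” backend name is made up; everything else is ordinary PyTorch, and the point is that porting would be a one-line change:

    # Hypothetical sketch: if a vendor exposed its chip as a PyTorch device,
    # "porting" a training loop would be a one-line change. The "newchip"
    # backend is made up; everything else is ordinary PyTorch.
    import torch
    import torch.nn as nn

    USE_NEW_HARDWARE = False  # flip this if the vendor backend existed
    if USE_NEW_HARDWARE:
        device = torch.device("newchip")  # the only line that changes
    else:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10)).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)

    # Dummy batch standing in for a real DataLoader.
    x = torch.randn(64, 784, device=device)
    y = torch.randint(0, 10, (64,), device=device)

    for step in range(10):
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

A vendor that covered just enough of the device/tensor/autograd surface to make that loop run would get the rest of the library ecosystem basically for free.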

I don't really agree, at least from a longer-term perspective. It's early days yet, but XLA looks like a promising intermediate representation for letting the DL frameworks run on a wider array of hardware without user-facing software changes. It has traction with Google, NVIDIA, IIRC Intel, and maybe more. Others are definitely using the same approach of compute-graph splitting and subgraph scheduling, but I'm not certain whether they're using XLA specifically; I know some, like MindSpore, aren't.
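If you want to see what I mean by an IR: the easiest way to peek at the HLO that user code lowers to is through JAX, since it exposes the lowering directly. jax.xla_computation is the API in the versions I've used, so treat the exact call as approximate:

    # Peek at the XLA HLO that framework-level code lowers to. JAX is used
    # here only because it exposes the lowering directly; the HLO text is
    # what gets handed to the backend-specific XLA compiler (CPU/GPU/TPU).
    import jax
    import jax.numpy as jnp

    def f(x, w):
        return jnp.tanh(x @ w).sum()

    x = jnp.ones((8, 128))
    w = jnp.ones((128, 64))

    print(jax.xla_computation(f)(x, w).as_hlo_text())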

XLA has already proven its value by allowing PyTorch to run on TPUs (shittily, but that appears to be more of a VM/GCP infra problem than an XLA problem). The work done for TPUs (and to a lesser extent for GPU optimization) has started to expose the major issues, so work can begin on addressing them: the cost of dynamic XLA recompilation as tensor shapes change, and the fact that a lot of important code assumes accelerator-to-CPU communication isn't too expensive, which becomes a huge problem when you compile the graph into machine-specific code with XLA or anything similar, because it forces you to compile only small subgraphs.
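To make the shape issue concrete, here's a rough sketch with PyTorch/XLA (it assumes the torch_xla package and an actual XLA device like a TPU, so take it as illustrative rather than something you can run on a laptop):

    # Rough sketch of the dynamic-shape issue with PyTorch/XLA. Each distinct
    # input shape makes XLA compile a fresh graph, so variable-sized inputs
    # keep paying compilation cost instead of reusing one program.
    import torch
    import torch_xla.core.xla_model as xm

    device = xm.xla_device()
    conv = torch.nn.Conv2d(3, 16, 3).to(device)

    for size in (224, 256, 225, 240):  # arbitrary image sizes
        x = torch.randn(1, 3, size, size, device=device)
        out = conv(x).sum()
        xm.mark_step()  # cut the graph here and hand it to XLA
        # Each new size triggers another compilation; padding or bucketing
        # shapes is the usual workaround.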

It's early, but the rise of a really effective IR in XLA, combined with the huge amount of resources that Google and NVIDIA can pour into it, makes me very bullish on purpose-built hardware for AI training. It will take a while, I admit.

Mr/Mrs anonymous HN person, please put some info in your profile. You clearly have some deep knowledge of TPUs that I didn’t expect to pop up offhandedly on HN. You’re correct on all counts: dynamic tensor shapes are more or less impossible with XLA, which makes it effectively impossible to train a model on arbitrary image-size inputs, even though the math would allow for it; the PyTorch XLA work on TPUs is indeed kind of shitty, and I’m surprised as heck that literally anyone besides me said so; and XLA as an IR is promising for portability. Now I’m curious what you’ve been doing to have run into these things, since there don’t seem to be many others who have (or at least, who are vocal about it).

I agree with you, but I think we differ on our timetables. I am bearish for the next two years, at which point I’ll awaken from my slumber and become a flaming bull. (It helps to remember that “we overestimate the impact of years, but underestimate the impact of decades.” I try to plan accordingly.)

In other words, if you’re bullish that two years from now we’ll start seeing portability implemented in the field across various HPC chips, then we fully agree. But that’s also a glacial pace; GPT-2 changed the world almost two years ago now, and DALL-E seems to be the next frontier for doing interesting generative work. So, we’ll split the difference and say that the bears and bulls will meet in two years for a deep learning hackathon. As a bonus, the pandemic will be over by then, so it can be an in-person meetup.

Ah, I see. I think we're pretty much on the same page in terms of timetables. Although if you include TPUs, I think it's fair to say that custom accelerators are already a moderate success.

Updated my profile. I've been working on DL training platforms and distributed training benchmarking for a bit, so I've gotten a nice view into the GPU/TPU battle.

Shameless plug: you should check out the open-source training platform we are building, Determined[1]. One of the goals is to take our hard-earned expertise in training infrastructure and build a tool where people don't need that expertise themselves. We don't support TPUs, partly because of a lack of demand and TPU availability, and partly because our PyTorch TPU experiments were so unimpressive.

[1] GH: https://github.com/determined-ai/determined, Slack: https://join.slack.com/t/determined-community/shared_invite/...