>Distributed Training: Gloo is now supported for distributed training jobs.

This is very interesting. Can someone talk about the roadmap of PyTorch here? It seems everyone is kind of rolling their own setup.

PyTorch has a very confusing distributed-training story:

- OpenAI runs PyTorch on Kubernetes with hand-rolled MPI + SSH

- https://pytorch.org/tutorials/beginner/dist_overview.html

- https://pytorch.org/docs/stable/distributed.elastic.html

- https://pytorch.org/torchx/latest/

- https://www.kubeflow.org/docs/components/training/pytorch/

- PyTorch-BigGraph specifically uses torch.distributed with the Gloo backend (torch.distributed also offers an MPI backend); a minimal init sketch follows this list.
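
For concreteness, here is a minimal sketch of the torch.distributed-with-Gloo pattern that BigGraph (and most of the tools above) sit on top of. Nothing here is BigGraph-specific; the environment variables are the standard ones that any launcher (torchrun, a Kubeflow PyTorchJob, an mpirun wrapper) sets for each worker process:

```python
import torch
import torch.distributed as dist

def init_distributed():
    # With init_method="env://", torch.distributed reads MASTER_ADDR,
    # MASTER_PORT, RANK, and WORLD_SIZE from the environment, so the
    # same script works under any launcher that sets those variables.
    dist.init_process_group(
        backend="gloo",      # CPU-friendly collectives; "nccl" for GPUs, "mpi" if built with MPI
        init_method="env://",
    )

    # Smoke test: every rank contributes 1, so the all-reduced value
    # should equal the world size on every process.
    t = torch.tensor([1.0])
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    assert int(t.item()) == dist.get_world_size()
```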

So here's the question: if you're a two-person startup that wants to do PyTorch distributed training on one of the cloud-managed EKS/AKS/GKE services, what should you use?

The PyTorch Lightning people have come up with Grid.ai. I personally have gotten good results using PyTorch Lightning plus Slurm on HPC machines. If I were a startup, I would probably try to build my own small HPC cluster, since that is far more cost-effective than renting.
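
As an illustration of that Lightning-plus-Slurm combination: Lightning detects the SLURM_* environment variables that srun sets, so a single script scales from one GPU to multiple nodes as long as the Trainer arguments match the sbatch allocation. This is a minimal sketch assuming a recent Lightning release (older versions spelled these arguments differently, e.g. gpus=), and MyLightningModule is a placeholder for your own module:

```python
import pytorch_lightning as pl

# MyLightningModule is a hypothetical stand-in for your own LightningModule.
model = MyLightningModule()

# Under `srun`, Lightning picks up SLURM_NTASKS, SLURM_NODEID, etc.,
# so no explicit rank/world-size wiring is needed in user code.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,       # GPUs per node; match --ntasks-per-node=4 in the sbatch file
    num_nodes=2,     # match --nodes=2
    strategy="ddp",  # standard DistributedDataParallel
)
trainer.fit(model)
```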
Most early-stage startups get tens of thousands of dollars of free AWS credits: https://aws.amazon.com/activate/ ($100K if you're part of a university accelerator).

It is far, far more efficient, in time-to-market terms, to rent and build on top of managed services.

Kubernetes is where the wider ecosystem is. I don't like it, but it is what it is.

So Grid.ai is something like AWS SageMaker. I wanted to figure out what someone can use on a ready-made Kubernetes cluster.

Check out Determined (https://github.com/determined-ai/determined). It supports deploying onto k8s and handles running Horovod (and soon other distributed-training backends), with most of the complexity abstracted behind a few configuration values. It also gives you experiment tracking, hyperparameter search (ASHA), scheduling, profiling, etc. A rough sketch of its trial API follows below.

Disclaimer: I work for Determined.
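
Since the thread is really about how much plumbing each option hides, here is a rough sketch of what a Determined PyTorchTrial looks like. Treat the helper names (wrap_model, wrap_optimizer, get_per_slot_batch_size) as assumptions to verify against the current docs, and note the model and data below are throwaway placeholders, not Determined's examples:

```python
import torch
from determined.pytorch import DataLoader, PyTorchTrial, PyTorchTrialContext

class SketchTrial(PyTorchTrial):
    def __init__(self, context: PyTorchTrialContext):
        self.context = context
        # A trivial model just to keep the sketch self-contained.
        self.model = context.wrap_model(torch.nn.Linear(28 * 28, 10))
        # "lr" would come from the hyperparameters section of the
        # experiment config (where the ASHA searcher is configured too).
        self.optimizer = context.wrap_optimizer(
            torch.optim.Adam(self.model.parameters(),
                             lr=context.get_hparam("lr"))
        )

    def train_batch(self, batch, epoch_idx, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self.model(x), y)
        # Determined owns the distributed backward/step plumbing
        # (Horovod underneath), so there is no DDP/MPI setup anywhere.
        self.context.backward(loss)
        self.context.step_optimizer(self.optimizer)
        return {"loss": loss}

    def evaluate_batch(self, batch):
        x, y = batch
        return {"validation_loss":
                torch.nn.functional.cross_entropy(self.model(x), y)}

    def _random_loader(self):
        # Fabricated random data, only so the sketch runs end to end.
        ds = torch.utils.data.TensorDataset(
            torch.randn(256, 28 * 28), torch.randint(0, 10, (256,)))
        return DataLoader(
            ds, batch_size=self.context.get_per_slot_batch_size())

    def build_training_data_loader(self):
        return self._random_loader()

    def build_validation_data_loader(self):
        return self._random_loader()
```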