This is very interesting. Can someone talk about the roadmap of PyTorch here? It seems everyone is kind of rolling their own -
PyTorch has a very confusing distributed-training story:
- OpenAI runs PyTorch on Kubernetes with hand-rolled MPI+SSH
- https://pytorch.org/tutorials/beginner/dist_overview.html
- https://pytorch.org/docs/stable/distributed.elastic.html
- https://pytorch.org/torchx/latest/
- https://www.kubeflow.org/docs/components/training/pytorch/
- PyTorch-BigGraph specifically uses torch.distributed with the gloo backend (an MPI backend is also supported).
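To make that last option concrete, here is a minimal single-process sketch of initializing torch.distributed with the gloo backend. The address, port, and world size are placeholders for illustration; a real job would launch one such process per rank (e.g. via torchrun) rather than hard-coding rank 0:

```python
import os

import torch
import torch.distributed as dist

# Rendezvous info every rank needs; normally injected by the launcher.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # placeholder address
os.environ.setdefault("MASTER_PORT", "29500")      # placeholder port

# gloo is the CPU-friendly backend PyTorch-BigGraph relies on.
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# A collective op; with world_size=1 this all_reduce is effectively a no-op.
t = torch.ones(4)
dist.all_reduce(t)

dist.destroy_process_group()
```

With more ranks, the same code runs unchanged; only the rank/world_size values (supplied by the launcher) differ per process.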
So here's the question - if you're a two-person startup that wants to do PyTorch distributed training using one of the cloud-managed EKS/AKS/GKE services... what should you use?
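For that managed-Kubernetes case, the Kubeflow training operator linked above is probably the most direct route: you describe the job declaratively and the operator wires up MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE for each pod. A minimal sketch of a PyTorchJob manifest (the job name, image, and entrypoint are placeholders) might look like:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: example-job                        # placeholder name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch                # the operator expects this name
              image: my-registry/train:latest   # placeholder image
              command: ["python", "train.py"]   # placeholder entrypoint
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/train:latest
              command: ["python", "train.py"]
```

Applied with kubectl to a cluster running the training operator, this would give one master and three worker pods, each running the same script with its rendezvous environment pre-populated.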
it is far more efficient (in terms of time-to-market) to rent managed services and build on top of them.
Kubernetes is where the wider ecosystem is. I don't like it... but it is what it is.
So Grid.ai is something like AWS SageMaker. I wanted to figure out what someone can use on a ready-made Kubernetes cluster.
Disclaimer: I work for Determined.