If you're interested in something you can self-host... I work on Pachyderm (https://github.com/pachyderm/pachyderm), which doesn't have a Git-like interface but does implement data versioning. Our approach de-duplicates between files (even very small files), and our storage algorithm doesn't create O(n) objects for a directory nesting depth of n, as Xet's appears to. (Xet is very much like Git in that respect.)
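To make "de-duplicates between files" concrete, here is a toy sketch of a content-addressed chunk store. This is not Pachyderm's actual storage algorithm; the `ChunkStore` class, its method names, and the fixed 4-byte chunk size are all made up for illustration (real systems typically use much larger, often content-defined, chunks).

```python
import hashlib


class ChunkStore:
    """Toy content-addressed store (hypothetical): identical chunks are kept once."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}  # sha256 hex digest -> chunk bytes
        self.files = {}   # path -> ordered list of chunk digests

    def put_file(self, path, data):
        """Split data into fixed-size chunks and store each distinct chunk once."""
        digests = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # setdefault leaves an already-stored chunk untouched
            self.chunks.setdefault(digest, chunk)
            digests.append(digest)
        self.files[path] = digests

    def get_file(self, path):
        """Reassemble a file from its chunk digests."""
        return b"".join(self.chunks[d] for d in self.files[path])


store = ChunkStore(chunk_size=4)
store.put_file("a.txt", b"aaaabbbbcccc")
store.put_file("b.txt", b"bbbbccccdddd")
# The two files reference six chunks in total, but only four distinct
# chunks are stored, because "bbbb" and "cccc" are shared between them.
```

The point of the sketch is only that storage cost tracks distinct content rather than file count, which is why sharing works even across small files.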
The data versioning system enables us to run pipelines based on changes to your data; the pipelines declare what files they read, and that allows us to schedule processing jobs that only reprocess new or changed data, while still giving you a full view of what "would" have happened if all the data had been reprocessed. This, to me, is the key advantage of data versioning; you can save hundreds of thousands of dollars on compute. Being able to undo an oopsie is just icing on the cake.
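As a rough illustration of the idea (not Pachyderm's real scheduler or API), the sketch below hashes each input file, compares against the previous snapshot, and reprocesses only new or changed files while reusing earlier results for the rest. The function names `snapshot`, `changed_paths`, and `run_incremental` are hypothetical.

```python
import hashlib


def snapshot(files):
    """Map each path to a content hash; stands in for a data 'commit'."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def changed_paths(old, new):
    """Paths that are new or whose content hash differs from the old snapshot."""
    return sorted(p for p, h in new.items() if old.get(p) != h)


def run_incremental(files, prev_snapshot, prev_outputs, process):
    """Reprocess only changed files; reuse previous outputs for unchanged ones.

    Assumes prev_outputs holds a result for every path in prev_snapshot.
    """
    snap = snapshot(files)
    todo = changed_paths(prev_snapshot, snap)
    # Unchanged files keep their earlier result; deleted files drop out.
    outputs = {p: prev_outputs[p] for p in snap if p not in todo}
    for p in todo:
        outputs[p] = process(files[p])
    return snap, outputs, todo


calls = []


def word_count(data):
    calls.append(data)  # record each real invocation to show the savings
    return len(data.split())


v1 = {"a.txt": b"one two", "b.txt": b"three"}
snap1, out1, todo1 = run_incremental(v1, {}, {}, word_count)

v2 = {"a.txt": b"one two", "b.txt": b"three four five", "c.txt": b"six"}
snap2, out2, todo2 = run_incremental(v2, snap1, out1, word_count)
# The second run only processes b.txt (changed) and c.txt (new), yet out2
# still covers all three files, as if everything had been reprocessed.
```

The compute savings come from the second call: the output is complete, but `word_count` ran on two files instead of three.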
Xet's system for mounting a remote repo as a filesystem is a good idea. We do that too :)
* Kubeflow: https://github.com/kubeflow/kubeflow
* MLflow: https://github.com/mlflow/mlflow
* Pachyderm: https://github.com/pachyderm/pachyderm
* DVC: https://github.com/iterative/dvc
* Polyaxon: https://github.com/polyaxon/polyaxon
* Sacred: https://github.com/IDSIA/sacred
* pytorch-lightning + grid: https://github.com/PyTorchLightning/pytorch-lightning
* DeterminedAI: https://github.com/determined-ai/determined
* Metaflow: https://github.com/Netflix/metaflow
* Aim: https://github.com/aimhubio/aim
* And so many more...
In addition to this list, several other hosted platforms offer experiment tracking and model management. How do you compare to all of these tools, and why do you think users should move from one of them to Replicate? Thank you.
Positions:
* Core distributed systems/infrastructure engineer (Golang) - You’ll be solving hard algorithmic and distributed systems problems every day and building a first-of-its-kind, containerized, data infrastructure platform.
* Front-end Engineer (Javascript) - Your work will be focused on developing the UI, perfecting the user experience, and pioneering new products such as a hosted version of Pachyderm's data solution.
* DevOps -- Pachyderm is hiring a deployment and devops expert to own and lead our infrastructure, deployment, and testing processes. Experience with Kubernetes, CI/CD systems, testing infra, and running large-scale, data-heavy applications is important.
* Solutions Engineer/Architect -- Work with Pachyderm’s OSS and Enterprise customers to ensure their success. This is a customer-facing role that bridges support, product, customer success, and engineering.
About Pachyderm:
Love Docker, Golang, Kubernetes and distributed systems? Pachyderm is an enterprise data science platform that offers Git-like version control semantics for massive data sets and end-to-end data lineage tracking and auditing. Teams that find themselves struggling to maintain a growing mess of advanced data science tasks, such as machine learning or bioinformatics/genomics research, use Pachyderm to greatly simplify their systems and reduce development time. They rely on Pachyderm to do the heavy lifting so they can focus on the business logic in their data pipelines.
We raised our Series A, led by Benchmark (https://pachyderm.io/2018/11/15/Series-A.html), so you'd be getting in right at the ground floor and would have an enormous impact on the success and direction of the company, as well as on building the rest of the engineering team.
Check us out at:
pachyderm.com
https://github.com/pachyderm/pachyderm
https://jobs.lever.co/pachyderm/
Pachyderm is looking for a JavaScript expert to help lead the web front-end, enterprise dashboard UI, and cluster visualization layer of Pachyderm! Pachyderm is just 15 people right now, so you'd be getting in right at the ground floor and would have an enormous impact on the success and direction of the company.
Experience with full product life cycles and designing interfaces that are easily updated over time as products evolve is a must.
We also offer significant equity, full benefits, and all the usual startup perks.
Other Positions: https://jobs.lever.co/pachyderm/
* Front-end JS engineer
* Full-stack backend/web services engineer
* Core distributed systems/infrastructure engineer (Golang)
Our hiring process is focused around strong communication skills and simulating our actual work environment, not BS coding questions.
Read more about our company vision and goals:
What would data analytics infrastructure (namely Hadoop) look like if we rebuilt it from scratch today? We think it would be containerized, modular, and easy enough for a single person to use while still being scalable enough for a whole company. Tools like Docker and Kubernetes provide the perfect building blocks for us to revolutionize data infrastructure!
https://medium.com/pachyderm-data/lets-build-a-modern-hadoop...
As another interesting use of Docker in the data space, I'm excited about Pachyderm [0] (though I haven't had the chance to use it in production). In particular, the data provenance story seems compelling.
A lot of your competitors have, such as http://pipeline.ai/, https://github.com/pachyderm/pachyderm, and more recently https://github.com/polyaxon/polyaxon.