What does HackerNews think of pachyderm?

Data-Centric Pipelines and Data Versioning

Language: Go

#30 in Docker

#33 in Go

Show HN: We scaled Git to support 1 TB repos | Dec 2022

There are a couple of other contenders in this space. DVC (https://dvc.org/) seems most similar.

If you're interested in something you can self-host... I work on Pachyderm (https://github.com/pachyderm/pachyderm), which doesn't have a Git-like interface, but also implements data versioning. Our approach de-duplicates between files (even very small files), and our storage algorithm doesn't create a number of objects proportional to directory nesting depth, as Xet appears to. (Xet is very much like Git in that respect.)
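To make the idea of de-duplicating between files concrete, here is a minimal sketch of a content-addressed chunk store. It is not Pachyderm's storage algorithm, which the comment does not describe; the `ChunkStore` class, its method names, and the fixed-size chunking are all assumptions chosen for illustration.

```python
import hashlib


class ChunkStore:
    """Toy content-addressed store (hypothetical): identical chunks are kept once."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}  # sha256 hex digest -> chunk bytes
        self.files = {}   # file path -> ordered list of chunk digests

    def put_file(self, path, data):
        """Split data into fixed-size chunks and store each distinct chunk once."""
        digests = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # no-op if already stored
            digests.append(digest)
        self.files[path] = digests

    def get_file(self, path):
        """Reassemble a file from its chunk digests."""
        return b"".join(self.chunks[d] for d in self.files[path])

    def stored_bytes(self):
        """Total bytes physically held, after de-duplication."""
        return sum(len(c) for c in self.chunks.values())
```

Storing two files that share most of their content then costs only the distinct chunks. A real system would more likely use content-defined chunk boundaries rather than fixed offsets, so that an insertion near the start of a file does not shift every later chunk.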

The data versioning system enables us to run pipelines based on changes to your data; the pipelines declare what files they read, and that allows us to schedule processing jobs that only reprocess new or changed data, while still giving you a full view of what "would" have happened if all the data had been reprocessed. This, to me, is the key advantage of data versioning; you can save hundreds of thousands of dollars on compute. Being able to undo an oopsie is just icing on the cake.
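The scheduling idea described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Pachyderm's implementation: `run_pipeline`, the dict-based commits, the `fnmatch` pattern, and the digest-keyed cache are all hypothetical stand-ins for "the pipeline declares what it reads, and only new or changed data is reprocessed".

```python
import fnmatch
import hashlib


def digest(data):
    """Content hash used to detect whether a file has changed."""
    return hashlib.sha256(data).hexdigest()


def run_pipeline(commit, pattern, process, cache):
    """Process files in `commit` matching `pattern`, reusing cached results.

    commit:  dict mapping path -> bytes (one version of the data)
    pattern: shell-style pattern declaring which files the pipeline reads
    process: function from bytes to bytes (the pipeline's work)
    cache:   dict mapping content digest -> previously computed output;
             a real system would also key on the pipeline's own version
    Returns (outputs, reprocessed_paths).
    """
    outputs, reprocessed = {}, []
    for path in sorted(commit):
        if not fnmatch.fnmatch(path, pattern):
            continue  # the pipeline never declared this file as input
        key = digest(commit[path])
        if key not in cache:
            cache[key] = process(commit[path])  # only new or changed data
            reprocessed.append(path)
        outputs[path] = cache[key]
    return outputs, reprocessed
```

Running it on a second commit that changes one file and adds another reprocesses just those two paths, while `outputs` still contains a result for every matching file, which is the "full view" the comment refers to.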

Xet's system for mounting a remote repo as a filesystem is a good idea. We do that too :)

Airflow's Problem | Aug 2022

I was at Airbnb when we open-sourced Airflow; it was a great solution to the problems we had at the time. It's amazing how many more use cases people have found for it since then. At the time it was pretty focused on solving our problem of orchestrating a largely static DAG of SQL jobs. It could do other stuff even then, but that was mostly what we were using it for. Airflow has become a victim of its success as it's expanded to meet every problem which could ever be considered a data workflow. The flaws and horror stories in the post and comments here definitely resonate with me.

Around the time Airflow was open-sourced, I started working on a data-centric approach to workflow management called Pachyderm[0]. By data-centric I mean that it's focused around the data itself, and its storage, versioning, orchestration and lineage. This leads to a system that feels radically different from a job-focused system like Airflow. In a data-centric system your spaghetti nest of DAGs is greatly simplified, as the data itself is used to describe most of the complexity. The benefit is that data is a lot simpler to reason about: it's not a living thing that needs to run in a certain way, it just exists, and because it's versioned you have strong guarantees about how it can change.

[0] https://github.com/pachyderm/pachyderm

Launch HN: Replicate (YC W20) – Version control for machine learning | Nov 2020

Congrats on the launch! This looks interesting, however I feel like this space is quite crowded. You mentioned that your most important feature is the fact that you are open-source, but off the top of my head I can think of several projects:

* Kubeflow: https://github.com/kubeflow/kubeflow

* MLFlow: https://github.com/mlflow/mlflow

* Pachyderm: https://github.com/pachyderm/pachyderm

* DVC: https://github.com/iterative/dvc

* Polyaxon: https://github.com/polyaxon/polyaxon

* Sacred: https://github.com/IDSIA/sacred

* pytorch-lightning + grid: https://github.com/PyTorchLightning/pytorch-lightning

* DeterminedAI: https://github.com/determined-ai/determined

* Metaflow: https://github.com/Netflix/metaflow

* Aim: https://github.com/aimhubio/aim

* And so many more...

In addition to this list, several other hosted platforms offer experiment tracking and model management. How do you compare to all of these tools, and why do you think users should move from one of them to Replicate? Thank you.

Ask HN: Who is hiring? (October 2019) | Oct 2019

Pachyderm (YC W15) -- San Francisco -- SF or remote (within North America) -- https://jobs.lever.co/pachyderm/

Positions:

* Core distributed systems/infrastructure engineer (Golang) -- You’ll be solving hard algorithmic and distributed systems problems every day and building a first-of-its-kind, containerized, data infrastructure platform.

* Front-end Engineer (Javascript) -- Your work will be focused on developing the UI, perfecting the user experience, and pioneering new products such as a hosted version of Pachyderm's data solution.

* DevOps -- Pachyderm is hiring a deployment and devops expert to own and lead our infrastructure, deployment, and testing processes. Experience with Kubernetes, CI/CD systems, testing infra, and running large-scale, data-heavy applications is important.

* Solutions Engineer/Architect -- Work with Pachyderm’s OSS and Enterprise customers to ensure their success. This is a customer facing role that bridges support, product, customer success, and engineering.

About Pachyderm:

Love Docker, Golang, Kubernetes and distributed systems? Pachyderm is an enterprise data science platform that offers Git-like version control semantics for massive data sets and end-to-end data lineage tracking and auditing. Teams that find themselves struggling to maintain a growing mess of advanced data science tasks such as machine learning or bioinformatics/genomics research use Pachyderm to greatly simplify their system and reduce development time. They rely on Pachyderm to do the heavy lifting so they can focus on the business logic in their data pipelines.

Pachyderm raised our Series A led by Benchmark (https://pachyderm.io/2018/11/15/Series-A.html), so you'd be getting in right at the ground floor and have an enormous impact on the success and direction of the company as well as building the rest of the engineering team.

Check us out at:

pachyderm.com

https://github.com/pachyderm/pachyderm

Ask HN: Who is hiring? (August 2019) | Aug 2019

Pachyderm (W15) | US remote or San Francisco | Senior Front-end Engineer

https://github.com/pachyderm/pachyderm

https://jobs.lever.co/pachyderm/

Pachyderm is looking for a Javascript expert to help lead the web front-end, enterprise dashboard UI, and cluster visualization layer of Pachyderm! Pachyderm is just 15 people right now, so you'd be getting in right at the ground floor and have an enormous impact on the success and direction of the company.

Experience with full product life cycles and designing interfaces that are easily updated over time as products evolve is a must.

We also offer significant equity, full benefits, and all the usual startup perks.

Other Positions: https://jobs.lever.co/pachyderm/

* Front-end JS engineer

* Full-stack backend/web services engineer

* Core distributed systems/infrastructure engineer (Golang)

Our hiring process is focused around strong communication skills and simulating our actual work environment, not BS coding questions.

Read more about our company vision and goals:

What would data analytics infrastructure (namely Hadoop) look like if we rebuilt it from scratch today? We think it would be containerized, modular, and easy enough for a single person to use while still being scalable enough for a whole company. Tools like Docker and Kubernetes provide the perfect building blocks for us to revolutionize data infrastructure!

https://medium.com/pachyderm-data/lets-build-a-modern-hadoop...

Databricks open-sources Delta Lake to make data lakes more reliable | Apr 2019

Throwing my own project's hat in the ring, Pachyderm[0] is open source, written in Go and built on Docker and Kubernetes. It version-controls your data, makes modifications atomic and tracks data lineage. You implement your pipelines in containers, so any tool you can put in a container can be used (and horizontally scaled) on Pachyderm.

[0] https://github.com/pachyderm/pachyderm

Torus: A Toolkit for Docker-First Data Science | Jun 2018

This is interesting! It sounds like this v1 gets your local environment up and running in a Docker container. I maintain something similar for analysts on my team, and we've seen success in terms of decreasing time spent on environment setup.

As another interesting use of Docker in the data space, I'm excited about Pachyderm [0] (though I haven't had the chance to use it in production). In particular, the data provenance story seems compelling.

0: https://github.com/pachyderm/pachyderm

CometML wants to do for machine learning what GitHub did for code | Apr 2018

Are you planning to open source it?

A lot of your competitors have, like http://pipeline.ai/, https://github.com/pachyderm/pachyderm and recently https://github.com/polyaxon/polyaxon.

Docker for Data Science | Feb 2018

Docker is really starting to be used a lot in data science. So is Kubernetes, as it makes it easy to run that code in a distributed way. An ecosystem of tools that help with this is starting to emerge too, such as Kubeflow [0], which brings Tensorflow to Kubernetes in a clean way. I work on one myself [1] that manages the data as well as using Docker containers for the code, so that your whole data process is reproducible.

[0] https://github.com/kubeflow/kubeflow

[1] https://github.com/pachyderm/pachyderm

Show HN: Interactive map for architecting big data pipelines | Jun 2017

If you're looking for something that doesn't constrain you to a particular language take a look at Pachyderm. It's built around containers so you can run any code you want. I designed it with JVM-phobes like you (and me) in mind.

https://github.com/pachyderm/pachyderm

Ask HN: How do you version your data? | Feb 2017

Check out Pachyderm [0]. It supports distributed, version-controlled data storage. The API is very Git-like: you modify data by making commits.
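As a rough picture of what "modifying data by making commits" means, here is a minimal in-memory sketch. It does not use Pachyderm's API; the `Repo` class and its `commit` and `read` methods are hypothetical names, and a real system would share unchanged data between snapshots instead of copying a dict.

```python
class Repo:
    """Toy versioned data repo (hypothetical): every change is a new commit."""

    def __init__(self):
        self.commits = []  # list of snapshots, each a dict of path -> bytes

    def head(self):
        """Latest snapshot, or an empty one if nothing has been committed."""
        return self.commits[-1] if self.commits else {}

    def commit(self, changes, deletes=()):
        """Apply a batch of changes as one new snapshot and return its id.

        Earlier snapshots are never mutated, so old versions stay readable.
        """
        snapshot = dict(self.head())
        snapshot.update(changes)
        for path in deletes:
            snapshot.pop(path, None)
        self.commits.append(snapshot)
        return len(self.commits) - 1

    def read(self, commit_id, path):
        """Read a file as it existed at a given commit."""
        return self.commits[commit_id][path]
```

Because readers always address a specific commit id, a batch of changes only becomes visible once its snapshot is appended, and any earlier version can still be read after later commits.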

[0] https://github.com/pachyderm/pachyderm

S4: Distributed stream computing platform from Apache | Jan 2016

Shameless self-promotion: Pachyderm takes a slightly different approach to this problem.

https://github.com/pachyderm/pachyderm