What does HackerNews think of data-diff?

Compare tables within or across databases

Language: Python

#48 in Database
#21 in MySQL
#30 in PostgreSQL
#50 in PostgreSQL
#164 in Python
#44 in SQL
If the issue happens a lot, there is also: https://github.com/datafold/data-diff

It's a nice tool for doing this across databases as well.

I think it's based on a checksum method.
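
For readers unfamiliar with the idea, here is a minimal sketch of what a checksum-based diff can look like. It is not data-diff's actual implementation: the `id`/`value` schema and the helper names `segment_checksum` and `diff_by_checksum` are made up for illustration, and it uses in-memory SQLite with hashing done in Python. A real cross-database tool would more likely compute the hash inside each database so that only checksums cross the network.

```python
import hashlib
import sqlite3


def segment_checksum(conn, table, lo, hi):
    """Hash all rows with lo <= id < hi (hypothetical schema: id, value)."""
    rows = conn.execute(
        f"SELECT id, value FROM {table} WHERE id >= ? AND id < ? ORDER BY id",
        (lo, hi),
    ).fetchall()
    return hashlib.md5(repr(rows).encode()).hexdigest()


def fetch_rows(conn, table, lo, hi):
    """Return {id: value} for rows with lo <= id < hi."""
    return dict(
        conn.execute(
            f"SELECT id, value FROM {table} WHERE id >= ? AND id < ?", (lo, hi)
        )
    )


def diff_by_checksum(conn_a, conn_b, table, lo, hi, threshold=4):
    """Return the ids in [lo, hi) whose rows differ between the two tables."""
    if segment_checksum(conn_a, table, lo, hi) == segment_checksum(conn_b, table, lo, hi):
        return []  # checksums match, so skip this whole segment
    if hi - lo <= threshold:
        # Segment is small enough to compare row by row.
        rows_a = fetch_rows(conn_a, table, lo, hi)
        rows_b = fetch_rows(conn_b, table, lo, hi)
        return sorted(
            k for k in rows_a.keys() | rows_b.keys() if rows_a.get(k) != rows_b.get(k)
        )
    # Checksums differ: split the key range in half and recurse into each half.
    mid = (lo + hi) // 2
    return diff_by_checksum(conn_a, conn_b, table, lo, mid, threshold) + diff_by_checksum(
        conn_a, conn_b, table, mid, hi, threshold
    )


def make_db(rows):
    """Build an in-memory SQLite table 't' holding the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, value TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    return conn


source = make_db([(i, f"row-{i}") for i in range(100)])
target_rows = [(i, f"row-{i}") for i in range(100) if i != 42]  # row 42 missing
target_rows[10] = (10, "changed")  # row 10 modified
target = make_db(target_rows)

differing = diff_by_checksum(source, target, "t", 0, 100)
print(differing)  # [10, 42]
```

The appeal of this approach is that matching segments are dismissed with a single checksum comparison, so rows are only fetched for the small ranges that actually differ.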

I did data engineering for 6 years and am building a company to automate SQL validation for dbt users.

First, by “testing SQL pipelines”, I assume you mean testing changes to SQL code as part of the development workflow? (vs. monitoring pipelines in production for failures / anomalies).

If so:

1 – assertions. dbt comes with a solid built-in testing framework [1] for expressing assertions such as “this column should have values in the list [A,B,C]” as well as checking referential integrity, uniqueness, nulls, etc. There are more advanced packages on top of dbt tests [2]. The problem with assertion testing in general, though, is that for a moderately complex data pipeline it’s infeasible to achieve test coverage that would catch most possible failure scenarios.

2 – data diff: for every change to SQL, know exactly how the code change affects the output data by comparing the data in dev/staging (built off the dev branch code) with the data in production (built off the main branch). We built an open-source tool for that: https://github.com/datafold/data-diff, and we are adding an integration with dbt soon, which will make diffing part of the dbt development workflow one command away [3].

We make money by selling a Cloud solution for teams that integrates data diff into GitHub/GitLab CI and automatically diffs every pull request to tell you how a change to SQL affects the target table you changed, downstream tables, and dependent BI tools (video demo: [4]).

I’ve also written about why reliable change management is so important for data engineering and which key best practices to implement [5].

[1] https://docs.getdbt.com/docs/build/tests
[2] https://github.com/calogica/dbt-expectations
[3] https://github.com/datafold/data-diff/pull/364
[4] https://www.datafold.com/dbt
[5] https://www.datafold.com/blog/the-day-you-stopped-breaking-y...
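
To make the contrast between the two approaches in the comment above concrete, here is a rough sketch using in-memory SQLite. It uses neither dbt nor data-diff: the `orders_prod`/`orders_dev` tables, their columns, and the helpers `failing_assertions` and `diff_prod_vs_dev` are all hypothetical. The assertions mirror the kinds of checks mentioned (not null, unique, accepted values), and the diff compares a dev build against production by primary key.

```python
import sqlite3


def failing_assertions(conn, table):
    """Run a few dbt-style assertions; return the names of those that fail."""
    checks = {
        "id_not_null": f"SELECT COUNT(*) FROM {table} WHERE id IS NULL",
        "id_unique": (
            f"SELECT COUNT(*) FROM "
            f"(SELECT id FROM {table} GROUP BY id HAVING COUNT(*) > 1)"
        ),
        "status_accepted_values": (
            f"SELECT COUNT(*) FROM {table} WHERE status NOT IN ('A', 'B', 'C')"
        ),
    }
    # Each query counts violating rows; a non-zero count means the check fails.
    return [name for name, sql in checks.items() if conn.execute(sql).fetchone()[0] > 0]


def diff_prod_vs_dev(conn, prod_table, dev_table):
    """Compare two tables by primary key; report added, removed, changed ids."""
    query = "SELECT id, status, amount FROM {}"
    prod = {row[0]: row for row in conn.execute(query.format(prod_table))}
    dev = {row[0]: row for row in conn.execute(query.format(dev_table))}
    return {
        "added": sorted(dev.keys() - prod.keys()),
        "removed": sorted(prod.keys() - dev.keys()),
        "changed": sorted(k for k in prod.keys() & dev.keys() if prod[k] != dev[k]),
    }


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders_prod (id INTEGER, status TEXT, amount REAL);
    CREATE TABLE orders_dev  (id INTEGER, status TEXT, amount REAL);
    INSERT INTO orders_prod VALUES (1, 'A', 10.0), (2, 'B', 20.0), (3, 'C', 30.0);
    INSERT INTO orders_dev  VALUES (1, 'A', 10.0), (2, 'B', 25.0), (4, 'C', 40.0);
    """
)

failed = failing_assertions(conn, "orders_dev")
report = diff_prod_vs_dev(conn, "orders_prod", "orders_dev")
print(failed)  # []
print(report)  # {'added': [4], 'removed': [3], 'changed': [2]}
```

In this toy data every assertion passes on the dev table, yet the diff shows one changed, one added, and one removed row relative to production, which is the kind of gap in assertion coverage the comment describes.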

Looks useful! Do you have a way to validate that the data was copied correctly and entirely? If not, you might want to consider integrating data-diff for that - https://github.com/datafold/data-diff