To me, data pipelines are about moving data from one place to another, and that's not really covered by the SQL spec. The article doesn't specify what constitutes a "data engineering pipeline", but for me there are three components to it:

- orchestration, or what happens when

- data manipulation, aka the T in ETL

- data movement, getting data from A to B

Orchestration? Please do not use dynamic SQL; you're digging a hole you'll never get out of. Moreover, SQL has zero support for time-based triggers, so you'll need a secondary system anyway.
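As a rough sketch of what that secondary system tends to look like (assuming Apache Airflow 2.4+ with the common-sql provider; the DAG name, connection and procedure are made up), the time trigger lives outside the database and the SQL it fires stays static:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

    with DAG(
        dag_id="nightly_refresh",         # hypothetical pipeline name
        start_date=datetime(2024, 1, 1),
        schedule="0 2 * * *",             # the time-based trigger SQL itself doesn't have
        catchup=False,
    ):
        SQLExecuteQueryOperator(
            task_id="refresh_reporting_table",
            conn_id="warehouse",                    # hypothetical connection
            sql="CALL refresh_reporting_table();",  # plain, static SQL -- no dynamic SQL needed
        )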

Manipulating the data? Sure, use SQL, as long as you're operating on data in one database. Trying to collate data from multiple databases can run into unpredictable performance issues even in the engines that support it.

Moving data? There are much better options. And the same caveat as for orchestration applies here: every RDBMS vendor has its own idea of what external queries should look like (openrowset, external tables, linked servers), each with its own performance considerations.
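To make that concrete, a rough sketch of moving data outside the engine instead of via openrowset/linked servers, assuming pandas + SQLAlchemy (connection strings and table names here are placeholders):

    import pandas as pd
    from sqlalchemy import create_engine

    source = create_engine("postgresql://user:pass@source-host/app")   # hypothetical
    target = create_engine("postgresql://user:pass@target-host/dwh")   # hypothetical

    # Stream in chunks so large tables never have to fit in memory at once.
    for chunk in pd.read_sql("SELECT * FROM orders", source, chunksize=50_000):
        chunk.to_sql("orders_raw", target, if_exists="append", index=False)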

Even on the manipulation side I tend to disfavor SQL.

Even though it's improved a lot, the testing situation in SQL is still not where it is in a more traditional programming language. Modularity comes either in the form of ever more deeply nested CTEs, stored procs, or dbt-style templates. And SQL types are wholly dependent on the SQL engine, which can lead to wacky transforms.
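For what it's worth, when I do want to test a SQL transform in isolation, the harness ends up being written in another language anyway. A minimal sketch with pytest and DuckDB as a stand-in engine (the transform and table are made up):

    import duckdb

    TRANSFORM = """
        SELECT customer_id, SUM(amount) AS total_spend
        FROM orders
        GROUP BY customer_id
    """

    def test_total_spend_sums_per_customer():
        con = duckdb.connect()  # in-memory database, fresh for each test
        con.execute("CREATE TABLE orders (customer_id INT, amount DOUBLE)")
        con.execute("INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (2, 3.0)")

        rows = con.execute(TRANSFORM + " ORDER BY customer_id").fetchall()

        assert rows == [(1, 15.0), (2, 3.0)]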

SQL is great for ad hoc analysis. If something isn't ad hoc, there is almost always a better tool.

Agreed. I would also point out maintainability. How do you test some SQL logic in isolation?

Additionally, in this day and age where enriching data by running it through some ML model isn't that rare, doing it in SQL by exposing the model through an API and invoking some UDF on a per-row basis is extremely inefficient due to network RTT. In my opinion it is much better to use something like Apache Beam, load the model in memory on your workers, and run predictions "locally" on batches of data at a time.
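Roughly this pattern, as a sketch against the Beam Python SDK (the model loader, paths and element shape are placeholders I made up): load the model once per worker in DoFn.setup() and predict on batches instead of paying a network round trip per row.

    import json

    import apache_beam as beam
    from apache_beam.transforms.util import BatchElements

    class PredictBatch(beam.DoFn):
        def setup(self):
            # Runs once per worker: keep the model in memory, no per-row RTT.
            self.model = load_model("gs://my-bucket/model")  # hypothetical loader

        def process(self, batch):
            # `batch` is a list of elements produced by BatchElements.
            for element, prediction in zip(batch, self.model.predict(batch)):
                yield {**element, "prediction": prediction}

    with beam.Pipeline() as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("gs://my-bucket/events.jsonl")  # placeholder source
            | "Parse" >> beam.Map(json.loads)
            | "Batch" >> BatchElements(min_batch_size=32, max_batch_size=512)
            | "Predict" >> beam.ParDo(PredictBatch())
            | "Write" >> beam.Map(print)
        )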

On the other hand I see the value in expressing "simple" logic in SQL, especially when joining a series of tabular sources. That's why I am super happy with Apache Beam's SQL extensions (https://beam.apache.org/releases/pydoc/2.30.0/apache_beam.tr...), which, IMHO, have the benefits of both worlds.
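For reference, a small sketch of what that looks like with SqlTransform from the module linked above (it goes through Beam's cross-language SQL transform, so it needs Java available at expansion time; the schema and data here are made up):

    import typing

    import apache_beam as beam
    from apache_beam.transforms.sql import SqlTransform

    class Order(typing.NamedTuple):
        customer_id: int
        amount: float

    beam.coders.registry.register_coder(Order, beam.coders.RowCoder)

    with beam.Pipeline() as p:
        (
            p
            | beam.Create([Order(1, 10.0), Order(1, 5.0), Order(2, 3.0)]).with_output_types(Order)
            | SqlTransform(
                "SELECT customer_id, SUM(amount) AS total FROM PCOLLECTION GROUP BY customer_id")
            | beam.Map(print)
        )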

> How do you test some SQL logic in isolation?

I do this using SQL:

1. Extract the logic into an 'ephemeral model' in a separate model file.

2. Mock out this model in the upstream model's unit tests (https://github.com/EqualExperts/dbt-unit-testing).

3. Write unit tests for this model.

This is no different from regular software development in a language like Java.

I would argue it's even better, because the unit tests are always in tabular format and pretty easy to understand. Java unit tests, on the other hand, are never read by devs in practice.

> in this day and age where enriching the data by running it through some ML model isn't that rare,

Still pretty rare. This constitutes a very minor percentage of ETL in a typical enterprise.