Bioinformatics file formats are broken, but it's better to try to understand why they are broken before rushing to fix them.

One key problem is that technology changes quickly. There are always new instruments generating new kinds of data with new properties and new features. People are using that data in new applications.

Software lags many years behind the state of the art. First you need to figure out the exact problem the software is supposed to solve. Then you have to solve the problem and turn the prototype into a useful tool. This work is mostly done by researchers who may know only a little about software engineering. By the time the situation is stable enough that software engineers who are not active researchers in the field could be useful, it's often too late to change the file formats: there is already too much legacy data and too many tools supporting the established formats.

Another key problem is that the "broken" file formats are often good enough. When you have tabular data whose fields can reasonably be understood as text, a simple TSV-based format often gets the job done, especially if individual datasets are only tens of gigabytes. By using a custom format, you avoid having to choose among the many existing formats that all have their own issues, and that often guarantee version conflicts and breaking changes in the future.
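To make that concrete, here's a minimal sketch of reading such a TSV with nothing but the standard library; the column names and values are made up for illustration.

```python
import csv
import io

# Hypothetical expression table in a simple TSV format (made-up fields).
tsv = (
    "gene\tchrom\texpression\n"
    "BRCA1\tchr17\t12.5\n"
    "TP53\tchr17\t8.1\n"
)

# csv.DictReader with a tab delimiter is all the "parser" you need.
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
```

No schema registry, no version negotiation: the header row is the schema, and any tool that can split on tabs can read the data.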

Also, when it comes to parallelization, it's hard to beat running many independent jobs in parallel. While computers are getting bigger, individual problems are often not, as the underlying biological problems remain the same. In the work I do, a reasonable target system had 32 cores and 256 GB memory in 2015. That's still a reasonable target in 2022. The computers I use have become cheaper and faster, but they have not really changed.
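As a sketch of that style of parallelism, you can just launch one process per unit of work and let the OS scheduler do the rest; here `echo` stands in for a real per-chromosome analysis command, which is an assumption for illustration.

```python
import subprocess

# One fully independent job per chromosome, running concurrently.
# `echo` is a placeholder for a real per-chromosome tool.
chroms = ["chr1", "chr2", "chr3"]
procs = [subprocess.Popen(["echo", c], stdout=subprocess.PIPE) for c in chroms]

# Collect each job's output once it finishes.
outputs = [p.communicate()[0].decode().strip() for p in procs]
```

The jobs share nothing, so there is no synchronization to get wrong and no scaling bottleneck beyond the number of cores.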

So what prevented the use of HDF or NetCDF in bioinformatics? I mean, these are not new formats by any definition of new, and I disagree that the "broken" formats are often good enough. It seems more that the bio/med fields (probably/hopefully not bioinformatics) are still used to processing data in Excel (speaking from my limited interaction with mainly med researchers).

Relative obscurity, most likely. If the formats are not used in bioinformatics, people developing bioinformatics tools are usually not familiar with them. And if developers are not familiar with the formats, they can't make informed decisions about using them. Yet another consequence of researcher-driven software development.

We presented using Parquet formats for bioinformatics 2012/13-ish at the Bioinformatics Open Source Conference (BOSC) and got laughed out of the place.

While using Apache Spark for bioinformatics [0] never really took off, I still think Parquet formats for bioinformatics [1] is a good idea, especially with DuckDB, Apache Arrow, etc. supporting Parquet out of the box.

0 - https://github.com/bigdatagenomics/adam

1 - https://github.com/bigdatagenomics/bdg-formats