The hardest part of genomics for me has honestly been figuring out which poorly maintained open source tool I should use for a particular problem, which options it should be run with, and how the data need to be preprocessed beforehand.

I mean has anyone ever actually read the documentation of the GATK? It is famously dreadful. And that's professionally maintained.

Honestly a nice addition here would be a "so you want to" with snippets of raw FASTQ or VCF data and working code for various operations, maybe with an accompanying Docker container.
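To give a feel for what one such "so you want to" entry might look like, here is a rough sketch for "so you want to get the mean base quality of each read in a FASTQ file". It uses only the Python standard library. The sample reads are made up, the function names are my own, and it assumes the simple 4-lines-per-record layout with Phred+33 quality encoding, which not every file in the wild follows.

```python
from io import StringIO

# Made-up sample data: each FASTQ record is 4 lines
# (header, sequence, '+' separator, quality string).
SAMPLE_FASTQ = """@read1
ACGT
+
IIII
@read2
GGCA
+
!!II
"""


def parse_fastq(handle):
    """Yield (name, sequence, quality) tuples from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a trailing blank line)
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:], seq, qual


def mean_phred(qual, offset=33):
    """Mean Phred score of a quality string, assuming Phred+33 encoding."""
    if not qual:
        return 0.0
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def mean_qualities(text):
    """Map each read name to its mean base quality."""
    return {name: mean_phred(qual) for name, _, qual in parse_fastq(StringIO(text))}


if __name__ == "__main__":
    for name, score in mean_qualities(SAMPLE_FASTQ).items():
        print(name, score)
```

For real work you would probably reach for an established parser instead of rolling your own, but a dozen of these small, runnable examples next to real data would already beat most tool documentation.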

I feel like ADAM (https://github.com/bigdatagenomics/adam) is a huge step in the right direction. You convert from a standard genomics format to Parquet and then work with the resulting data in Spark with genomics-specific libraries.
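To be clear about what "convert to a table" means here, this is a toy sketch in plain Python that flattens one VCF data line into one row per alternate allele, which is roughly the shape you would hand to a columnar writer. It is not ADAM's API or schema: the field names, the function name, and the sample line are all my own invention, and it only handles the eight fixed VCF columns.

```python
# Made-up VCF data line with the eight fixed columns:
# CHROM POS ID REF ALT QUAL FILTER INFO
VCF_LINE = "chr1\t12345\trs99\tA\tG,T\t50\tPASS\tDP=14;AF=0.5"


def vcf_line_to_rows(line):
    """Flatten one VCF data line into a list of dicts, one per ALT allele.

    Hypothetical helper for illustration; the row layout is an assumption,
    not the schema any particular tool uses.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alts, qual, flt, info = fields[:8]

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    info_map = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_map[key] = value if value else True

    rows = []
    for alt in alts.split(","):
        rows.append({
            "chrom": chrom,
            "pos": int(pos),
            "id": None if vid == "." else vid,
            "ref": ref,
            "alt": alt,
            "qual": None if qual == "." else float(qual),
            "filter": flt,
            "info": info_map,
        })
    return rows


if __name__ == "__main__":
    for row in vcf_line_to_rows(VCF_LINE):
        print(row)
```

Once the data are in flat rows like this, filtering and aggregating become ordinary table operations, which is the whole appeal of the approach.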

In my experience, translating domain data into Spark gives a 100X improvement in data analysis.