What does HackerNews think of adam?

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

Language: Scala

Rankings: #27 in Java, #124 in Python, #4 in Scala
We presented on using Parquet formats for bioinformatics at the Bioinformatics Open Source Conference (BOSC) around 2012/13 and got laughed out of the place.

While Apache Spark for bioinformatics [0] never really took off, I still think Parquet formats for bioinformatics [1] are a good idea, especially now that DuckDB, Apache Arrow, etc. support Parquet out of the box (see the sketch after the links below).

0 - https://github.com/bigdatagenomics/adam

1 - https://github.com/bigdatagenomics/bdg-formats
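
To make the "Parquet out of the box" point concrete, here is a minimal sketch of querying a Parquet file of aligned reads with plain Spark SQL in Scala. The file path and the column names (contigName, mapq) are hypothetical stand-ins rather than the actual bdg-formats schema, and DuckDB or Arrow could read the same file without Spark at all.

    // Minimal sketch: any Parquet-aware engine can query the same file.
    // File path and column names are illustrative, not the bdg-formats schema.
    import org.apache.spark.sql.SparkSession

    object ParquetReadsQuery {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("parquet-reads-query")
          .master("local[*]")
          .getOrCreate()

        // Read a (hypothetical) Parquet file of aligned reads.
        val reads = spark.read.parquet("alignments.parquet")
        reads.createOrReplaceTempView("reads")

        // Count well-mapped reads per contig with ordinary SQL.
        spark.sql(
          """SELECT contigName, COUNT(*) AS n
            |FROM reads
            |WHERE mapq >= 30
            |GROUP BY contigName
            |ORDER BY n DESC""".stripMargin
        ).show()

        spark.stop()
      }
    }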

We're here, still plugging along.

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

https://github.com/bigdatagenomics/adam

Hello! I'm wondering if you came across our suite of libraries and tools for doing Genomics on Spark?

https://github.com/bigdatagenomics/adam

We're no longer the first Google hit these days, but we're still quite relevant (e.g. Databricks' commercial offering uses ADAM under the hood). Drop in on our Gitter some time!

I feel like ADAM (https://github.com/bigdatagenomics/adam) is a huge step in the right direction. You convert from standard genomics formats to Parquet and then work with the resulting data in Spark using genomics-specific libraries.

In my experience, translating domain data into Spark gives something like a 100x improvement in data analysis.
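
As a rough sketch of that convert-then-analyze flow, assuming ADAM's Scala API: names like ADAMContext, loadAlignments, and saveAsParquet are used here from memory and may differ between ADAM versions, so treat this as illustrative rather than copy-paste ready.

    // Illustrative only: convert a BAM into ADAM's Parquet representation,
    // then analyze it with plain Spark. Check the ADAM docs for your version.
    import org.apache.spark.sql.SparkSession
    // Implicit loaders; the package path has changed across ADAM releases.
    import org.bdgenomics.adam.rdd.ADAMContext._

    object BamToParquet {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("bam-to-parquet")
          .master("local[*]")
          .getOrCreate()
        val sc = spark.sparkContext

        // Convert a standard format (BAM/SAM/CRAM) into Parquet on disk.
        val alignments = sc.loadAlignments("sample.bam")
        alignments.saveAsParquet("sample.alignments.adam")

        // From here, plain Spark (or any Parquet reader) can work with the data.
        val df = spark.read.parquet("sample.alignments.adam")
        println(s"aligned records: ${df.count()}")

        spark.stop()
      }
    }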

At the UC Berkeley AMPLab we're working on scaling genomics [0], all open source under the Apache 2 license. More generally, any of the Open Bioinformatics Foundation (OBF) [1] projects could use a hand; open source licenses vary.

[0] - https://github.com/bigdatagenomics/adam

[1] - https://www.open-bio.org/wiki/Main_Page

1) Join a lab as a programmer (e.g., the National Weather Service (climate) or the J. Craig Venter Institute (infectious diseases))

2) Read the computational papers on a subject you're interested in; replicate the results in the papers and open source your software/pipeline; apply the method to a newer dataset.

3) Contribute to an open source informatics toolchain used for the subject (e.g., https://github.com/bigdatagenomics/adam)

Agreed.

Hope you don't mind a plug here for ADAM, a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet.

https://github.com/bigdatagenomics/adam

Yes indeed. And new tools are being written; see the ADAM project for an interesting example (https://github.com/bigdatagenomics/adam) and its associated variant caller Avocado (https://github.com/bigdatagenomics/avocado). Others are also trying to get the old tools working on Hadoop, for instance Halvade (https://github.com/ddcap/halvade/wiki/Halvade-Manual), Hadoop-BAM (https://github.com/HadoopGenomics/Hadoop-BAM), SeqPig (http://seqpig.sourceforge.net/), and the folks at BioBankCloud (https://github.com/biobankcloud). It's going to take quite a while for this stuff to get fleshed out and for researchers to adopt it, but the sheer weight of data is going to force things in the Hadoop direction eventually. It is inevitable.