What does HackerNews think of htslib?
C library for high-throughput sequencing data formats
Language: C
There is another nice multi-core, gzip-based library called BGZF[1]. It is commonly used in bioinformatics. BGZF has the added advantage that it is block-compressed, with a built-in indexing method that permits seeking within compressed files.
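To make the block-compression idea concrete, here is a minimal sketch in plain Python of how a BGZF-style file can be built and read at random offsets. It is not htslib's implementation: the function names, the fixed-size block index, and the `BLOCK_PAYLOAD` constant are assumptions made for illustration, and real BGZF tooling additionally writes an empty end-of-file block and addresses data with virtual offsets. The header layout (a gzip member with a `BC` extra subfield holding the block size) follows the BGZF description in the SAM specification.

```python
import gzip
import struct
import zlib

BLOCK_PAYLOAD = 0xFF00  # uncompressed bytes per block (assumed cap for this sketch)


def bgzf_block(payload: bytes) -> bytes:
    """Wrap payload in one gzip member carrying the BGZF 'BC' extra subfield."""
    deflater = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate, no zlib header
    cdata = deflater.compress(payload) + deflater.flush()
    total = 18 + len(cdata) + 8  # 18-byte header + data + CRC32/ISIZE trailer
    header = struct.pack(
        "<BBBBIBBHBBHH",
        0x1F, 0x8B, 8, 4,        # gzip magic, CM = deflate, FLG = FEXTRA
        0, 0, 0xFF,              # MTIME, XFL, OS = unknown
        6,                       # XLEN: length of the extra field
        ord("B"), ord("C"), 2,   # subfield id 'BC', subfield length
        total - 1,               # BSIZE: total block size minus one
    )
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + cdata + trailer


def bgzf_compress(data: bytes):
    """Return (compressed bytes, index); index[i] is the file offset of block i."""
    out, index = bytearray(), []
    for start in range(0, len(data), BLOCK_PAYLOAD):
        index.append(len(out))
        out += bgzf_block(data[start:start + BLOCK_PAYLOAD])
    return bytes(out), index


def read_at(blob: bytes, index, pos: int, n: int) -> bytes:
    """Read n uncompressed bytes at offset pos, inflating only the blocks needed."""
    result = bytearray()
    while n > 0:
        block_no, within = divmod(pos, BLOCK_PAYLOAD)
        if block_no >= len(index):
            break  # past the end of the data
        off = index[block_no]
        bsize = struct.unpack_from("<H", blob, off + 16)[0] + 1
        raw = zlib.decompress(blob[off + 18:off + bsize - 8], -15)
        chunk = raw[within:within + n]
        if not chunk:
            break
        result += chunk
        pos += len(chunk)
        n -= len(chunk)
    return bytes(result)


data = bytes(range(256)) * 1000
blob, index = bgzf_compress(data)
# Every block is a valid gzip member, so ordinary gzip readers still work.
assert gzip.decompress(blob) == data
```

Because each block is compressed independently, blocks can be produced on separate cores, and a reader only has to inflate the block or blocks that cover the requested range.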
Fascinating, but htslib [0] bindings already exist for Python (and many other languages). htslib truly is the standard library for high-throughput sequencing data file access, and with high-level bindings we can already write something like:
```
for seq in bamfile:
    print(seq.pos)
```
or whatever.
On the practical side, if you're working at a low level with NGS data, htslib[1] may be worth looking into. It is a C library for reading, writing, and manipulating data structures that are commonly used in NGS (BAM, VCF, etc.). I have used it and can attest to its quality. However, as is the case with all software related to genomics, its only documentation is its header files and example programs. Here is the very example I used to get started[2]. The comments in the header files are usually good enough.
The reason I'm recommending it is the quality of its interfaces. It can seamlessly handle (input or output) virtually any kind of file you throw at it (SAM, BAM, CRAM). I can't say the same for a lot of other software I have run into in this space.
[1]: https://github.com/samtools/htslib
[2]: https://gist.github.com/PoisonAlien/350677acc03b2fbf98aa
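One reason a single reader can accept SAM, BAM, or CRAM interchangeably is that the three formats can be told apart from their first bytes. The sketch below illustrates that idea in plain Python; it is a simplified, hypothetical helper (`sniff_format` is not an htslib function) and makes no claim about how htslib performs detection internally. It relies only on the published magic numbers: CRAM files start with `CRAM`, BAM is a gzip/BGZF stream whose decompressed payload starts with `BAM\x01`, and SAM headers are text lines starting with `@` (a headerless SAM file would not be recognized here).

```python
import gzip
import zlib


def sniff_format(head: bytes) -> str:
    """Guess the alignment format from the first bytes of a file (illustrative only)."""
    if head.startswith(b"CRAM"):
        return "CRAM"
    if head.startswith(b"\x1f\x8b"):
        # gzip/BGZF container: inflate just enough to see the payload magic.
        inflater = zlib.decompressobj(wbits=31)  # 31 = expect a gzip header
        try:
            start = inflater.decompress(head, 4)
        except zlib.error:
            return "unknown"
        return "BAM" if start == b"BAM\x01" else "gzip-compressed (not BAM)"
    if head.startswith(b"@"):
        return "SAM"
    return "unknown"


# Synthetic inputs standing in for the first bytes of real files.
print(sniff_format(gzip.compress(b"BAM\x01" + b"\x00" * 32)))
print(sniff_format(b"@HD\tVN:1.6\tSO:coordinate\n"))
```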