Genozip: A universal extensible genomic data compressor

Divon Lan*, Ray Tobler, Yassine Souilmi*, Bastien Llamas*

*Corresponding author for this work

    Research output: Contribution to journal › Article › peer-review

    24 Citations (Scopus)

    Abstract

    We present Genozip, a universal and fully featured compression software for genomic data. Genozip is designed to be a general-purpose software and a development framework for genomic compression by providing five core capabilities: universality (support for all common genomic file formats), high compression ratios, speed, feature-richness and extensibility. Genozip delivers high-performance compression for data formats widely used in genomics research, namely FASTQ, SAM/BAM/CRAM, VCF, GVF, FASTA, PHYLIP and 23andMe formats. Our test results show that Genozip is fast and achieves greatly improved compression ratios, even when the files are already compressed. Further, Genozip is architected with a separation of the Genozip Framework from file-format-specific Segmenters and data-type-specific Codecs. With this, we intend for Genozip to be a general-purpose compression platform where researchers can implement compression for additional file formats, as well as new codecs for data types or fields within files, in the future. We anticipate that this will ultimately increase the visibility and adoption of these algorithms by the user community, thereby accelerating further innovation in this space.

    Original language: English
    Pages (from-to): 2225-2230
    Number of pages: 6
    Journal: Bioinformatics
    Volume: 37
    Issue number: 16
    Publication status: Published - 15 Aug 2021
