fastqz

fastqz is a compressor for FASTQ files, the output format of DNA sequencing machines. It is one of the compressors described in the paper:
Bonfield JK, Mahoney MV (2013) Compression of FASTQ and SAM Format Sequencing Data. PLoS ONE 8(3): e59190. doi:10.1371/journal.pone.0059190

A FASTQ file contains a list of DNA strings and quality scores. This program compresses only the Sanger variant, which is probably the most common one.

The program is available as source code only. You will need libzpaq to compile it. For Windows you will also need Pthreads-Win32. See the source code comments for instructions on how to compile each program and for a description of the compression algorithm and compressed format.

fastqz is free software available under the BSD 2-clause license. It was written by Matt Mahoney. Email: mattmahoneyfl at gmail dot com.

Usage

fastqz command input output [reference]

The command is one of c, d, e, or f:

c compresses input to 3 files: output.fxh.zpaq, output.fxb.zpaq, output.fxq.zpaq. The input should be in Sanger FASTQ format. Requires 1.5 GB memory.

d decompresses input.fxh.zpaq, input.fxb.zpaq, input.fxq.zpaq to output. Requires 1.5 GB memory.

e encodes input to 3 files: output.fxh, output.fxb, output.fxq. Encoding is faster than compressing and does not need much memory, but the output is larger. The input should be in Sanger FASTQ format.

f decodes input.fxh, input.fxb, input.fxq to output.
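The file naming used by the four commands can be summarized in a small helper, useful for example in a script that checks all parts are present before decompressing. This is a sketch only. fastqzFileNames is a hypothetical function, not part of fastqz, and it covers only the three files named above.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// List the three files written by c/e (or read by d/f) for a base name.
// c and d use the .zpaq suffix; e and f do not.
// Hypothetical helper: files produced when a reference is given are not listed.
std::vector<std::string> fastqzFileNames(char command, const std::string& base) {
    std::string suffix;
    if (command == 'c' || command == 'd') {
        suffix = ".zpaq";
    } else if (command != 'e' && command != 'f') {
        throw std::invalid_argument("command must be c, d, e, or f");
    }
    return { base + ".fxh" + suffix,
             base + ".fxb" + suffix,
             base + ".fxq" + suffix };
}
```

With this helper, fastqzFileNames('c', "out") lists out.fxh.zpaq, out.fxb.zpaq and out.fxq.zpaq, matching the output of "fastqz c input out".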

You can give a reference genome to improve compression. The same reference must be given to decompress. You can use the fapack or fapacks program to convert FASTA files to the required format. When using a reference, there is a fourth compressed file (.fxa) of alignments.

You can also quantize the quality values for lossy but better compression. Use commands "cQ" or "eQ" where Q is the quantization step. Larger values compress better but lose more information. The default is c1 or e1, which is lossless.
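To show why a larger step loses more information, here is a sketch of a generic uniform quantizer that rounds each Phred score to the nearest multiple of the step. This is an assumption made for illustration. The exact rule fastqz applies is described in its source code comments and may differ. quantizeQuality is a hypothetical name.

```cpp
#include <string>

// Hypothetical uniform quantizer, NOT the rule from the fastqz source.
// Rounds each Sanger Phred score to the nearest multiple of step.
// step == 1 returns the input unchanged (lossless).
std::string quantizeQuality(const std::string& quality, int step) {
    if (step < 1) step = 1;
    std::string out = quality;
    for (char& c : out) {
        int q = static_cast<unsigned char>(c) - 33;  // Sanger offset
        if (q < 0) continue;                         // not a quality character
        q = (q + step / 2) / step * step;            // nearest multiple of step
        if (q > 93) q -= step;                       // stay within printable ASCII
        c = static_cast<char>(q + 33);
    }
    return out;
}
```

With a step of 10, scores 15 and 20 both become 20, so the quality string has fewer distinct symbols and is easier to compress, but the original values cannot be recovered.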

Downloads

fastqz10.cpp v1.0 (Mar. 8, 2012).
fastqz11.cpp v1.1 (Mar. 9, 2012).
fastqz12.cpp v1.2 (Mar. 12, 2012).
fastqz13.cpp v1.3 (Mar. 14, 2012).
fastqz14.cpp v1.4 (Mar. 15, 2012).
fastqz15.cpp v1.5 (Mar. 15, 2012).
readme.txt.

fapack is a program for packing FASTA files 4 bases (A,C,G,T) per byte. fapacks also accepts lowercase (a,c,g,t), which is used to indicate repeats. fapacks produces a larger reference genome but generally better compression.

fapack.cpp v1.0 (Mar. 12, 2012).
fapacks.cpp v1.0 (Mar. 13, 2012).
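Packing 4 bases per byte means storing each base in 2 bits. The sketch below shows the idea. The code assignment (A=0, C=1, G=2, T=3), the bit order, the zero padding of the last byte, and the skipping of other characters are all assumptions made for this example. The actual layout is defined in fapack.cpp. packBases and baseCode are hypothetical names, and the input is assumed to be sequence text without FASTA header lines.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical 2-bit code for a base; returns -1 for any other character.
int baseCode(char b) {
    switch (b) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Pack bases four per byte, with the first base in the two high bits.
// Characters other than A, C, G, T are skipped. A final partial byte
// is padded with zero bits on the right.
std::vector<std::uint8_t> packBases(const std::string& bases) {
    std::vector<std::uint8_t> out;
    unsigned acc = 0;  // bits collected for the current byte
    int n = 0;         // number of bases in acc
    for (char b : bases) {
        int code = baseCode(b);
        if (code < 0) continue;
        acc = (acc << 2) | static_cast<unsigned>(code);
        if (++n == 4) {
            out.push_back(static_cast<std::uint8_t>(acc));
            acc = 0;
            n = 0;
        }
    }
    if (n > 0) out.push_back(static_cast<std::uint8_t>(acc << (2 * (4 - n))));
    return out;
}
```

Under these assumptions "ACGT" packs to the single byte 0x1B (binary 00 01 10 11), a quarter of the size of the text form.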