NAME

zpaq - Archiver and compression algorithm development tool


SYNOPSIS

    zpaq [-I<options>] I<commmand> I<archive>[C<.zpaq>] [I<files>...]
    zpaq [-I<options>] r [I<input> [I<output>]]
    zpaq [-I<options>] t [I<arg>...]


DESCRIPTION

zpaq is a program to create, update, extract, and list archives compressed in ZPAQ level 1 format, and a tool for developing new compression algorithms. ZPAQ is based on the PAQ context-mixing compression algorithm, which is top ranked by compression ratio on many benchmarks. Unlike PAQ, which has over 160 incompatible versions, ZPAQ archives include a description of the decompression algorithm to maintain compatibility with older decompressers when better algorithms are developed. The ZPAQ format is open, requires no license, and is supported by a specification and reference decoder (see availability).

zpaq supports incremental update and restore. Before compressing or extracting a file, it computes the SHA-1 checksum of the external file and compares it with the stored checksum. If the files match, then the operation is skipped. To support huge archives, internal files are deleted in place.

zpaq has 4 built in compression levels to let you select between better speed or better compression ratio. It also accepts custom compression algorithms described by configuration files and optional external preprocessing programs. The extra files are not needed to decompress. Compression algorithms are described in a language called ZPAQL. zpaq includes devlopment tools to run, trace, and debug ZPAQL code, print model compression statistics, and write byte code. In addition, there are advanced options to skip saving error recovery tags, filenames, checksums, and file sizes to squeeze out the last bit of compression. Such archives cannot be updated incrementally.

zpaq supports block level parallel compression using all available processors. A block can be a file, part of a file, or multiple files (solid archive). Generally, larger blocks compress better. Solid archives cannot be updated incrementally.

zpaq internally translates ZPAQL bytecode into equivalent x86-32 or x86-64 code, making it about twice as fast as the reference decoder, even on a single thread.


COMMANDS

For all commands operating on archives (a, c, d, l, u, x), the archive is assumed to have a .zpaq extension if not specified.

a

Add files. For each file, if the internal and external files both exist and are different, then the internal file is deleted (maintaining the order of the other internal files) and a compressed copy of the external file is appended to the end of the archive. If the files are identical, then no action is taken. If the internal file does not exist, then the external file is appended. If the external file does not exist, then the internal file is deleted. If neither file exists, then no action is taken.

Files are saved with a filename as specified on the command line, a comment as a decimal string, and a SHA-1 checksum. If a file is split into multiple blocks, then each block has its own size and checksum, and subsequent filenames are empty. Each block includes a 13 byte header locator tag for error recovery and a description of the decompression algorithm.

Update fails for solid archives (created with -bs) if any internal file to be replaced is followed in the same block by a file not to be replaced. In this case, the archive remains unchanged.

Files are compared by computing SHA-1 checksums of the external file and comparing with the stored checksums of the corresponding segments of the internal file. If this information is missing, then the files are assumed to be different.

c

Create archive and compress. With one exception, c is equivalent to deleting the archive and adding files with a. The exception is that if no files are given as arguments, then archive is compressed to archive.zpaq and the filename is not saved in the archive. When extracted with x, these files would be decompressed by dropping the .zpaq extension.

d

Delete internal files that match names exactly. The command fails for solid archives (created with -bs) if any files to be deleted are followed by other files in the same block that are not to be deleted. If an archive contains more than one file with a matching name then all are deleted.

l

List archive content. There are no file arguments. A listing shows for each segment, the memory needed to decompress in MB (millions of bytes), original and compressed sizes in bytes, the first 4 of 20 bytes of the SHA-1 checksum in hexadecimal, and the filename as saved. If a file is split among multiple blocks, then the filename is shown for only the first block. If multiple files are packed into a single block (a solid archive), then the memory is shown for only the first segment in the block. It is not possible to update or delete a file without updating or deleting all of the files after it in the same block. Blocks are independent from each other and can be compressed or decompressed in parallel by separate threads, but segments within blocks must be processed sequentially.

l with no archive argument lists the HCOMP and PCOMP strings for the model selected by -m in a format suitable for pasting into a C++ program.

r [input [output]]

Run the post-processor (PCOMP section) of the configuration file specified by -m as a stand-alone program taking input from file input (default standard input) and writing to file output (default standard output). The code is executed once for each input byte and once more with -1 (EOF) as input (simulating what happens during decompression). The command is useful mainly for testing and debugging configuration files.

t [N]...

Trace the post-processor (PCOMP section) of the configuration file specified by -m once for each numeric argument N as input. As each ZPAQL instruction is executed, the instruction and register contents are displayed on a single line. When the program ends (executes HALT), the contents of any nonzero memory is displayed. Numeric arguments can either be written in decimal or hexadecimal with a leading "x" (as in 255 or xff). The output is displayed in the same base as the input. The command is useful for debugging configuration files.

u

Update the archive to make it consistent with external files, and then optionally add files as with a. For each internal file, if the external file does not exist or is different, then delete the internal file, leaving the order otherwise unchanged. If the external file is different, then append a compressed copy of it to the end of the archive.

Update fails for solid archives (created with -bs) if any internal file to be replaced or deleted is followed in the same block by a file not to be replaced or deleted.

x archive[.zpaq]
x archive[.zpaq] output/
x archive[.zpaq] output
x archive[.zpaq] output file

Extract. The command has several forms. In the first form, the contents of the archive are compared with the external files of the same name. If the external file exists, then it is compared and listed as either identical or different. In either case, the external file is not modified. If the external file does not exist, then it is created. If the first segment of the archive is not named (as created with c), then the external file name is taken to be archive (without the .zpaq extension). Directories are created as needed. Backslash (\) characters in filenames are converted to forward slashes (/) in Linux, and vice versa in Windows.

If output/ is specified with a trailing slash or backslash, then internal files are first renamed by removing any path (up to and including the last slash in the file name) and replacing it with output/. The effect is to compare or extract all of the files to the contents of the directory output.

If output does not have a trailing slash, then the entire contents of the archive are decompressed, concatenated, and written to the file output. output is overwritten if it exists unless the file is identical.

If file is specified, then only file is extracted to output, rather than the entire archive. If the archive does not contain a matching file, then output is not changed.


OPTIONS

-b{N|s}

During compression, split large files into blocks of size N MB. -b0 means to compress using one block per file. -bs means to compress all of the input files into one block (solid mode). For example, -b0.25 specifies blocks of 250,000 bytes.

For compression levels -m1 and -m2, the default block size is -b16 or 16 MB. The largest block size allowed is -b268.435199. -b0, -bs, or any larger size will automatically select this value. For levels -m3, -m4, and configuration files, the default is -b0.

Generally, larger blocks compress better but slower because fewer blocks are available to run in parallel on separate processors, and (for -m1 and -m2) require more memory to compress and decompress. -bs will give the best compression if there is any mutual information between files. However, archives created with -bs cannot be updated because it is not possible to delete files from the beginning or middle of a block without deleting all of the files after it. Extraction of single files will be slower because it is necessary to decompress and discard the output of earlier files in the same block.

-f

Force overwrite of output files during extraction when the file differs. Normally, the output is overwritten if an output file is specified.

-h

With r or t, -h means to run or trace the HCOMP section of the configuration file specifed by -m, rather than the PCOMP section. Also with r, do not run with an extra EOF at the end of input. The HCOMP section computes model contexts. Configuration files are described below.

-m{1|2|3|4|config[,N]...}

Select compression level from -m1 (fastest) to -m4 (best), or use a custom algorithm described in the configuration file config.cfg with up to 9 comma-separated numeric arguments. The default is -m1. The following table shows memory usage and equivalent configurations for the 4 built in compression levels.

    Level  Config     Memory per thread  Algorithm
    -----  ------     -----------------  ---------
    -m1    -mbwtrle1  5x block*          BWT + RLE + CM1
    -m2    -mbwt2     5x block*          BWT + CM2
    -m3    -mmid      111 MB             CM8
    -m4    -mmax      246 MB             CM22
    *block size is rounded up to a power of 2

By default, -m1 and -m2 use 80 MB memory per thread because the default block size is 16 MB.

-n

Do not save block header locator tags, comments (file sizes) or checksums. This saves a few bytes for the smallest possible archive. However, incremental update and error recovery will not work.

-q

Do not test the post-processor when compressing with a configuration file that uses one. The default is to run the preprocessed input (created with a user supplied preprocessor) through the post-processor specified in the configuration file and compare the SHA-1 checksum of the output with the original input and refuse to compress if they do not match.

-r

Recurse subdirectories. For each directory specified on the command line, compress all of the files and directories in it. Otherwise directories are ignored. Also, in the Linux version, the program will compress only regular files, ignoring symbolic links, devices, named pipes, sockets, and other special file types.

-tN

Use N threads during compression or decompression. The default is the number of processors detected. Using fewer threads can reduce memory usage but run slower. Using more threads than processors or blocks does not run any faster.

-v

Verbose mode. This is only useful for debugging the source code.


EXIT STATUS

Returns 0 if successful or 1 in case of an error.


CONFIGURATION FILES

Context Mixing Theory

All ZPAQ compliant programs use a configurable compression algorithm based on the PAQ model of bitwise context modeling plus optional pre- and post-processing. The ZPAQ standard defines the model precisely.

A model treats the data as a stream of bits which are predicted and arithmetic coded one at a time. The compression ratio depends on the accuracy of the predictions or probability assignments. When the next bit is assigned a probability p that it will be a 1, it means that if the bit is actually 1, it will be coded at a cost of log2(1/p) bits, and if it is a 0 it will be coded at a cost of log2(1/(1-p)). ZPAQ codes bits in MSB (most significant bit) to LSB (least significant bit) order, although either order would work.

During decompression, a model makes an identical series of predictions based on the previously decoded output. Thus, the same model is used for both compression and decompression, and only needs to be specified once. This differs from preprocessing and postprocessing, which should be inverses of each other. A ZPAQ archive only saves the postprocessing code. It is up to the compressor to verify its correctness.

ZPAQ (and PAQ) achieve a high compression ratio by using many independent context models to estimate p and then combining the estimates by weighted averaging and possibly applying further context-sensitive refinements. These components can be put together like building blocks. Generally there is a 3 way tradeoff between compression ratio, speed, and memory. Using more components usually results in better compression but takes longer to execute. Allocating more memory to components allows more statistics to be collected, which improves compression.

A context model component takes a context as input and outputs a prediction. After the predicted bit is learned, the model is adjusted to reduce the prediction error. It is up to the model to select useful contexts. For example, for text compression, the useful contexts are order-n (the last n whole bytes) and the last whole word or two. For images, a useful context might be the high order bits of some neighboring pixels.

Compression can sometimes be improved by preprocessing and postprocessing. For example, text can sometimes be compressed better by using a dictionary to encode common words using 1 or 2 byte codes before compression. The postprocessor would expand the codes back into words. An image preprocessor might predict pixels using some weighted average of earlier nearby pixels (perhaps dynamically tuning the weights to reduce prediction errors) and then compress the differences with a simple order-0 model. The postprocessor would make an identical set of predictions and add them to the decompressed differences.

The user is responsible for making sure that preprocessing followed by postprocessing will restore the original data exactly, but zpaq will still check before compressing.

Types of components

ZPAQ uses several component types. All of them take a context as input and output a prediction. Some also take the outputs of other components as inputs.

A simple context model (CM) computes a prediction by counting bits. For example, a sequence "00001" in some context would assign p = about 1/5 (or more precisely, 1.5/6 by starting both counts at 0.5). The prediction is achieved indirectly by starting with a prediction of p = 1/2 and a count of 0, and then adjusting p to reduce the error in inverse proportion to the count.

However, such a prediction might not be correct because the bits in the sequence might not be independent. A direct context map effectively gives higher weights to more recent history (so p > 1/5) by placing an upper bound on the count. A smaller bound makes the model more adaptive, but could also make compression worse if the input data has uniform statistics so that the bits really are independent.

A more general solution is to use an indirect context model (ICM) to predict based on what happened when the same bit sequence occurred in other contexts. This is achieved by mapping a context to a bit history table, and then mapping the bit history to a prediction. The bit history is updated by appending the next bit, and the prediction is updated to reduce the prediction error, usually at a small, constant rate (like 0.001 times the error). To save space, the bit history is represented as an 8 bit state representing a count, a ratio of zeros to ones, and the value of the last bit. The count is bounded to a small value. The initial prediction is set to the ratio.

A mixer takes a set of predictions from other components and combines them by weighted averaging. On update, the mixer reduces its output error by adjusting the weights to favor the input models that made better predictions. Weighted averaging occurs in the logistic domain, log(p/(1-p)), which gives greater weight to high confidence predictions, which are those near 0 or 1. The initial weights for an n-input mixer are 1/n.

It is sometimes advantageous to select from a set of weight vectors using a low order context. Compression can often be improved further (at a cost in speed) by using a hierarchy of mixers taking different contexts.

A prediction from a mixer (or the last mixer in a hierarchy) can be further refined using secondary symbol estimation (SSE). An SSE picks the output prediction from an array of 32 simple context models (CM) by interpolating between the two nearest quantized input prediction values.

An indirect SSE (ISSE) works like an SSE except that it uses bit histories as contexts like an ICM to save memory. This makes it appropriate for higher order contexts. It is possible to model directly using a chain of ISSE in increasing order starting with an order 0 CM or ICM as the first component in the chain. Unline an SSE, an ISSE adjusts the input probability by mixing it with a fixed constant using the bit history to select the pair of mixing weights. Initially it assigns a weight of 1 to the input prediction and 0 to the constant.

A MATCH component predicts bits by looking for the most recent occurrence of the current (usually high order) context and predicting the next bit following the match. The weight of the prediction is proportional to the length of the match, i.e. directly proportional in the logistic domain, and negative if the predicted bit is 0. Matches are found by saving past data in a history buffer and using a hash table of pointers into the buffer indexed by a context hash. If no match is found, then the prediction is 1/2, or 0 in the logistic domain.

Context computation

ZPAQ uses a program written in a langauge called ZPAQL to compute 32-bit contexts or context hashes and save them in an array. Each array element then becomes the context for one component. As a speed optimization, the program is called only once per byte rather than once per bit, with the last data byte as input. This means that the computed context must be combined with the next 0 to 7 data bits. The method of combining the computed and bitwise contexts depends on the component. Generally, only the low order bits of the context are used, depending on the size of the component.

For indirect models (ICM and ISSE), bit histories are stored in a hash table that is looked up on nibble (4 bit) boundaries. Each table element contains an 8 bit checksum confirmation to detect (most) hash collisions, and an array of 15 bytes representing bit histories for each of the 15 bitwise contexts that result from appending 0 to 3 more bits of context. Thus, the table is looked up twice per byte. The first lookup uses the low bits of the computed hash and the next higher 8 bits as a hash confirmation. The second lookup adds 16 x (16 + nibble) to it. The hash table looks up 3 adjacent values within the same 64 byte cache line and uses least frequeny used (LFU) replacement based on the count represented by the nibble boundary bit history if a matching checksum is not found.

For a mixer or SSE, a 1 is appended to the beginning of the partial byte context to create an 8 bit value, which is added to the computed context. For example, if the first 5 bits of the current byte are xxxxx, then the binary number 1xxxxx is added to the context to predict the next bit.

For a CM, the bitwise context is expanded to a 9 bit value by splitting it into two nibbles, appending a 1 bit to the partial low nibble, and setting the first bit to indicate that the context includes two nibbles. For example, xx is expanded to 0000001xx, and xxxxxx is expanded to 1xxxx01xx. Then this context is XORed with the computed context. This expansion reduces cache misses to improve speed by causing four consecutive contexts to fall within the same 64 byte cache line.

Config file format

The compression and decompression algorithm is described in a configuration file, which must have a .cfg extension. The decompression algorithm is stored in the ZPAQ archive. The configuration file is only needed during compression. It has 3 parts:

COMP - a description of a sequence of bit predictors. Each component takes a context and earlier predictions as input, and outputs a prediction for the next bit. The last component's prediction is output to the arithmetic coder.

HCOMP - a program that is called once for each byte of uncompressed data with that byte as input, and outputs an array of 32-bit contexts, one for each component in the COMP section. The program is written in ZPAQL.

POST/PCOMP - an optional pair of programs to preprocess the input before compression and postprocess the output after decoding for decompression. POST indicates no pre- or postprocessing. The model described by COMP and HCOMP sees a 0 byte followed by a concatenation of the uncompressed files. During decompression, the leading 0 indicates no postprocessing.

PCOMP describes an external program to preprocess the input files during compression, and a ZPAQL program to perform the reverse conversion to restore the original input. The compression model described in the COMP and HCOMP sections sees a 1 as the first byte to indicate that the decoded data should be postprocessed before output. This is followed by a 2 byte program length (LSB first), the ZPAQL postprocessor code, and a concatenation of the preprocessed input files. The PCOMP code sees just the preprocessed files as input, each ending with EOS (-1).

The preprocessor is an external program. It is not needed for decompression so it is not saved in the archive. It expects to be called with an input filename and output filename as its last 2 arguments.

The postprocessor is a ZPAQL program that is called once for each input byte and once with input EOS (-1) at the end of each segment. The program is initialized at the beginning of a block but maintains state information between segments within the same block. Its input is from a temporary file created by the preprocessor during compression testing and from the decoder during decompression.

The configuration file has the following format:

    COMP hh hm ph pm n
      (n numbered component descriptions)
    HCOMP
      (program to generate contexts, memory size = hh, hm)
    POST (for no pre/post procesing)
      0
    END

Or (for custom pre/post processing):

    COMP hh hm ph pm n
      (...)
    HCOMP
      (...)
    PCOMP preprocessor-command ;
      (postprocessor program, memory size = ph, pm)
    END

Configuration files are free format (all white space is the same) and mostly not case sensitive. They may contain comments in ((nested) parenthesis).

Components

The COMP section has 5 arguments (hh, hm, ph, pm, n) followed by a list of n components numbered consecutively from 0 through n-1. hh, hm, ph, and pm describe the sizes of the arrays used by the HCOMP and PCOMP virtual machines as described later. Each machine has two arrays, H and M. Their sizes are 2^hh and 2^hm respectively in HCOMP, and 2^ph and 2^pm in PCOMP. The HCOMP program computes the context for the n components by placing them in H[0] through H[n-1] as 32-bit numbers. Thus, hh should be set so that 2^hh >= n. For example, in mid.cfg, there are n = 8 components, so hh = 3, allowing up to 8 contexts. Larger values would work but waste memory. Memory usage is 2^(hh+2) + 2^hm + 2^(ph+2) + 2^pm bytes.

Each component outputs a "stretched" probability in the form ln(p(1)/p(0)). where p(1) and p(0) are the model's estimated probabilities that the next bit will be a 1 or 0, respectively. Thus, negative numbers predict a 0 and positive predict 1. The magnitude is the confidence of the prediction. The output is a number in the range -32 to 32 with precision 1/64 (a 12 bit signed number).

The 9 possible component types are shown below. All component parameters are numbers in the range 0..255. The component numbers i must be consecutive from 0 to n-1.

i CONST c (constant)

Output is a constant (c-128)/16. Thus, numbers larger than 128 predict 1 and smaller predict 0, regardless of context. CONST is very fast and uses no memory.

i CM s limit (context model)

Outputs a prediction by looking up the context in a table of size 2^s using the s low bits of the H[i] (for component i) XORed with a 9 bit expansion of the previously coded (high order) bits of the current byte. (Recall that H[i] is updated once per byte). Each table entry contains a prediction p(1) and a count in the range 0..limit*4 (max 1020). The prediction is updated in proportion to the prediction error and inversely proportional to the count. A large limit (max 255) is best for stationary sources. A smaller value makes the model more adaptive to changing statistics. Memory usage is 2^(s+2) bytes. s must not exceed 26 (4 GiB memory).

i ICM s (indirect context model)

Outputs a prediction by looking up the context in a hash table of size 64 * 2^s bit histores (1 byte states). The histories index a second table of size 256 that outputs a prediction. The table is updated by adjusting the prediction to reduce the prediction error (at a slow, fixed rate) and updating the bit history. The hash table is indexed by the low s+10 bits of H[i] and the previous bits of the current byte, with highest 8 bits (s+9..s+2) used to detect hash collisions. An ICM works best on nonstationary sources or where memory efficiency is important. It uses 2^(s+6) bytes. s must not exceed 26 (4 GiB memory).

i MATCH s b

Outputs a prediction by searching for the previous occurrence of the current context in a history buffer of size 2^b bytes, and predicting whatever bit came next, with a confidence proportional to the length of the match. Matches are found using an index of 2^s pointers into the history buffer, each of which points to the previous occurrence of the current context. A MATCH is useful for any data that has repeat occurrences of strings longer than about 6-8 bytes within a window of size 2^b. Generally, larger b (up to the file size) gives better compression, and s = b-2 gives adequate indexing. The context should be a high order hash. Memory usage is 4*2^s + 2^b bytes. s and b must not exceed 32 each (20 GiB memory).

i AVG j k wt (fixed weight 2 input mixer)

Averages the predictions of components j and k (which must precede i, the current component). The average is weighted by wt/256 for component j and 1 - wt/256 for component k. Often, averaging two predictions gives better results than either prediction by itself. wt should be selected to favor the more accurate component. AVG is very fast and uses no memory.

i MIX2 s j k rate mask (2 input mixer)

Averages the predictions of components j and k (which must precede i, the current component) adaptively. The weight is selected from a table of size 2^s by the low s bits of H[i] added to the masked, previously coded bits of the current byte (an 8 bit value). A mask of 255 includes the current byte, and a mask of 0 excludes it. (Other masks are rarely useful). The adaptation rate is selectable. Typical values are around 8 to 32. Lower values are best for stationary sources. Higher rates are more adaptive. A MIX2 generally gives better compression than AVG but at a cost in speed and memory. Uses 2^(s+1) bytes of memory. s must not exceed 32 (8 GiB memory).

i MIX s j m rate mask (m-input mixer)

A MIX works like a MIX2 but with m inputs over a range of components j..j+m-1, all of which must precede i. A typical use is as the final component, taking all other components as input with a low order context. A MIX with 2 inputs is different than a MIX2 in that the weights are not constrained to add to 1. This sometimes gives better compression, sometimes worse. Memory usage is m*2^(s+2) bytes and must not exceed 16 GiB. Execution time is proportional to m.

i ISSE s j (indirect secondary symbol estimator)

An ISSE takes a prediction and a context as input and outputs an adjusted prediction. The low s+10 bits of H[i] and the previous bits of the current byte index a hash table of size 2^(s+6) bit histories as with an ICM. The bit history is used as an 8 bit context to select a pair of weights for a 2 input MIX (not a MIX2) with component j (preceding i) as one input and a CONST 144 as the other. The effect is to adjust the previous prediction using a (typically longer) context. A typical use is a chain of ISSE after a low order CM or ICM working up to higher order contexts as in mid.cfg. (This architecture is also used in the PAQ9A compressor). Uses 2^(s+6) bytes. s must not exceed 26 (4 GiB memory).

i SSE s j start limit (secondary symbol estimator)

An SSE takes a predicion and context as input (like an ISSE) and outputs an adjusted prediction. The mapping is direct, however. The input from component j and the context are mapped to a 2^s by 64 CM table by quantizing the prediction to 64 levels and interpolating between the two nearest values. The context is formed by adding the partial current byte to the low s bits of H[i]. The table is updated in proportion to the prediction error and inversely proportional to a count as with a CM. The count is initialized to start and has the range (start..limit*4). A large limit is best for stationary sources. A smaller limit is more adaptive. The starting count does not start at 0 because the table is initialized so that output predictions are the same as input predictions regardless of context. If the initial guess is close, then a higher start value works better.

An SSE sometimes gives better compression than an ISSE, especially on stationary sources where a CM works better than an ICM. But it uses 2^12 times more memory for the same context size, so it is useful mostly for low order contexts. A typical use is to adjust the output of a MIX. It is sometimes followed by an AVG to average its input and output, typically weighted 75% to 90% in favor of the output. Sometimes more than one SSE or SSE-AVG pair is used in series with progressively higher order contexts, or may be used in parallel and mixed. An SSE uses 2^(s+8) bytes. s must not exceed 26 (16 GiB memory).

All components are designed to work with context hashes that are uniformly distributed over the low order bits (depending on the s parameter for that component). A CM, MIX2, MIX, or SSE may also be used effectively with direct context lookup for low orders. In this case, the low 9 bits of a CM or low 8 bits of the other components should be cleared to leave space to combine with the bits of the current byte. This is summarized:

    Component              Context size    Memory          Limit
    -------------------    ------------    ------          ------
    CONST c                0               0
    CM s limit             s               2^(s+2)         16 GiB
    ICM s                  s+10            2^(s+6)         4  GiB
    MATCH s b              s               2^(s+2) + 2^b   20 GiB
    AVG j k wt             0               0
    MIX2 s j k rate mask   s               2^(s+1)         8  GiB
    MIX s j m rate mask    s               m*2^(s+2)       16 GiB
    ISSE s j               s+10            2^(s+6)         4  GiB
    SSE s j start limit    s               2^(s+8)         16 GiB

In addition to the specified limits per component, this program will not create arrays 2 GiB (2^31) or larger when compiled for 32 bit operating systems.

ZPAQL

There are one or two ZPAQL programs in a configuration file. The first, HCOMP, describes a program that computes the context hashes. The second, PCOMP, is optional. It describes the code that inverts any preprocessing performed by an external program prior to compression. The COMP and HCOMP sections are stored in the block headers uncompressed. PCOMP, if used, is appended to the start of the input data and compressed along with it.

Each virtual machine has the following state:

    4 general purpose 32 bit registers, A, B, C, D.
    A 1 bit flag register F.
    A 16 bit program counter, PC.
    256 32-bit registers R0 through R255.
    An array of 32 bit elements, H, of size 2^hh (HCOMP) or 2^ph (PCOMP).
    An array of 8 bit elements, M, of size 2^hm (HCOMP) or 2^pm (PCOMP).

Recall that the first line of a configuration file is:

    COMP hh hm ph pm n

HCOMP is called once per byte of input to be compressed or decompressed with that byte in the A register. It returns with context hashes for the n components in H[0] through H[n-1].

PCOMP is called once per decompressed byte with that byte in the A register. At the end of a segment, it is called with EOS (-1) in A. Output is by the OUT instruction. The output should be the uncompressed data exactly as it was originally input prior to preprocessing. H has no special meaning.

All state variables are initialized to 0 at the beginning of a block. State is maintained between calls (and across segment boundaries) except for A (used for input) and PC, which is reset to 0 (the first instruction).

The A register is used as the destination of most arithmetic or logical operations. B and C may be used as pointers into M. D points into H. F stores the result of comparisons and is used to decide conditional jumps. R0 through R255 are used for auxilary storage. All operations are modulo 2^32. All array index operations are modulo the size of the array (i.e. using the low bits of the pointer). The instruction set is as follows:

    - Y=Z (assignment)
      - where Y is A B C D *B *C *D
      - where Z is A B C D *B *C *D (0...255)
    - AxZ (binary operations)
      - where x is += -= *= /= %= &= &~ |= ^= <<= >>= == < >
      - where Z is as above.
    - Yx (unary operations)
      - where Y is as above
      - where x is <>A ++ -- ! =0
      - except A<>A is not valid.
    - J N (conditional jumps)
      - where J is JT JF JMP
      - where N is a number in (-128...127).
    - LJ NN (long jump)
      - where NN is in (0...65535).
    - X=R N (read R array)
      - where X is A B C D
      - where N is in (0...255).
    - R=A N (write R array)
      - where N is in (0...255).
    - ERROR
    - HALT
    - OUT
    - HASH
    - HASHD

All instructions except LJ are 1 or 2 bytes, where the second byte is a number in the range 0..255 (-128..127 for jumps). A 2 byte instruction must be written as 2 tokens separated by a space, e.g. "A= 3", not "A=3" or "A = 3". The exception is assigning 0, which has a 1 byte form, "A=0".

The notation *B, *C, and *D mean M[B], M[C], and H[D] respectively, modulo the array sizes. For example "*B=*D" assigns M[B]=H[D] (discarding the high 24 bits of H[D] because M[B] is a byte).

Binary operations always put the result in A.

+=, -=, *=, &=, |=, ^=, = have the same meanings as in C/C++.

/=, %= have the result 0 if the right operand is 0.

A&~B means A &= ~B;

A<<=B, A>>=B mean the same as in C/C++ but are explicitly defined when B > 31 to mean the low 5 bits of B.

<, >, == compare and put the result in F as 1 (true) or 0 (false). Comparison is unsigned. Thus PCOMP would test for EOS (-1) as "A> 255". There are no !=, <=, or <= operators.

B<>A means swap B with A. A must be the right operand. "A<>B" is not valid. When 32 and 8 bit values are swapped as in "*B<>A", the high bits are unchanged.

++ and -- increment and decrement as in C/C++ but must be written in postfix form. "++A" is not valid. Note that "*B++" increments *B, not B.

! means to complement all bits. Thus, "A!" means A = ~A;

JT (jump if true), JF (jump if false), and JMP (jump) operands are relative to the next instruction in the range -128..127. Thus "A> 255 JT 1 A++" increments A not to exceed 256. A jump outside the range of the program is a run time error.

LJ is a long jump. It is 3 bytes but the operand is written as a number in the range 0..65535 but not exceeding the size of the program. Thus, "A> 255 JT 3 LJ 0" jumps to the beginning of the program if A <= 255.

The R registers can only be read or written, as in "R=A 3 B=R 3" which assigns A to R3, then R3 to B. These registers can only be assigned from A or to A, B, C, or D.

ERROR causes an error like an undefined instruction, but is not reserved for future use (possibly in ZPAQ level 2) like other undefined instructions.

HALT causes the program to end (and compression to resume). A program should always execute HALT.

OUT in PCOMP outputs the low 8 bits of A as one byte to the file being extracted. In HCOMP it has no effect.

HASH is equivalent to A = (A + *B + 512) * 773; HASHD is equivalent to *D = (*D + A + 512) * 773; These are convenient for computing context hashes that work well with the COMP components. They are not required, however. For example, "A+=*D A*= 12 *D=A" updates a rolling order s/2 context hash for an s-bit wide component pointed to by D. In general, an order ceil(s/k) hash can be updated by using a multiplier which is an odd multiple of 2^k. HASH and HASHD are not rolling hashes. They must be computed completely for each context. HASH is convenient when M is used as a history buffer.

In most programs it is not necessary to code jump instructions. ZPAQL supports the following structured programming constructs:

    IF ... ENDIF              (execute ... if F is true)
    IF ... ELSE ... ENDIF     (execute 1st part if true, 2nd if false)
    IFNOT ... ENDIF           (execute ... if F is false)
    IFNOT ... ELSE ... ENDIF  (execute 1st part if false, 2nd if true)
    DO ... WHILE              (loop while true (test F at end))
    DO ... UNTIL              (loop while false)
    DO ... FOREVER            (loop forever)

These constructs may be nested 1000 deep. However IF statements and DO loops nest independently and may be crossed. For example, the following loop outputs a 0 terminated string pointed to by *B by breaking out when it finds a 0. The construct DO IF...FOREVER ENDIF is equivalent to a "while" loop.

    DO
      A=*B A> 0 IF (JF endif)
        OUT B++
      FOREVER (JMP do)
    ENDIF

IF, IFNOT, and ELSE are coded as JF, JT and JMP respectively. They can only jump over at most 127 instructions. If the code in these sections are longer, then use the long forms IFL, IFNOTL, or ELSEL. These behave the same but are coded using LJ instead. There are no special forms for WHILE, UNTIL, or FOREVER. The compiler will automatically use the long forms when needed.

Parameters

In a config file, paramaters may be passed as $1, $2, ..., $9. These are replaced with numeric values passed on the command line. For example, -mmax,3,4 would have the effect of replacing $1 with 3 and $2 with 4. The default value is 0, i.e. $3 through $9 are replaced with 0.

In addition, a parameter may have the form $N+M, where N is 1 through 9 and M is a number. The effect is to add M. For example, $2+10 would be replaced with 14. Parameters may be used anywhere in the config file where a number is allowed.

Pre/Post processing

The PCOMP/POST section has the form:

    POST 0 END

to indicate no preprocessing or postprocessing, or

    PCOMP preprocessor-command ;
      (postprocessing code)
    END

to preprocess with an external program and to invert the transform with postprocessing code written in ZPAQL. The preprocessing command must end with a space followed by a semicolon. The command may contain spaces or options. The program is expected to take as two additional arguments an input file and an output file. ZPAQ will call the program by passing it the input file and a temporary file. Then it will (unless prevented by -q) run the temporary file through the postprocessing code and verify that its SHA-1 checksum matches the input. If so, then it compresses the temporary file in a second pass. If the input file is divided into blocks, then the program will create temporary files of the appropriate sizes and pass them to the preprocessor.


COMPRESSION ALGORITHMS

The four built-in compression levels -m1 through -m4 have corresponding configuration files bwtrle1.cfg, bwt2.cfg, mid.cfg, and max.cfg respectively, shown below.

-m1

bwtrle1 preprocesses the input using a Burrows-Wheeler transform (BWT) followed by run length encoding (RLE). A BWT divides the input into blocks (with size selected by -b), appends an EOF (-1), and sorts the characters by their right context. This transform brings together bytes with similar contexts, so it tends to produce long runs of identical values which are easily compressed. The BWT is reversible as long as the location of the EOF symbol is known. Because -1 cannot be stored as a byte, it is replaced with a 255 and its location is appended to the end as a 4 byte little-endian (LSB first) index. The stream is then RLE encoding, such that sequences of 2 to 257 identical bytes are replaced with the first two bytes and a count of the remaining bytes. Next, the encoded stream is compressed with an order 0 ICM which uses the RLE state (literal or count) as context.

The code below takes one argument to control the block size. zpaq generates code that effectively selects the argument $1 to choose the smallest block (least memory) possible. The block size must be a power of 2 and at least 257 bytes larger than the input size. It calls the external program bwtrle to perform the BWT and RLE encoding.

The HCOMP section computes the RLE state and stores it in H[0] as context to the ICM. The code is called with the current input byte in A. D is initialized to 0 so it always points to H[0]. C contains the previous input byte. B contains the RLE state, a count of how many times the last two bytes were the same. If B is 2, then A contains a count. If B is 0 or 1 then it contains a literal. The last line of code before HALT stores a hash of B in H[0]. During modeling, the leading bits of the current byte are added to this hash to compute the context.

The PCOMP section receives decoded bytes and a final EOF (-1) in A. Before the EOF, it receives each byte, expands runs, and stores the result in the M array. It uses B as a pointer to the next byte in M to be written, and C as a temporary copy of the input byte. It uses D as the RLE state. It is initially 0. After the first byte, it is set to A+1 (1..256). If the second byte is a match, then it is set to A+257 (257..512) to indicate that a count is expected next. When the count is received, D-257 is stored the indicated number of times.

When EOF is input (A>255 as an unsigned), the program computes the inverse BWT and outputs it. The generic inverse BWT algorithm of buf[0..n-1], buf[idx]==EOF outputs n-1 bytes (no EOF) as follows:

    unsigned char buf[n];  // input BWT
    unsigned int idx;      // input location of EOF in BWT
    FILE* out;             // output
    unsigned int count[256]={0};  // Cumulative byte counts
    unsigned int list[n]={0}      // Circular linked list
    // Count bytes
    for (size_t i=0; i<n; ++i)
      ++count[(buf[i]+1)&255];
    // Cumulative counts including EOF
    count[0]=1;
    for (size_t i=1; i<256; ++i)
      count[i]+=count[i-1];
    // Build linked list including EOF at idx
    for (size_t i=0; i<idx; ++i)
      list[count[buf[i]]++]=i;
    for (size_t i=idx+1; i<n; ++i)
      list[count[buf[i]]++]=i;
    // Scan linked list and output n-1 bytes
    for (size_t p=idx; p;) {
      p=list[p];
      putc(buf[p], out);
    }

In the equivalent ZPAQL code, buf[0..n-1] and idx are appended in M with b == n+4. The first section reads idx into C and R1 and leaves n in B and R2. The next two loops store the cumulative byte counts in the last 257 elements of H in reverse order (H[-1..-256]). The remaining elements of H are used for the linked list. The list has to be built in 2 parts because buf[idx] contains 255 rather than -1. The final loop outputs n-1 bytes.

    (BWT + RLE + order 0 model.
    Requires 80 MiB memory for files up to 16 MiB - 257 bytes.
    Use argument 1 to double, -1 to halve)
    comp 1 0 $1+24 $1+24 1
      0 icm 7
    hcomp
      a==c c=a if
        b++
      else
        b=0
      endif
      a=b *d=0 hashd
      halt
    pcomp bwtrle e ;
      (read BWT and index into M[0..b-1] with RLE decoding:
       X X n -> n+2 copies of X)
      a> 255 ifnot
        c=a (copy of input byte)
        a=d a== 0 if (d=0 = initial RLE state: write c and save c+1 in d)
          d=c d++
          *b=c b++
        else
          a=d a-- a> 255 ifnot (d=1..256 = last byte + 1)
            a==c if
              a+= 255 a++ a++ d=a (d=257..512 = RLE count is next)
            else
              d=c d++
            endif
            *b=c b++
          else (d=257..512 = write c repeats of d-257)
            d--
            do a=c a> 0 if
              *b=d b++ c--
            forever endif
            d=0 (reset RLE state)
          endif
        endif
      (inverse BWT)
      else
        
        (index in last 4 bytes, put in c and R1)
        b-- a=*b
        b-- a<<= 8 a+=*b
        b-- a<<= 8 a+=*b
        b-- a<<= 8 a+=*b c=a r=a 1
        (save size in R2)
        a=b r=a 2
        (count bytes in H[~1..~255, ~0])
        do
          a=b a> 0 if
            b-- a=*b a++ a&= 255 d=a d! *d++
          forever
        endif
        (cumulative counts: H[~i=0..255] = count of bytes before i)
        d=0 d! *d= 1 a=0
        do
          a+=*d *d=a d--
        d<>a a! a> 255 a! d<>a until
        (build first part of linked list in H[0..idx-1])
        b=0 do
          a=c a>b if
            d=*b d! *d++ d=*d d-- *d=b
          b++ forever
        endif
        (rest of list in H[idx+1..n-1])
        b=c b++ c=r 2 do
          a=c a>b if
            d=*b d! *d++ d=*d d-- *d=b
          b++ forever
        endif
        (traverse list and output)
        d=r 1 do
          a=d a== 0 ifnot
            d=*d
            b=d a=*b out
          forever
        endif
      endif
      halt
    end
-m2

bwt2 works like bwtrle1 except that the RLE step is skipped and the context model has two components. These both improve compression at a cost in speed. The order 0 ICM output is refined through an order 1 ISSE for better compression. In the code below, the inverse BWT is identical in the POST section, but the RLE decoding prior to EOF is omitted. The HCOMP section does not compute any context for component 0 (because there is no RLE state) but uses the current byte for component 1.

    comp 1 0 $1+24 $1+24 2
      0 icm 5
      1 isse 12 0
    hcomp
      d= 1 *d=0 hashd halt
    pcomp bwtrle c ;
    
      (read BWT, index into M, size in b)
      a> 255 ifnot
        *b=a b++
    
      (inverse BWT)
      else
    
        (index in last 4 bytes, put in c and R1)
        b-- a=*b
        b-- a<<= 8 a+=*b
        b-- a<<= 8 a+=*b
        b-- a<<= 8 a+=*b c=a r=a 1
    
        (save size in R2)
        a=b r=a 2
    
        (count bytes in H[~1..~255, ~0])
        do
          a=b a> 0 if
            b-- a=*b a++ a&= 255 d=a d! *d++
          forever
        endif
    
        (cumulative counts: H[~i=0..255] = count of bytes before i)
        d=0 d! *d= 1 a=0
        do
          a+=*d *d=a d--
        d<>a a! a> 255 a! d<>a until
    
        (build first part of linked list in H[0..idx-1])
        b=0 do
          a=c a>b if
            d=*b d! *d++ d=*d d-- *d=b
          b++ forever
        endif
    
        (rest of list in H[idx+1..n-1])
        b=c b++ c=r 2 do
          a=c a>b if
            d=*b d! *d++ d=*d d-- *d=b
          b++ forever
        endif
    
        (traverse list and output)
        d=r 1 do
          a=d a== 0 ifnot
            d=*d
            b=d a=*b out
          forever
        endif
      endif
      halt
    end
-m3

mid uses no pre/post processing. It has 8 components: an order 0 through 5 ICM-ISSE chain, an order 7 MATCH, and a final order 1 mixer that combines all 7 other predictions. In the code, M is used as a rotating bufer containing the last 8 input bytes and C points to the head of the buffer. The code computes hashes of context orders 0, 1, 2, 3, 4, 5, 7, 1 and puts them in D[0..7] for those respective components. The mixer takes a 16 bit order-1 context which is not hashed. The last byte is placed in the high order bits and the leading bits of the current byte are placed in the 8 low order bits.

The order 0-5 ICM-ISSE chain components use increasingly large hash tables up to 2^(19+6) = 32 MiB memory for component 5. Each takes its input from the previous component. The MATCH uses an index of size 2^22 pointers and a buffer of size 2^24, which use 16 MiB each. The MIX has a 16 bit context, takes 7 inputs starting at component 0, a learning rate of 24, and a mask of 255 (no mask).

    comp 3 3 0 0 8 (hh hm ph pm n)
      0 icm 5        (order 0...5 chain)
      1 isse 13 0
      2 isse 17 1
      3 isse 18 2
      4 isse 18 3
      5 isse 19 4
      6 match 22 24  (order 7)
      7 mix 16 0 7 24 255  (order 1)
    hcomp
      c++ *c=a b=c a=0 (save in rotating buffer M)
      d= 1 hash *d=a   (orders 1...5 for isse)
      b-- d++ hash *d=a
      b-- d++ hash *d=a
      b-- d++ hash *d=a
      b-- d++ hash *d=a
      b-- d++ hash b-- hash *d=a (order 7 for match)
      d++ a=*c a<<= 8 *d=a       (order 1 for mix)
      halt
    post
      0
    end
-m4

max has 22 components. It has an order 0..5, 7 ICM-ISSE chain for general purpose compression, a MATCH, a whole word order 0-1 ICM-ISSE chain for modeling text, three order 1 ICM sparse models with gaps of 1 to 3 bytes, one ICM with a gap of 216 (one scan line) for modeling binary FAX images, and one fixed prediction for mixer bias. The two word contexts compute hashes of whole words by considering only sequences of the letters a through z, ignoring case. These predictions are mixed with a hierarchy of 3 mixers: one order 0, one order 1, and one context free 2 input mixer to combine them. This prediction is passed through two SSE stages (order 0 and 1) with an order 0 mixer and a context free mixer of the SSE inputs and outputs to allow the SSE to be partially bypassed adaptively. It does not use pre/post processing.

    comp 5 9 0 0 22 (hh hm ph pm n)
      0 const 160
      1 icm 5  (orders 0-6)
      2 isse 13 1 (sizebits j)
      3 isse 16 2
      4 isse 18 3
      5 isse 19 4
      6 isse 19 5
      7 isse 20 6
      8 match 22 $1+24
      9 icm 17 (order 0 word)
      10 isse 19 9 (order 1 word)
      11 icm 13 (sparse with gaps 1-3)
      12 icm 13
      13 icm 13
      14 icm 14 (pic)
      15 mix 16 0 15 24 255 (mix orders 1 and 0)
      16 mix 8 0 16 10 255 (including last mixer)
      17 mix2 0 15 16 24 0
      18 sse 8 17 32 255 (order 0)
      19 mix2 8 17 18 16 255
      20 sse 16 19 32 255 (order 1)
      21 mix2 0 19 20 16 0
    hcomp
      c++ *c=a b=c a=0 (save in rotating buffer)
      d= 2 hash *d=a b-- (orders 1,2,3,4,5,7)
      d++ hash *d=a b--
      d++ hash *d=a b--
      d++ hash *d=a b--
      d++ hash *d=a b--
      d++ hash b-- hash *d=a b--
      d++ hash *d=a b-- (match, order 8)
      d++ a=*c a&~ 32 (lowercase words)
      a> 64 if
        a< 91 if (if a-z)
          d++ hashd d-- (update order 1 word hash)
          *d<>a a+=*d a*= 20 *d=a (order 0 word hash)
          jmp 9
        endif
      endif
      (else not a letter)
        a=*d a== 0 ifnot (move word order 0 to 1)
          d++ *d=a d--
        endif
        *d=0  (clear order 0 word hash)
      (end else)
      d++
      d++ b=c b-- a=0 hash *d=a (sparse 2)
      d++ b-- a=0 hash *d=a (sparse 3)
      d++ b-- a=0 hash *d=a (sparse 4)
      d++ a=b a-= 212 b=a a=0 hash
        *d=a b<>a a-= 216 b<>a a=*b a&= 60 hashd (pic)
      d++ a=*c a<<= 9 *d=a (mix)
      d++
      d++
      d++ d++
      d++ *d=a (sse)
      halt
    post
      0
    end


ENVIRONMENT

Temporary files are placed in TEMP if defined, or else in TMPDIR if defined, or else in /tmp.

In Windows, the default number of threads (set by -t) is NUMBER_OF_PROCESSORS. In Linux, the number of lines of the form "Processor : 0", "Processor : 1",... in /cpu/procinfo is used instead.


STANDARDS

See the ZPAQ level 1 specification in section AVAILABILITY . It is anticipated that future levels (ZPAQ-2, ZPAQ-3, etc.) will be backward compatible, such that newer levels can read archives produced by older programs.


AVAILABILITY

http://mattmahoney.net/zpaq/


BUGS

There is no GUI.

ZPAQL is not very user friendly as a language because it is designed for high speed and compactness.

It is possible to add the same file with different names, such as file.txt, FILE.TXT, dir/../file.txt and so on. This can cause collisions during extraction with unpredictable results.

There are no commands to extract part of an archive bigger than one file and smaller than all of it.

Extraction might overwrite a file that has write permission but not read permission.

Timestamps and other file attributes are not preserved. If you need this, make a .tar.zpaq file.

Temporary files might not be deleted if the program is interrupted or if an error occurs. Temporary files have the form zpaqtmppid* in the temporary directory, where pid is the process ID.


SEE ALSO

See especially lrzip(1) which uses ZPAQ algorithm (although in an incompatible archive format).

bzip2(1) gzip(1) lrzip(1) lzop(1) lzma(1) p7zip(1) rzip(1) unace(1) unrar(1) unzip(1) unzaq(1) zip(1) libzpaq(3)


AUTHORS

zpaq uses code from libzpaq v4.00 and libdivsufsort-lite v2.01.

zpaq v4.03 is copyright (C) 2011, Dell Inc. It is written by Matt Mahoney. It is licensed under GPL v3, or at your option, any later version. For information on the license, see <http://www.gnu.org/copyleft/gpl.html>. Program was written by Matt Mahoney <matmahoney@yahoo.com>.

libzpaq v4.00 is copyright (C) 2011, Dell Inc. It is written by Matt Mahoney. It is licensed under a modified MIT license allowing unrestricted use. See the source code for license text.

libdivsufsort-lite v2.01 is copyright (C) 2003-2008, Yuta Mori. It is licensed under the MIT license. See the source code for license text.