Matt Mahoney
mmahoney@cs.fit.edu
Last update: Mar. 25, 2000
This FAQ is for Fast Text Compression with Neural Networks.
A predictive encoder has two parts (see below). The first part is the predictor: it assigns to each character in the alphabet a probability that it will occur next. The second part is the encoder. It peeks at the actual next character and assigns it a code based on the probability estimated by the predictor. The idea is to assign short codes to the most likely characters, so that the output is shorter on average.
Compression                          P(a) = .04
                   +-----------+     P(b) = .003    +---------+
the cat in th_ --> | Predictor | --> ...        --> | Encoder | --> X
                   +-----------+     P(e) = .3      +---------+      |
                                     ...                ^            |
                                     e -----------------+            |
                                                                     |
                                                      +--------------+
Decompression                        P(a) = .04       v
                   +-----------+     P(b) = .003    +---------+
the cat in th_ --> | Predictor | --> ...        --> | Decoder | --> e
 ^                 +-----------+     P(e) = .3      +---------+      |
 |                                   ...                             |
 +-------------------------------------------------------------------+
During decompression, the predictor makes an identical series of
probability estimates for each character. But now the decoder does
the opposite of the encoder. It matches the code to what the encoder
would have assigned to each character in order to recover the
original character.
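To make the symmetry concrete, here is a toy, self-contained sketch (not the actual p5.cpp or p6.cpp code). The predictor here is just an order-0 counter, and the "encoder" writes each character's rank in the predictor's ordering; the decoder inverts it by running the identical predictor. A real coder would turn high-probability ranks into short bit strings.

  // Toy demonstration of the symmetric predictor idea (illustration only).
  #include <algorithm>
  #include <cassert>
  #include <iostream>
  #include <string>
  #include <vector>

  // Order-0 adaptive predictor: ranks characters by how often they occurred.
  struct Predictor {
    std::vector<int> count = std::vector<int>(256, 0);
    std::vector<int> ranking() const {            // most frequent first
      std::vector<int> r(256);
      for (int i = 0; i < 256; ++i) r[i] = i;
      std::stable_sort(r.begin(), r.end(),
          [&](int a, int b) { return count[a] > count[b]; });
      return r;
    }
    void update(int c) { ++count[c]; }
  };

  int main() {
    std::string input = "the cat in the hat";

    // Compression: replace each character by its predicted rank.
    Predictor enc;
    std::vector<int> code;
    for (unsigned char c : input) {
      std::vector<int> r = enc.ranking();
      code.push_back((int)(std::find(r.begin(), r.end(), (int)c) - r.begin()));
      enc.update(c);
    }

    // Decompression: the identical predictor inverts each rank.
    Predictor dec;
    std::string output;
    for (int k : code) {
      output += (char)dec.ranking()[k];
      dec.update((unsigned char)output.back());
    }

    assert(output == input);
    std::cout << "recovered: " << output << "\n";
  }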
0                       .7   .8     1
+-----------------------------------+
|  a  |b| c|d|  e  | ... |  t | ....|
+-----------------------------------+
                         /    \
                       /        \
.7            .74   .76            .8
+-----------------------------------+
|  a  ||||  e  ||  h  |  i  |||| o|.|
+-----------------------------------+
                /     \
               /       \
.74 .746  .752                    .76
+-----------------------------------+
|  a  ||  e  ||   i   |     ..      |   "the" = .75 (.11 in binary)
+-----------------------------------+
The final output is a number, expressed as a binary fraction, from
anywhere within the resulting range. In the example above, we could
pick 0.75 (.11 in binary) because it is between .746 and .752. Thus, the
steps to encode "the" are:

  1. Start with the range [0, 1).
  2. Narrow it to [.7, .8), the part assigned to t.
  3. Narrow that to [.74, .76), the part assigned to h given t.
  4. Narrow that to [.746, .752), the part assigned to e given th.
  5. Output a short binary fraction inside the final range, such as .11 (.75).
In our example, we have P(the) = P(t)P(h|t)P(e|th) = 0.1 x 0.2 x 0.3 = 0.006, and log2 1/0.006 = 7.38 bits. We can always find an 8-bit code within any range of width 0.006 because 8-bit binary fractions occur every 1/2^8, or about every 0.004.
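The narrowing can be sketched in a few lines of C++. The cumulative offsets below (.7, .4, .3) are read off the diagram above; a real coder would maintain cumulative probabilities for the whole alphabet, supplied by the predictor.

  // Interval narrowing for the string "the", with the example probabilities
  // hard-coded for illustration.  Each character keeps the part of
  // [low, high) assigned to it.
  #include <cmath>
  #include <cstdio>

  int main() {
    double low = 0.0, high = 1.0;

    // (cumulative probability below the char, probability of the char)
    struct Step { const char* ctx; double cum, p; };
    Step steps[] = {
      {"t",    0.7, 0.1},   // P(t) = .1, giving [.7, .8)
      {"h|t",  0.4, 0.2},   // P(h|t) = .2, giving [.74, .76)
      {"e|th", 0.3, 0.3},   // P(e|th) = .3, giving [.746, .752)
    };

    for (const Step& s : steps) {
      double range = high - low;
      high = low + range * (s.cum + s.p);
      low  = low + range * s.cum;
      std::printf("after %-4s  [%.4f, %.4f)\n", s.ctx, low, high);
    }

    // Any binary fraction inside the final range will do; its length is
    // about log2(1/P(the)) bits, so 8 bits always suffice here.
    std::printf("needs about %.2f bits\n", std::log2(1.0 / (high - low)));
  }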
The proof is easy for the simple case where all strings are equally likely. If there are n possible strings each with probability 1/n, then we would have to assign codes (in binary) of 0 through n-1. In binary, the number n-1 has ⌈log2 n⌉ bits.
Samuel Morse developed one of the first compression codes when he assigned short codes like "." for e and "-" for t, and long codes like "--.." for z.
For instance, suppose we had to compress strings like acbcaccbbaac where a, b, and c each occur with probability 1/3. We might use the following code:

  a = 0
  b = 10
  c = 11

This averages (1 + 2 + 2)/3 = 1.67 bits per character (bpc).
We could do better by coding two characters at a time:

  aa = 000   ba = 011   ca = 110
  ab = 001   bb = 100   cb = 1110
  ac = 010   bc = 101   cc = 1111

Seven of the nine pairs get 3 bits and two get 4, for an average of (7 x 3 + 2 x 4)/9 = 3.22 bits per pair, or 1.61 bpc.
Arithmetic encoding assigns a single code to the entire string, so no more than one bit is wasted. If we have n characters in string x, then P(x) = 1/3^n, and the best we can do is (1/n) log2 1/P(x) = log2 3 = 1.585 bpc.
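A few lines of arithmetic confirm the three figures (the averages assume the example tables above):

  // Bits per character of the three schemes above.
  #include <cmath>
  #include <cstdio>

  int main() {
    double fixed = (1 + 2 + 2) / 3.0;          // a=0, b=10, c=11
    double pairs = (7 * 3 + 2 * 4) / 9.0 / 2;  // 7 three-bit and 2 four-bit pair codes
    double arith = std::log2(3.0);             // arithmetic coding limit
    std::printf("one char at a time:  %.3f bpc\n", fixed);  // 1.667
    std::printf("two chars at a time: %.3f bpc\n", pairs);  // 1.611
    std::printf("arithmetic coding:   %.3f bpc\n", arith);  // 1.585
  }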
Some loss of efficiency occurs because we have to round off probabilities in order to divide the range [x1, x2] on integer boundaries. In practice, we lose less than 0.0001 bpc.
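For illustration, here is one way the split can be done with 32-bit integer arithmetic. The 16-bit probability scale is an assumption for this sketch, not the exact scheme used by the program; the rounding of P to a multiple of 1/65536 is where the small loss comes from.

  // Subdividing an integer range in proportion to a probability (sketch).
  #include <cstdint>
  #include <cstdio>

  int main() {
    uint32_t low = 0, high = 0xFFFFFFFFu;          // current range, 32-bit integers
    uint32_t p16 = (uint32_t)(0.3 * 65536 + 0.5);  // P = .3, rounded to 16 bits

    // Give the bottom p16/65536 of the range to the predicted character.
    uint64_t range = (uint64_t)(high - low) + 1;
    uint32_t mid = low + (uint32_t)((range * p16) >> 16) - 1;

    std::printf("if it occurs:   [%08x, %08x]\n", (unsigned)low, (unsigned)mid);
    std::printf("if it does not: [%08x, %08x]\n", (unsigned)(mid + 1), (unsigned)high);
  }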
For compressing text, solving the prediction problem (language modeling) implies passing the Turing test for artificial intelligence, a problem that has remained unsolved since Turing first proposed it in 1950. Shannon estimated (also in 1950) that the entropy of written English text (i.e. the best compression possible in theory) is between 0.6 and 1.3 bpc. (My research suggests about 1 bpc.) The best language model, one by Rosenfeld in 1996, achieved about 1.2 bpc on hundreds of megabytes of newspaper articles. On smaller files, 2 bpc is typical of good compressors, and 3 bpc for popular formats such as zip, gzip, and compress.
The problem is even harder than this for general purpose compression. For example, suppose we take a 1 gigabyte file of zero bits and encrypt it with triple DES (or some other strong encryption) using the key "foobar". The output will appear to be a random string of 0's and 1's, and no known compression program (including mine) will obtain any compression on it whatsoever. Yet, I just described the file exactly in the second sentence of this paragraph, in a lot less than one gigabyte. A general purpose compression program should therefore do at least as well as my description, but that implies that it must be capable of breaking DES and every other known encryption system.
It gets worse. It can be proven that there is no compression algorithm that will compress every input string. If there were, then the compressed code could be compressed again, obtaining more compression. This could be repeated until the code had length 0. Obviously, this code could not be decompressed, since any string could have produced it.
In short, there is no best compression algorithm for all types of data.
The Turing test for AI was proposed in 1950. Turing stated that if a machine is indistinguishable from a human communicating through a terminal, then the machine has AI.
Humans are very good at ranking probabilities of text strings, for instance P(the big dog) > P(dog big the) > P(dgo gbi eht). This is also true of dialogues or transcripts, for instance, P(Q: What color are roses? A: red) > P(Q: What color are roses? A: blue). If a machine knew this distribution, it could generate responses to an arbitrary question Q such that the response A would have the distribution P(A|Q) = P(QA)/P(Q). By definition of P, this is exactly how a human would respond.
(Actually P varies among humans, but this works to the advantage of the machine. For details, see my paper Text Compression as a Test for Artificial Intelligence, 1999 AAAI Proceedings).
This method is known as PPM, or Prediction by Partial Match. Some of the best compression programs, rkive and boa, use a variant called PPMZ, developed by Bloom in 1997.
A neural network also resembles the one known working language model, the human brain. Developing a system like this might lead to a better understanding of how humans process language.
To make a prediction, such as P(e|th) (that th will be followed by e), all of the contexts that are present (1, h, and th) are set to 1, and all others to 0. (The input labeled "1", representing the 0-order context, is always on). Then the outputs are calculated as follows.
  yj = g(SUMi wi,j xi),  where g(x) = 1/(1 + e^-x)
Then yj is the probability that character j is next. For example, P(e|th) = g(we + whe + wthe).
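Here is a sketch of that forward pass, using a dense weight matrix for clarity. The real program stores weights only for active, hashed contexts, and the outputs are not forced to sum to 1; the zero weights here are just placeholders.

  // Forward pass y_j = g(SUM_i w_ij x_i) for one prediction.  The active
  // inputs are the contexts present ("", "h", "th"); everything else is 0,
  // so only the weights from active inputs matter.
  #include <cmath>
  #include <cstdio>
  #include <vector>

  double g(double x) { return 1.0 / (1.0 + std::exp(-x)); }  // squashing function

  int main() {
    const int kInputs = 3;    // order-0, "h", "th" (all active here)
    const int kOutputs = 256; // one output per possible next character

    // w[i][j]: weight from context i to character j (zeros for illustration).
    std::vector<std::vector<double> > w(kInputs, std::vector<double>(kOutputs, 0.0));
    std::vector<double> x(kInputs, 1.0);  // all three contexts are present

    // y[j] = g(sum_i w[i][j] * x[i]) = estimated P(next char = j | contexts)
    std::vector<double> y(kOutputs);
    for (int j = 0; j < kOutputs; ++j) {
      double sum = 0;
      for (int i = 0; i < kInputs; ++i) sum += w[i][j] * x[i];
      y[j] = g(sum);
    }
    std::printf("P(e|th) = %.3f\n", y['e']);  // g(w_e + w_he + w_the)
  }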
The g(x) function has a range of (0, 1), so it always represents a valid probability. It is an increasing function. For instance, g(-1) = 0.27, g(0) = 0.5, and g(1) = 0.73, and it approaches 0 and 1 in the limits.
After the actual next character is seen, the weights are adjusted to reduce the error:

  wi,j <- wi,j + n xi Ej

where Ej = P(j) - yj is the error, the difference between what actually occurred (P(j) is 1 if j was the next character and 0 otherwise) and the predicted probability yj, and the constant n is the learning rate.
Thus, in our example, if the next character is e, then we increase we, whe, and wthe, and decrease all of the other weights connected to the inputs 1, h, and th. Note how this increases the predicted probability of e the next time any of these contexts occur again.
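A sketch of one prediction-and-update cycle, with a fixed learning rate (called eta below, the constant n above) and the same dense layout as the previous sketch; it shows the predicted probability of e rising after e is observed.

  // Delta-rule update after the actual character is seen:
  //   w_ij <- w_ij + eta * x_i * E_j,  E_j = (j == actual ? 1 : 0) - y_j
  // Weights into the actual character go up; all other weights from the
  // active inputs go down.
  #include <cmath>
  #include <cstdio>
  #include <vector>

  double g(double x) { return 1.0 / (1.0 + std::exp(-x)); }

  double predict(const std::vector<std::vector<double> >& w,
                 const std::vector<double>& x, int j) {
    double sum = 0;
    for (size_t i = 0; i < x.size(); ++i) sum += w[i][j] * x[i];
    return g(sum);
  }

  int main() {
    std::vector<std::vector<double> > w(3, std::vector<double>(256, 0.0));
    std::vector<double> x(3, 1.0);       // contexts "", "h", "th" all active
    double eta = 0.5;                    // fixed learning rate (illustration)
    int actual = 'e';                    // the character that occurred

    std::printf("P(e|th) before: %.3f\n", predict(w, x, 'e'));
    for (int j = 0; j < 256; ++j) {
      double e = (j == actual ? 1.0 : 0.0) - predict(w, x, j);   // error E_j
      for (size_t i = 0; i < x.size(); ++i)
        w[i][j] += eta * x[i] * e;
    }
    std::printf("P(e|th) after:  %.3f\n", predict(w, x, 'e'));   // increased
  }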
We can do better than this. First, note that if only one input is active, then we can solve for the weights directly. If the desired output is p, then for the weight between them,
  w = g^-1(p) = ln(p/(1 - p))
If we let p = N(1)/N (the fraction of times that the output was 1 in the past, where N = N(0) + N(1)), then we can get the same effect by setting the learning rate to (N(0) + N(1)) / (N(0) N(1)). (See my paper for proof.) This requires that we store along with each weight the counts N(0) and N(1), the number of times the output was a 0 or 1 when the input was active. (These are called c0 and c1 in p5.cpp and p6.cpp).
In practice we initialize the counts to 0.5 instead of 0 to avoid probabilities of 0 or 1 (and an initially infinite learning rate).
Since the inputs are neither fully independent nor fully dependent, NL should be somewhere between 1 and 1/m. I found that values around 1/sqrt(m) (about 0.4 for m = 5) work well in practice for text compression.
For example, if the context th has been followed equally often by a and e, as in

  that thatch thaw the theatre theft th_

then it will assign the same probability to a and e, even though the most recent statistics suggest that e is more likely.
The fix is to add a constant NS to the learning rate:

  NS + NL x (N(0) + N(1)) / (N(0) N(1))

The effect is to bias the statistics in favor of the last 1/NS occurrences of the context, "forgetting" earlier examples.
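A sketch of the per-weight adaptive rate, with counts initialized to 0.5 as described above; the NS and NL values are just examples taken from the text and the table below, and the error is faked to exercise the update.

  // Per-weight adaptive learning rate.  Each weight keeps counts of how
  // often the output was 0 or 1 while its input was active (c0 and c1 in
  // p5.cpp and p6.cpp).  The NL term is the count-based rate from above;
  // the NS term favors roughly the last 1/NS occurrences of the context.
  #include <cstdio>

  struct Weight {
    double w  = 0.0;
    double n0 = 0.5, n1 = 0.5;   // counts of past 0 and 1 outcomes
  };

  // One training step for a weight whose input is active (x = 1).
  void train(Weight& wt, int bit, double error, double NS, double NL) {
    double eta = NS + NL * (wt.n0 + wt.n1) / (wt.n0 * wt.n1);
    wt.w += eta * error;                   // w <- w + eta * x * E, with x = 1
    if (bit) wt.n1 += 1; else wt.n0 += 1;  // then record the outcome
    std::printf("eta = %.3f  w = %.3f  counts = (%.1f, %.1f)\n",
                eta, wt.w, wt.n0, wt.n1);
  }

  int main() {
    Weight wt;
    double NS = 0.15;   // short-term rate for a 2-byte context (from the table)
    double NL = 0.4;    // long-term rate, about 1/sqrt(m) for m = 5 contexts
    for (int t = 0; t < 5; ++t)
      train(wt, 1, 0.3, NS, NL);  // pretend the bit was 1 and the error was +0.3
  }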
The values of NS depend on the context length:

  Context Length    NS
        1          0.08
        2          0.15
        3          0.2
        4          0.2
        5          0.2
In text, character frequencies tend to be constant (e is the
most common letter in nearly all forms of text). This is not true
for word-length contexts. Here, the recent occurrence of a word
makes it likely to appear again. This is called the "cache" effect.
This is a partially constrained problem over the joint distribution P(x1, x2, y). There is not enough information to solve the problem. We suspect that the probability will be high (as ME predicts), but in fact it could be zero without violating any of the given constraints.
For the more general case where we wish to find P(y|X) = P(y|x1, x2, ..., xn), the Maximum Entropy (ME) solution can be found by a process called Generalized Iterative Scaling (GIS), and has the form
P(y|X) = (1/Z) exp(w0 + SUMi wi xi)
where the wi are constants (weights) to be solved, and Z is a constant chosen to make the probabilities add up to 1. (See Rosenfeld 1996, or Manning and Schütze 1999).
GIS is essentially an algorithm for adjusting the weights, wi, until the constraints are met, similar to neural network training. In fact, when we solve for Z, we have our neural network equation.
  P(y|X) = g(w0 + SUMi wi xi),  where g(x) = 1/(1 + e^-x)
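To see why, assume the features are attached to the outcome y = 1, so the y = 0 term contributes exp(0) = 1 to Z. Then Z = exp(w0 + SUMi wi xi) + 1, and dividing through gives exactly g. A short numerical check:

  // Numerical check that solving for Z in the maximum entropy form gives
  // the sigmoid: with s = w0 + sum_i w_i x_i and Z = exp(s) + 1,
  // (1/Z) exp(s) = 1/(1 + exp(-s)) = g(s).
  #include <cmath>
  #include <cstdio>

  int main() {
    for (double s = -2.0; s <= 2.0; s += 1.0) {
      double maxent  = std::exp(s) / (std::exp(s) + 1.0);  // (1/Z) exp(s)
      double sigmoid = 1.0 / (1.0 + std::exp(-s));         // g(s), the network output
      std::printf("s = %+.0f   maxent = %.4f   g(s) = %.4f\n", s, maxent, sigmoid);
    }
  }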
We don't always have to use GIS. For our original example, we can solve for the weights directly to confirm our intuition.
For the longer contexts, I used a slightly different method to avoid having to do arithmetic on numbers longer than 32 bits, but the effect is the same.
For the two byte context (65536 possibilities), I gave each context its own input unit, so no hash function was needed. For the longer contexts, I mapped them into the same space, so it is possible that two contexts of different length might share the same input unit.
Another possibility is to use a data structure that avoids collisions entirely, such as a tree or a bucket hash table. This would improve compression on small files, but it is not a complete solution, since any such data structure would eventually fill up or exhaust memory.
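For illustration, here is a sketch of hashing contexts of different lengths into one table of input units; the hash function and table size are placeholders, not the ones used in the program.

  // Mapping longer contexts into a fixed table of input units.  Different
  // contexts can collide and share a unit, which costs a little compression
  // but keeps memory bounded.
  #include <cstdio>
  #include <string>

  const unsigned kTableSize = 1 << 22;   // about 4 million input units (assumed)

  unsigned context_unit(const std::string& context) {
    unsigned h = 2166136261u;                      // FNV-1a style hash
    for (unsigned char c : context) h = (h ^ c) * 16777619u;
    return h % kTableSize;
  }

  int main() {
    // Contexts of length 3, 4, 5, ... all map into the same space.
    std::printf("' th'   -> unit %u\n", context_unit(" th"));
    std::printf("'n th'  -> unit %u\n", context_unit("n th"));
    std::printf("'in th' -> unit %u\n", context_unit("in th"));
  }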
Bit prediction is otherwise equivalent to character prediction. For instance, the ASCII code for e is 01100101, so we can estimate its probability as the product of the probabilities of each bit:
P(e|th) = P(0|th) P(1|th,0) P(1|th,01) P(0|th,011) P(0|th,0110) P(1|th,01100) P(0|th,011001) P(1|th,0110010).
There is a separate input unit for each of these 8 contexts.
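A sketch of how the eight bit predictions combine; predict_bit here is a placeholder standing in for the network's outputs.

  // Estimating P(e|th) as a product of 8 bit predictions, one per bit of
  // the ASCII code 01100101.  Each partial bit string ("", "0", "01", ...)
  // together with the th context selects its own input unit.
  #include <cstdio>

  // Placeholder bit predictor: probability that the next bit is 1, given
  // the character context and the bits of the current character so far.
  double predict_bit(const char* context, int bits_so_far, int nbits) {
    (void)context; (void)bits_so_far; (void)nbits;
    return 0.6;   // stand-in value; the network would supply this
  }

  int main() {
    const int e = 0x65;          // 'e' = 01100101
    double p = 1.0;
    int seen = 0;
    for (int i = 7; i >= 0; --i) {            // most significant bit first
      int bit = (e >> i) & 1;
      double p1 = predict_bit("th", seen, 7 - i);
      p *= bit ? p1 : (1.0 - p1);             // P(this bit | th, earlier bits)
      seen = (seen << 1) | bit;
    }
    std::printf("P(e|th) = %.4f\n", p);       // product of the 8 bit probabilities
  }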
Bloom, Charles, ppmz v9.1 (1997), http://www.cco.caltech.edu/~bloom/src/ppmz.html
Manning, Christopher D., and Hinrich Schütze, (1999), Foundations of Statistical Natural Language Processing, Cambridge MA: MIT Press.
Rosenfeld, Ronald (1996), "A maximum entropy approach to adaptive statistical language modeling", Computer Speech and Language, 10, see also http://www.cs.cmu.edu/afs/cs/user/roni/WWW/me-csl-revised.ps.
Shannon, Claude, and Warren Weaver (1949), The Mathematical Theory of Communication, Urbana: University of Illinois Press.
Shannon, Claude E. (1950), "Prediction and Entropy of Printed English", Bell Sys. Tech. J (3) p. 50-64.
Turing, A. M. (1950), "Computing Machinery and Intelligence", Mind, 59:433-460.