OCR Test Images

The images below are intended to be a severe test of optical character recognition (OCR) software. They are presented in order of increasing difficulty. I intend to post benchmarks of OCR software here but have not run any tests yet. If you would like to submit test results, send me the text output (plain text preferred) for each image you test. Tell me which software, version, and settings you used. Try to use the settings that give the most accurate results. I will post your results and give you credit. Contact me at matmahoney@yahoo.com
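No scoring method is specified on this page, so the following is only a sketch of one common way a submitted text output could be compared with a hand-typed transcription: the character error rate, computed as the Levenshtein edit distance divided by the length of the reference text. The function names and the sample strings are hypothetical and are not taken from any of the test images.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(truth: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the reference text."""
    if not truth:
        raise ValueError("reference text must not be empty")
    return edit_distance(truth, ocr_output) / len(truth)


if __name__ == "__main__":
    # Hypothetical strings, assumed for illustration only.
    truth = "Dow Jones 10,847.41"
    ocr_output = "Dow Jcnes 1O,847.4l"
    print(f"character error rate: {character_error_rate(truth, ocr_output):.3f}")
```

A rate of 0 means the output matches the reference exactly. The rate can exceed 1 if the OCR output contains far more characters than the reference.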

Stock

This text from USA Today was scanned at 200 dots per inch (dpi) in 8 bit grayscale on a Primax Colorado 600p scanner and saved as a JPEG file at 75% quality using MGI PhotoSuite 8.06 (1998). The image contains serif and sans serif fonts in various sizes, some bold, and a logo. The text does not easily fit a language model, but it has some redundancy: it is organized into columns of letters and numbers and is sorted alphabetically. The JPEG files were saved at varying quality settings to introduce artifacts. The original BMP file is the raw, lossless image. Left click a link to view, right click to download.

8 bit grayscale, 200 dpi: bmp jpeg 75% jpeg 50% jpeg 25%

Plaid

This test image contains a variety of stylistic fonts and rotated text, some with an interfering background, that are easily read by humans but difficult for optical character recognition software. The image above was scanned from a newspaper (slightly rotated) and is a 24 bit color JPEG at 150 dpi and 75% quality. Other settings of the same image may be viewed below.

24 bit color, 150 dpi: bmp jpeg 75%
1 bit black and white, 200 dpi: bmp jpeg 75%
8 bit grayscale, 300 dpi: bmp jpeg 75%
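The three settings listed above trade resolution against bit depth, which changes how much raw data the lossless BMP has to hold. The sketch below works through that arithmetic for a hypothetical 4 x 6 inch clipping; the real dimensions of the scan are not stated on this page, so the sizes it prints are illustrative only. It counts pixel data alone (BMP rows are padded to a multiple of 4 bytes) and ignores the file header and any palette.

```python
def scan_pixels(inches: float, dpi: int) -> int:
    """Number of pixels along one side of a scan at the given resolution."""
    return round(inches * dpi)


def bmp_pixel_bytes(width_px: int, height_px: int, bits_per_pixel: int) -> int:
    """Uncompressed BMP pixel data size, excluding headers and palette.

    Each row is padded up to a multiple of 4 bytes.
    """
    row_bytes = (width_px * bits_per_pixel + 31) // 32 * 4
    return row_bytes * height_px


if __name__ == "__main__":
    # Hypothetical clipping size, assumed for illustration only.
    width_in, height_in = 4.0, 6.0
    settings = [
        ("24 bit color", 150, 24),
        ("1 bit black and white", 200, 1),
        ("8 bit grayscale", 300, 8),
    ]
    for label, dpi, bpp in settings:
        w = scan_pixels(width_in, dpi)
        h = scan_pixels(height_in, dpi)
        print(f"{label}, {dpi} dpi: {w} x {h} px, "
              f"{bmp_pixel_bytes(w, h, bpp)} bytes of pixel data")
```

The JPEG versions will be smaller than these figures by an amount that depends on the image content and the quality setting, so no JPEG sizes are estimated here.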

Captcha

The images above are CAPTCHAs used by Google. They are deliberately designed to be hard to read by OCR software.

captcha 1 captcha 2 (jpeg)

Handwriting

This is a partial scan of a lap split sheet used in a footrace. Each pair of rows of numbers in this image was handwritten by a different person.

150 dpi grayscale: bmp jpeg 75%

These images were posted on Jan. 7, 2006 by Matt Mahoney.

Results for OmniPage v14.0 (the current version is v15) by Yan King Yin, June 13, 2006 (link no longer works).

Results for Abbyy Fine Reader 5.0 (this version is about 2 years old; the current version is 8.0) by Yan King Yin, June 13, 2006 (link no longer works).

Results for Tesseract 2.03 (Mac) by Dave Osbourne, July 6, 2009.