Compressing PDF Documents for Archive

August 13, 2013 Steve Hawley

There was an interesting article by David Kriesel, linked from Hacker News, about an issue in which PDF documents scanned on Xerox scanners end up with numbers randomly swapped.  At this writing, Xerox is working on a patch – awesome – this is exactly what they should be doing.  The issue, in brief, is that the software built into the unit that creates PDF documents uses JBIG2 compression.  JBIG2 is a (potentially) lossy compression algorithm that looks for nearly identical tiles and replaces all of them with a single tile.  This kind of compression can produce much smaller documents, but it can also create content errors when the “close enough” predicate replaces a ‘6’ with an ‘8’.

Atalasoft, to my knowledge, does not supply Xerox with the software used in their scanners, but Atalasoft does provide a powerful set of PDF creation tools.  So let’s think about the problem of encoding scanned documents for archive.  Your primary concern for archival storage should be primum non nocere – first, do no harm.  This means that the document should be stored in a way that keeps it readable.  There are four ways to address that:

  1. use sufficient scanner resolution
  2. use sufficient scanner bit depth
  3. use a compression that doesn’t damage the content
  4. use a process to review output

I created a simple document in Word with a phrase in 12, 9, 6, and 3 point Calibri and printed it on a Brother 8870dw printer at its best resolution, 1200x600 dpi, then I scanned the document back in using the Brother scanner at 200, 300 and 600 dpi in black and white and gray.  Here are the results:

[Scan images: 200 dpi black and white; 200 dpi gray; 300 dpi black and white; 300 dpi gray; 600 dpi black and white; 600 dpi gray]

As you would expect, quality improves considerably as the resolution increases, and a grayscale image also improves quality – but the uncompressed gray images are roughly 8x the size of the black and white images.
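The 8x ratio follows directly from the bit depths: an 8-bit gray pixel takes eight times the storage of a 1-bit pixel. A quick sketch of the arithmetic (assuming US Letter pages; the function name is mine, not from the article):

```python
# Uncompressed raster size for a US Letter page (8.5 x 11 in) at a given
# dpi and bit depth.  Gray (8 bits/pixel) is exactly 8x the 1-bit size.
def uncompressed_bytes(dpi, bits_per_pixel, width_in=8.5, height_in=11.0):
    pixels = int(width_in * dpi) * int(height_in * dpi)
    return pixels * bits_per_pixel // 8

for dpi in (200, 300, 600):
    bw = uncompressed_bytes(dpi, 1)
    gray = uncompressed_bytes(dpi, 8)
    print(f"{dpi} dpi: B&W {bw / 1e6:.1f} MB, gray {gray / 1e6:.1f} MB")
```

Note that doubling the resolution quadruples the pixel count, so a 600 dpi gray page is a substantial file before any compression is applied.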

I chose 200 dpi as the bottom end since it is the default fax resolution.  You can see that it is acceptable for 9 point fonts in gray and marginal in black and white.  300 dpi is considered minimally acceptable for documents that go to an OCR engine.  It is acceptable at 6 point in gray and marginal in black and white (I would expect an OCR engine to mess up the 6 point text).  The 600 dpi scan is acceptable down to 6 point.  I consider the 3 point text low-acceptable/marginal in gray and unacceptable in black and white.

The next thing to consider is compression.  For archival purposes, I do not consider it acceptable to use a lossy compression.  Therefore JPEG, JBIG2, and JPEG2000 are all unacceptable, even though they reduce the file size much more (note that if you use the archival form of PDF, PDF/A, JPEG2000 is forbidden, but JPEG and JBIG2 are both allowed).  For 1-bit images, you should use CCITT encoding or Flate encoding, whichever performs better (likely CCITT).  For grayscale or color images, use Flate encoding.
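That selection rule can be sketched in a few lines. This is a hypothetical helper of my own (the PDF filter names `CCITTFaxDecode` and `FlateDecode` are the real ones from the PDF specification), plus a quick demonstration that Flate is fully lossless using Python's `zlib`, which implements the same deflate algorithm:

```python
import zlib

def pick_lossless_codec(bits_per_pixel: int) -> str:
    """Choose a lossless PDF image encoding by bit depth, per the rule above."""
    return "CCITTFaxDecode" if bits_per_pixel == 1 else "FlateDecode"

# Flate is genuinely lossless: decompressing reproduces the exact input bytes,
# so no '6' can ever turn into an '8'.
pixels = bytes(range(256)) * 64          # stand-in for raw gray scanline data
compressed = zlib.compress(pixels, 9)    # Flate == zlib/deflate
assert zlib.decompress(compressed) == pixels
```

Contrast this with JBIG2's symbol matching, where the decompressed page can differ from the scan by design.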

Now comes the tricky part – what if the cost of archiving the larger resultant files is prohibitive?  In that case, there should be a review tool – one that divides pages into categories: acceptable, marginal, and unacceptable.  This would be done by comparing the pixel content of the original image and the resultant PDF page at the same resolution and measuring the amount of error between the two.  Within a certain low range of error, the document would be marked acceptable and archived.  Within a wider range of error, the page would have to be reviewed by a human being and, if necessary, re-encoded.  Within a range of much greater error, the encoding would be automatically rejected and re-encoded with a non-lossy encoding.

The amount of human labor required for review is daunting, and there may be better ways to widen the acceptable band – say, by running an OCR engine on both documents, comparing the output, and looking at the relative confidence of the OCR engine.  You can also reduce the amount of human labor with better review tools – for example, a tool that goes beyond side-by-side display and overlays the differences in selectable highlight colors, or highlights areas of large error.
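The overlay idea is simple to sketch: keep matching pixels as-is and paint every disagreement in a highlight color so the reviewer's eye goes straight to the damage. A toy version over gray pixel sequences (the function name and the default red highlight are mine):

```python
def diff_overlay(orig, rendered, highlight=(255, 0, 0)):
    """Build an RGB overlay for review: matching gray pixels pass through
    unchanged; differing pixels are replaced with a highlight color."""
    return [highlight if a != b else (a, a, a)
            for a, b in zip(orig, rendered)]
```

In a real tool the highlight color would be selectable, and the overlay would be rendered on top of the page image rather than returned as a flat pixel list.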

Human review, however, can be money extremely well-spent, especially if the document content is information that must be stored with no errors, as is the case with medical records, land surveys, or purchases.

All of these steps – PDF encoding selection, comparison, measurement, and review – are possible in DotImage.

About the Author

Steve Hawley

Steve was with Atalasoft from 2005 until 2015. He was responsible for the architecture and development of DotImage, and one of the masterminds behind Bacon Day. Steve has over 20 years of experience with companies like Bell Communications Research, Adobe Systems, Newfire, and Presto Technologies.
