Image Compression and Why You Should Care

August 31, 2015, by Steve Hawley

Like it or not, we are working towards an increasingly paperless society. This presents a number of challenges, some of which conflict with each other. In an ideal world, a scanned document would:

  1. Be visually indistinguishable from the original
  2. Take up a minimum amount of storage
  3. Be easy to interchange
  4. Be easy to validate/verify

The problem is that all of these can conflict. If you want visual accuracy, storage will go up. When you drive storage down, you lose accuracy. If you want ease of validation, you're likely to lose elsewhere, because your format and compression choices narrow.

There is no ideal format that will get you all of these.

PDF does strike a middle ground, but in order to get a predictable result from scanned documents, you need to understand image compression.

PDF on its own does not dictate how an image should be compressed. In fact, in the time since I worked on Acrobat, PDF has added two new compression methods.

Further, compression (or, more precisely, filtering) in PDF is tied to embedded data streams, not to images.

In PDF, a data stream may be processed through a set of filters (note that some filters are restricted to images). Those filters may be one or more of the following:

Crypt (image only: no)

Strictly used for encryption of data within a PDF file. In password-protected PDF files, the encryption filter is implicit.

ASCII Hex (image only: no)

This filter encodes binary data so that each byte is represented by two hexadecimal digits. I've not seen this filter used in the wild, and for good reason: it always doubles the amount of data.

ASCII 85 (image only: no)

This is an encoding wherein a set of 85 printable characters is used to encode binary data such that 5 output bytes encode 4 input bytes. This is an increase of 25%, which is better than ASCII Hex. When Acrobat was created, many mail gateways would mangle PDF attachments if they contained non-printable characters or very long lines. ASCII 85 was used to prevent this and was usually applied on top of a filter that reduced the data size (the size comparison after this list shows the effect).

LZW Decode (image only: no)

LZW is a compression technique where strings of bytes are accumulated and replaced with shorter bit patterns. It's like taking all the uses of the word 'and' and replacing them with '&'. LZW offers decent compression, but it has a checkered past due to patent contention, and for many years it was contraindicated to avoid getting caught up in a patent suit. As of 2004, all relevant patents have expired.

Deflate (image only: no)

Deflate was a response to LZW. It's similar in many respects, but doesn't violate the LZW patents. The compression of Deflate is decent.

Run Length Encoding (image only: no, but really only useful for images)

Run length encoding is an obvious mode of compression. It's exactly the kind of scheme that gets assigned to first-year CS students. It was used in the original Macintosh as the compression for MacPaint files. It works best on 1-bit images that are mostly the same bit pattern (white or black). In images that have a 50% gray stipple pattern, it will grow instead of compress (the sketch after this list shows why). The overall compression is not very good.

CCITT G3 and G4 (image only: no, but really only useful for 1-bit images)

This compression was designed for fax machines so that data transmitted from one machine to another didn't take an eternity. It works by compressing runs of bits, not bytes. In G4, the compression works in two dimensions instead of within a single line. The compression is decent.

JPEG (image only: yes)

JPEG compression is the first of the non-obvious compression techniques in that it treats the data as a signal composed of frequencies and attempts to remove the higher (and hopefully non-dominant) frequencies, leaving behind an easier-to-compress data stream. The problem with JPEG is that when you decompress the data, it will likely differ from the source data. In this list, it is the first compression technique that is considered 'lossy'. The compression is good and is, to a degree, selectable, but as the compression increases, so does the data loss.

JBIG2 (image only: yes)

JBIG2 is a compression technique wherein an image is decomposed into a series of symbols. Symbols are grouped into sets that are the same (or close enough), and a representative symbol is selected for each set. Each occurrence in the image of a member of that set is erased and its location is noted, so that on decompression the representative symbol will be placed there. JBIG2 may or may not be lossy and can be configured either way.

JPEG2000 (image only: yes)

An image compression that is similar to JPEG in some respects, but uses wavelet compression. JPEG2000 was designed to be far more flexible than JPEG. It can be lossless or lossy and, unlike the other methods, can be given a target output data size. The compression can be very good, but to achieve acceptable output it may require specific adjustments in the compressor on an image-by-image basis.
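
To make those size tradeoffs concrete, here is a small Python sketch (the language and the sample data are my own illustration, nothing prescribed by PDF) that runs a block of bytes through stand-ins for ASCII Hex, ASCII 85, and Deflate, then layers ASCII 85 on top of Deflate the way the ASCII 85 entry describes. ASCII Hex always doubles the data, ASCII 85 grows it by a quarter, and Deflate shrinks this particular payload dramatically because it is so repetitive.

```python
import base64
import binascii
import zlib

# Illustrative payload: a repetitive byte pattern, roughly like a mostly-white scan.
data = (b"\xff" * 200 + b"\x00" * 56) * 100   # 25,600 bytes

hex_encoded = binascii.hexlify(data)    # ASCII Hex: two output bytes per input byte
a85_encoded = base64.a85encode(data)    # ASCII 85: five output bytes per four input bytes
deflated = zlib.compress(data, 9)       # Deflate: real compression

# ASCII 85 is normally layered on top of a filter that shrinks the data first,
# as described in the ASCII 85 entry above.
deflated_then_a85 = base64.a85encode(deflated)

for name, blob in [
    ("original", data),
    ("ASCII Hex", hex_encoded),
    ("ASCII 85", a85_encoded),
    ("Deflate", deflated),
    ("Deflate + ASCII 85", deflated_then_a85),
]:
    print(f"{name:>20s}  {len(blob):8d} bytes")
```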
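
The run length claim is just as easy to check. The encoder below uses the simplest possible run-length scheme, a count byte followed by the value byte; it is a size-comparison sketch only, not the exact byte layout of PDF's RunLengthDecode filter. Long runs collapse nicely, while a pattern with no byte-level runs doubles in size.

```python
def rle_encode(data: bytes) -> bytes:
    """Simplest possible run-length coding: a count byte (1-255) followed by
    the value byte.  A size-comparison sketch, not PDF's RunLengthDecode layout."""
    out = bytearray()
    i = 0
    while i < len(data):
        value = data[i]
        run = 1
        while i + run < len(data) and data[i + run] == value and run < 255:
            run += 1
        out += bytes([run, value])
        i += run
    return bytes(out)

# A mostly-white 1-bit scan line: long runs of 0xFF compress very well.
mostly_white = b"\xff" * 250 + b"\x00" * 6

# A stipple-like pattern packed so that adjacent bytes differ: every run has
# length 1, so each input byte costs two output bytes and the data doubles.
stipple = b"\xaa\x55" * 128

for name, row in [("mostly white", mostly_white), ("stipple", stipple)]:
    print(f"{name:12s} {len(row):4d} -> {len(rle_encode(row)):4d} bytes")
```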

I know: tl;dr.

It comes down to this: what am I compressing and what are my constraints?

If you don’t care about the visual integrity of your document, you should consider yourself free to use any of the compressions above in any configuration.

If you do care about the visual integrity of your document, you should avoid all of the lossy compressions (JPEG, lossy JBIG2, and lossy JPEG2000), or they should be tuned, if possible, to a lossless mode.

When I make PDF documents for my own use, I usually follow these guidelines:

If the document must be widely consumable and must exactly match the source, I will use CCITT for 1-bit images and Flate for everything else.

If the document must be widely consumable and I want good visual quality, I will use CCITT for 1-bit images, JPEG for pictures, and Flate for non-1-bit text.

If the document will be viewed only with Acrobat or other professional-grade viewers and I care about visual integrity, I will use CCITT or lossless JBIG2 for text, lossless JPEG or lossless JPEG2000 for pictures, and Flate for non-1-bit text.
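
If that decision logic needs to live in code, it can be as small as the sketch below. The function and its parameters are hypothetical, not a DotImage API; the body simply restates the guidelines above using the standard PDF filter names, and a real application would map the result onto whatever compression settings its PDF library exposes.

```python
def choose_filter(bits_per_pixel: int, is_photo: bool,
                  widely_consumable: bool, must_be_exact: bool) -> str:
    """Restate the guidelines above as a lookup.  Hypothetical helper, not a real API."""
    if bits_per_pixel == 1:
        # 1-bit scans: CCITT everywhere; lossless JBIG2 (or CCITT) when the
        # audience is known to use Acrobat or other professional-grade viewers.
        return "CCITTFaxDecode" if widely_consumable else "JBIG2Decode (lossless)"
    if is_photo:
        if not widely_consumable:
            # Visual integrity for a controlled audience: lossless JPEG2000
            # (the guidelines also allow lossless JPEG here).
            return "JPXDecode (lossless JPEG2000)"
        if not must_be_exact:
            # Broad compatibility with good visual quality: plain JPEG.
            return "DCTDecode (JPEG)"
    # Everything else, and photos that must exactly match the source: Flate.
    return "FlateDecode"

# Examples: a widely distributed 1-bit scan, and an archival photograph.
print(choose_filter(1, is_photo=False, widely_consumable=True, must_be_exact=True))
print(choose_filter(24, is_photo=True, widely_consumable=False, must_be_exact=True))
```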

If you are building products with DotImage, and I hope you are, you can make all these choices yourself. The tricky part is how you communicate them to your customers so that they can make appropriate choices through your software.

There was an issue with Xerox recently: they shipped software on multifunction devices that scanned documents directly to PDF and applied lossy JBIG2 compression to the images. The results were unfortunate, but entirely predictable: the output documents contained errors and were no longer accurate.

David Kriesel recently gave a talk about how he discovered this issue and what steps he took to ensure that Xerox addressed it. Although the talk contains some language that is not polite, it is worth sitting through. Not only does it describe the problems with JBIG2 in fine detail, it is also a model for how to handle these kinds of issues with a large company (and how not to handle them if you are that large company), and it illustrates the importance of really solid support, of correct default settings, of clear UI and manuals, and, of course, of document integrity and trust.

About the Author

Steve Hawley

Steve has been with Atalasoft since 2005. Not only is he responsible for the architecture and development of DotImage, he is one of the masterminds behind Bacon Day. Steve has over 20 years of experience with companies such as Bell Communications Research, Adobe Systems, Newfire, and Presto Technologies.
