Making Small PDF Documents

July 25, 2013 Steve Hawley

There was a terrific question on Stack Overflow asking what the smallest possible valid PDF is, which I answered.  First, I hand-built what I thought was the smallest spec-compliant PDF, and then I started tearing bits out until I got the result down to 67 bytes.  It’s not as small as the smallest GIF, but it’s in the same spirit.
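As a rough illustration of the hand-built starting point, here is a minimal Python sketch that writes a one-page blank PDF with a catalog, a page tree, a single page, and a classic cross-reference table. This is not the 67-byte file from the Stack Overflow answer. The function name build_minimal_pdf and the choice of the PDF 1.4 header are assumptions made for the example.

```python
def build_minimal_pdf(width=3, height=3):
    """Return the bytes of a blank one-page PDF (illustrative sketch only)."""
    # Three indirect objects: catalog, page tree, and the single page.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /Resources << >> "
        b"/MediaBox [0 0 %d %d] >>" % (width, height),
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj"
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)

    # Cross-reference table: every entry is exactly 20 bytes long.
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


if __name__ == "__main__":
    pdf = build_minimal_pdf()
    print(len(pdf), "bytes")
```

Most of the bytes here are structure (the object wrappers, the cross-reference table, and the trailer), which is the material that gets torn out when chasing the smallest file a reader will still open.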

For grins, I wanted to see what DotPdf would do without too much hinting.  I get 920 bytes by creating a PDF with this code:

PdfGeneratedDocument doc = new PdfGeneratedDocument();
doc.EmbedGeneratedContent = false;
// A 3 x 3 page (units of 1/72 inch)
doc.Pages.Add(new PdfGeneratedPage(3, 3));
// Leave the Producer string out of the document metadata
doc.Metadata.Producer = null;

That is not too bad.  The generator could be stingier: it does a little bit of pretty printing and also makes indirect references for some objects that don’t strictly need them.  I’ve left it that way because making them references keeps the code less error-prone and more flexible.
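To give a feel for what an unneeded reference costs, here is a small Python sketch that compares writing a value inline with writing it as a separate indirect object. The choice of a /MediaBox array and object number 4 is hypothetical; the post does not say which objects DotPdf turns into references.

```python
# Each object listed in a classic cross-reference table costs one 20-byte entry.
XREF_ENTRY_SIZE = 20


def direct_cost(value):
    """Bytes used when the value is written inline in its parent dictionary."""
    return len(value)


def indirect_cost(value, number):
    """Bytes used when the value becomes its own indirect object."""
    reference = b"%d 0 R" % number  # what the parent stores instead
    body = b"%d 0 obj\n%s\nendobj\n" % (number, value)  # the object itself
    return len(reference) + len(body) + XREF_ENTRY_SIZE


value = b"[0 0 3 3]"  # hypothetical example value
overhead = indirect_cost(value, 4) - direct_cost(value)

if __name__ == "__main__":
    print("extra bytes for the indirect form:", overhead)
```

A few dozen bytes per object is small, but it adds up in a file that is only a few hundred bytes long.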

If I send the following PostScript program through Adobe Distiller, I get a 5228 byte file using the “Smallest File Size” preset:

<< /PageSize [3 3] >> setpagedevice showpage

Looking at the PDF, most of the file is XMP metadata, which is not necessary by any interpretation of the spec.

If I do the same conversion with Ghostscript, I get a 2328 byte file, which again is dominated by XMP metadata (without the metadata, the Ghostscript file would likely be smaller than mine).
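One way to check how much of a file is XMP is to look for the XMP packet wrapper in the raw bytes. The Python sketch below assumes the metadata stream is stored uncompressed, so the `<?xpacket begin` and `<?xpacket end` markers appear literally in the file. The function name xmp_share is made up for the example.

```python
def xmp_share(pdf_bytes):
    """Return the fraction of the file taken up by the first XMP packet.

    Assumes the packet is stored uncompressed; returns 0.0 if no complete
    packet is found.
    """
    start = pdf_bytes.find(b"<?xpacket begin")
    if start < 0:
        return 0.0
    end_marker = pdf_bytes.find(b"<?xpacket end", start)
    if end_marker < 0:
        return 0.0
    close = pdf_bytes.find(b"?>", end_marker)
    if close < 0:
        return 0.0
    packet_length = close + 2 - start  # include the closing "?>"
    return packet_length / len(pdf_bytes)


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        data = handle.read()
    print("XMP share: {:.0%}".format(xmp_share(data)))
```

Run it with the path to a PDF as the only argument. A result of 0% can mean either that there is no XMP or that the metadata stream is compressed.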

“Steve,” you say, “why are you using [3 3] as the page size?”  It’s a trick – the smallest spec-allowable page size is 3/72” x 3/72”.  By doing this, the /MediaBox entry in the page dictionary will be smaller – it will be [0 0 3 3] instead of [0 0 612 792], which will save 4 bytes.  I could have used any size from 3 to 9, inclusive, and gotten the same effect, since every single-digit number takes one byte.
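The arithmetic is easy to check. This short Python sketch builds the two /MediaBox arrays as they would appear in the file and compares their lengths; the helper name media_box is made up for the example.

```python
def media_box(width, height):
    """Return the /MediaBox array as it would be written in the PDF."""
    return b"[0 0 %d %d]" % (width, height)


letter = media_box(612, 792)  # US Letter, in units of 1/72 inch
tiny = media_box(3, 3)        # the 3 x 3 page used above

saved = len(letter) - len(tiny)

# Every single-digit size from 3 to 9 is written in the same number of bytes.
same_length = all(len(media_box(n, n)) == len(tiny) for n in range(3, 10))

if __name__ == "__main__":
    print("bytes saved:", saved)
    print("sizes 3 through 9 all equal in length:", same_length)
```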

For grins, I took a completely blank Word document and ran it through the Adobe PDF creator, and it did two surprising things.  First, it put a space on the page (there was none – the document was completely empty).  Second, it embedded a subset of the font for that single space character (this is less surprising, since the moment they placed the character, they were committed to embedding the font).  The result is a 23.8K file, most of which is a font for placing a space.

I don’t have any problem with any of these file sizes.  Each PDF producer was written to be a very general tool.  None of them knows a priori what kind of document or content will be chosen for representation in PDF.  Each tool has to pick a particular spot in the continuum for size, speed, correctness, error-correction, and so on.  In my case, I picked a spot where it would be very difficult to generate incorrect or spec-violating PDF.  To that end, I also chose to make the output pretty, because I knew that I would be reading my own output every now and again to figure out if I made a mistake.  Adobe picked a movable spot (as there are different profiles that generate various types of PDF), and none of the choices is incorrect.

About the Author

Steve Hawley

Steve has been with Atalasoft since 2005. Not only is he responsible for the architecture and development of DotImage, he is one of the masterminds behind Bacon Day. Steve has over 20 years of experience with companies like Bell Communications Research, Adobe Systems, Newfire, and Presto Technologies.
