Designing File Formats

May 18, 2015 Steve Hawley

This article is going to talk about the challenges of supporting file formats, choices that should be made before you write a single line of code, and technical debt.

Let’s consider the purpose of file formats. Typically, they are meant to represent a specific hunk of data in a way that is hopefully convenient to consume and produce. There may be other criteria such as “readable by a small processor” or “uses minimal space” or “easy to extract metadata”.

A good file format defines an encoding for its core data types and then a larger structure that aggregates those types into the representation of the data in the file.
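
As a purely illustrative sketch (the names here are invented, not taken from any real format), a chunk-based layout might define a handful of core types and a single aggregate structure that carries them:

    using System.Collections.Generic;

    // Hypothetical chunk-based layout: a few core types plus one aggregate structure.
    enum FieldType : byte { Int32 = 1, Float64 = 2, Utf8String = 3 }

    class Field
    {
        public FieldType Type;    // which core type this field holds
        public uint Length;       // byte length of Payload
        public byte[] Payload;    // the encoded value itself
    }

    class Chunk
    {
        public string Name;          // four-character chunk identifier, e.g. "META"
        public List<Field> Fields;   // the aggregate: a named collection of typed fields
    }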

Image file formats are among the worst of formats. Many were built ad-hoc and contained some rather unfortunate problems that couldn’t be corrected with extensions. Others contained the ability to have extensions, but there was no owning group whose goal was to ensure that extensions were added in a reasonable way. Building a file format this way is like working with raw meat and refusing to wash your hands before you prep a salad: someone down the line is going to be very unhappy with the outcome.

What typically happens when writing code to interpret file formats is that the consuming code is expected to be as gracious as possible about mistakes and should accept all kinds of cockamamie files. Oddly enough, this standard doesn’t apply to the code that generates the files. Since humans are prone to writing code with errors, we end up with two bad things as a result:

  1. An ecosystem of invalid files that must be readable by any product that advertises itself as capable of handling that file format
  2. Software that is chock full of patch code to handle invalid cases from the swath of garbage files, which in turn makes the consuming software more fragile.

Both things together make for poor user experience.

An example is the TIFF file format, which allows an image to be compressed with any of a variety of compression schemes. Around 1992, JPEG compression was added to TIFF files, but not in a coherent way, and the result has been horrible. Files compressed in this way are called “old-style JPEG” and many people in the TIFF community consider them broken out of the gate. Since the feature was poorly designed and poorly handled, not only are there a number of these files in the ecosystem, there are also a number that were written incorrectly.

This creates technical debt that has to be paid down every time the code that supports broken files has to be tweaked.

HTML has a similar problem, known as tag soup. A lot of that is due to letting humans write HTML: HTML is code, and humans are terrible at writing code.

How can you prevent this from happening?

First, you should design the format. What are you trying to represent? What data types should be exposed? How do you represent aggregates? How will you be able to distinguish your data from someone else’s?

You should make sure that the file format includes information as to what software created the file, when it was created, and by what version. This fingerprinting is essential. Humans are terrible at writing code. Seriously awful. Fueled by hubris and lack of coffee, we write code that makes bad files. When a file has been detected as bad, we’d like someone to blame (and to contact) to fix the problem.
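
What the fingerprint looks like is up to you; as a rough sketch with invented field names, it only needs to capture what wrote the file, which version did it, and when:

    using System;

    // Hypothetical fingerprint block written into every file by the producer.
    class ProducerFingerprint
    {
        public string Application;        // e.g. "AcmeReportWriter"
        public string ApplicationVersion; // e.g. "3.2.1"
        public string LibraryVersion;     // version of the file-writing library itself
        public DateTime CreatedUtc;       // when the file was generated

        public override string ToString() =>
            $"{Application} {ApplicationVersion} (lib {LibraryVersion}) at {CreatedUtc:o}";
    }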

Next, you should write a tool that verifies that any given file is both syntactically and semantically correct, and this tool should be made public as soon as possible.

Wait. What?

I haven’t written any code to read or write files and I need a verifier? Yes.
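
Such a tool doesn’t have to be elaborate. As a rough sketch (the names are invented), it is a routine that runs syntactic checks, then semantic checks, and reports every violation it finds rather than stopping at the first:

    using System.Collections.Generic;
    using System.IO;

    // Hypothetical verifier skeleton: collect every violation, don't stop at the first.
    class Verifier
    {
        public IList<string> Verify(Stream file)
        {
            var problems = new List<string>();

            // Syntactic pass: is the file well formed at the byte/token level?
            // (magic number present, lengths consistent, offsets in bounds, ...)
            CheckSyntax(file, problems);

            // Semantic pass: does the well-formed content actually make sense?
            // (required fields present, values in range, cross-references resolve, ...)
            if (problems.Count == 0)
                CheckSemantics(file, problems);

            return problems;
        }

        void CheckSyntax(Stream file, List<string> problems) { /* format-specific checks */ }
        void CheckSemantics(Stream file, List<string> problems) { /* format-specific checks */ }
    }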

If you look at some of the more challenging file formats (PDF and TIFF come to mind), there are no good verifiers. “Hey!” you object, “doesn’t Adobe sell PreFlight?” Why yes, and I said good. PreFlight is OK, and just OK. I’ve had correct files that it has complained about and incorrect files that it has missed. PreFlight was late to the game and offers a different interpretation of the PDF spec than Acrobat does, and for good reason: Acrobat has a longer history of accepting bad PDF. PreFlight is supposed to be like lint.

Ideally, I think that any file-consuming code should be written like a compiler rather than following the traditional model of “accept anything”. Can you imagine what web pages would be like if browsers refused to accept non-compliant files? Or web servers that would preflight their output and reject anything incorrect?

Unfortunately, the answer to that is “and then we wouldn’t have the web.” Most of the rapid adoption of HTML as a web publishing standard had to do with the ability of any old user with access to vi/emacs/teco/etc. to create content, and content created interest. GIF, which is itself a fairly awful format, has seen a resurgence due to people using its limited animation features.

Obviously there’s an inherent social problem. If you don’t make a file format easy to generate, you won’t achieve adoption and the format is likely to exist only as a footnote in history.

PDF goes completely counter to this. The revenue model for Acrobat after version 1.0 was “make the reader free and as ubiquitous as possible, but charge for the tools to generate/edit/mark up files.” As a side note, Adobe very strategically injected tools into typical workflows to make it easier to generate PDF, by having reliable print drivers ready on the ship date as well as insisting that Illustrator and Photoshop would generate PDF. The result was that when people got a taste for what PDF could do, they started writing their own tools for making PDF rather than paying Adobe for them. Many people have done that task spectacularly and horrifyingly poorly. These coders typically have the free reader, and their code-test-debug cycle becomes this:

    WritePdfGeneratingCode();
    while (!AcrobatOpensMyFile())
        DebugMyCode();

And this operates on a faulty predicate: just because Acrobat opens and displays a PDF doesn’t mean that it’s correct. Sometimes Acrobat will do repairs and let you know by asking you if you want to save the file. Sometimes it silently performs the repairs.

In my own code, I tried to avoid this problem. When I represent PDF data structures in my code, I have metadata associated with them that represents the PDF specification. For example, when a key in a PDF dictionary is marked as required, I have code that will trap if that key is missing at either parse or generation time, so that class of error should never happen. Similarly, keys can be marked with the version of the PDF spec in which they were added, so that it is impossible to use a later feature in a PDF intended to conform to an earlier version.
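
The Atalasoft implementation isn’t reproduced here, but the idea can be sketched with a table of per-key rules (the names below are invented for illustration) consulted at both parse and generation time:

    using System;
    using System.Collections.Generic;

    // Hypothetical per-key metadata mirroring what the spec says about a dictionary entry.
    class KeyRule
    {
        public bool Required;          // the spec marks this key as required
        public decimal MinPdfVersion;  // the spec version that introduced this key
    }

    class DictionaryChecker
    {
        readonly Dictionary<string, KeyRule> _rules;

        public DictionaryChecker(Dictionary<string, KeyRule> rules) { _rules = rules; }

        // Called at both parse time and generation time.
        public void Check(IDictionary<string, object> dict, decimal targetPdfVersion)
        {
            foreach (var pair in _rules)
            {
                if (pair.Value.Required && !dict.ContainsKey(pair.Key))
                    throw new InvalidOperationException($"Required key {pair.Key} is missing.");

                if (dict.ContainsKey(pair.Key) && pair.Value.MinPdfVersion > targetPdfVersion)
                    throw new InvalidOperationException(
                        $"Key {pair.Key} requires PDF {pair.Value.MinPdfVersion}, but the target version is {targetPdfVersion}.");
            }
        }
    }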

The initial products that Atalasoft released based on this code were PDF output-only, and we had the freedom to use code that would reject any spec violation, because we caught and fixed them before release through normal testing.

When we released tools for manipulating PDF, that freedom was no longer available to us. Customers had PDF documents that they wanted to read, and we rejected them because they were in violation of the spec. For a time I would make a judgment call: is this particular class of spec violation something that we can practically allow, or should we continue to reject it? Where possible, I tried to stand by the spec, but this is a great way to lose customers, especially when they come back with “but Acrobat opens it!”

To this end, I added a major feature in 10.4 to semi-automatically repair damaged PDFs. This, I felt, was the better approach to handling issues. Instead of relaxing my detection of spec violations, I put in tools that aggregate and repair them. In some parlance, I implemented a “quirks mode”, but it’s more than that. Instead of throwing an exception, I allow a more formal process to happen. When a problem is detected, it is reported along with a severity, a proposed fix, and a description of the possible consequences of that fix. Other code evaluates the severity and consequences and, if they are acceptable, the fix is made. If they are unacceptable, an exception is thrown. I felt that using a formal process for handling issues was far better than simply softening the enforcement of spec rules.
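
Again as a sketch rather than the actual DotImage API, the shape of that process is a problem report paired with a policy that decides whether the proposed fix is acceptable:

    using System;

    enum Severity { Minor, Moderate, Severe }

    // Hypothetical problem report: what is wrong, how bad it is, and what fixing it would do.
    class Problem
    {
        public string Description;   // the spec violation that was detected
        public Severity Severity;    // how serious the violation is
        public string ProposedFix;   // what the repair would change
        public string Consequences;  // possible side effects of applying the fix
        public Action ApplyFix;      // the repair itself
    }

    class RepairPolicy
    {
        readonly Func<Problem, bool> _isAcceptable;  // caller-supplied judgment call

        public RepairPolicy(Func<Problem, bool> isAcceptable) { _isAcceptable = isAcceptable; }

        public void Handle(Problem problem)
        {
            if (_isAcceptable(problem))
                problem.ApplyFix();   // repair and keep going
            else
                throw new InvalidOperationException(problem.Description);  // reject the file
        }
    }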

By adopting this approach, I have moved the projected technical debt of accepting broken files away from the core code and into separate code. I’ve also informed the consumer as much as possible as to what just happened.

In sum, when you are considering making a file format, you should make sure that your short list of criteria includes:

  1. Expressive enough to solve your problem and to grow
  2. Correctness (easily) verified by software
  3. Verification software made public or readily available
  4. Include/require the ability to identify the software that produces the file
  5. Choose a priori whether your file consumer should be broadly or narrowly accepting (and be prepared to accept the consequences)
  6. Consider an “inform and repair” approach rather than hacking your parser for all the special cases to mitigate technical debt

About the Author

Steve Hawley

Steve has been with Atalasoft since 2005. Not only is he responsible for the architecture and development of DotImage, he is one of the masterminds behind Bacon Day. Steve has over 20 years of experience with companies like Bell Communications Research, Adobe Systems, Newfire, and Presto Technologies.
