Creating a product that is an API presents many challenges for an architect. Adding support for a particular feature always involves trade-offs along a number of axes. For example, you might have an easy-to-understand public abstraction at the cost of a challenging (or unreliable) private implementation. I’m going to take you through the process I went through in order to implement a feature in DotPdf for a customer.
The back story is that the PDF specification includes a misfeature called “PDF Portfolios”. In the PDF specification, these are called “Portable Collections” (a portfolio in the real world is a collection of documents that you carry). This feature is a way in which a number of documents/files can be embedded within a single PDF file and accessed from within the viewer’s UI. The embedded documents need not be PDF, but could be a Word doc, email, text, images, etc. The resulting embedded files can be presented to the user using some kind of UI. When Acrobat creates such a document, it also embeds a single PDF page that directs the user to use the current version of Acrobat. A viewer that doesn’t handle portfolios will only display that page.
Currently, our PDF rasterizer and PDF tools do not present the content of PDF Portfolios (I’m working as fast as I can!). Our customer had a PDF portfolio created by third-party software that did not include such a placeholder page, and our customer was confused when they didn’t get an actual document page.
My goal was to help our customer solve the problem: “How do I determine if a PDF document contains a PDF Portfolio?”
Now come the trade-offs. I could add a read-only property to PdfDocument and PdfGeneratedDocument called “ContainsPortfolio”. This is something to which I have access, and the amount of work to add it is very, very small. It is not the right solution for three very important reasons. First, although I have it readily available, the use case is “if this doc contains a portfolio, then handle it specially”, and the cost of asking that question is much higher in PdfDocument, since we scan every page in the file to create a page collection and to get each page’s rotation. For PdfGeneratedDocument, the cost is higher still, since PdfGeneratedDocument also reads in all the content. The second reason that this is a poor decision is that in the future, I plan on having support for PDF Portfolios in both PdfDocument and PdfGeneratedDocument, and if I put in that property, I’m creating a dead-end feature that will probably be obsoleted/deprecated when I add the Portfolio property to each. The third and final reason is that I can’t guarantee that the property is correct if the PDF had errors upon opening, which means that in a repair scenario, the customer’s code just got more complicated.
So to be precise, I want this feature to be low-overhead and its implementation to be complementary to future PDF Portfolio support instead of conflicting or dead-end.
I decided that this should be its own class. So I looked over the spec and made a list of all the information that I could extract from a PDF document at very little cost. Then I whittled that list down to a set that I thought our customers would find most valuable:
- Is it a PDF?
- Is the header valid?
- What is the PDF version?
- Is it encrypted?
- Is the declared PDF version correct for using a cross reference stream?
- Is it a Portfolio?
- Is the file “badly” damaged?
- Is the file declared to be PDF/A?
- Does it have a form?
- Does the form use XFA?
- Does the document declare that it has signatures?
- How many pages are in the file?
- What is the document Metadata (if any)?
- Does the document have XMP?
- Were errors encountered while aggregating all of this?
Since all of these things are informational, it seemed that the best approach was to have an immutable class with properties to describe each of these features. The user would call a static factory method to get an instance of this object and then act on the information.
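To make the shape of that design concrete, here is a minimal sketch in Python rather than in DotPdf’s own API. Every name in it (`PdfQuickInfo`, `examine`, the field names) is hypothetical, and only four of the properties from the list are modeled. The byte scans for `/Encrypt` and `/Collection` are a deliberately naive assumption for illustration: they read the whole buffer and will miss a catalog stored in a compressed object stream, so a real examiner would read only the trailer and the catalog.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen=True makes instances immutable
class PdfQuickInfo:
    """Hypothetical result type; the field names are illustrative only."""
    is_pdf: bool
    version: Optional[str]
    is_encrypted: bool
    is_portfolio: bool

    @staticmethod
    def examine(data: bytes) -> "PdfQuickInfo":
        """Static factory: inspect raw bytes without building a page collection."""
        # Look for the "%PDF-x.y" header near the start of the file.
        header = re.search(rb"%PDF-(\d\.\d)", data[:1024])
        if header is None:
            return PdfQuickInfo(False, None, False, False)
        return PdfQuickInfo(
            is_pdf=True,
            version=header.group(1).decode("ascii"),
            # Naive heuristic: an /Encrypt key normally appears in the trailer.
            is_encrypted=b"/Encrypt" in data,
            # Naive heuristic: a portfolio's catalog carries a /Collection key.
            is_portfolio=b"/Collection" in data,
        )
```

A caller would ask for the information once and then branch on it, for example `if PdfQuickInfo.examine(data).is_portfolio: ...`. Because the object is frozen, nothing downstream can change the answers after the fact.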
The only uncomfortable part is that in a typical UI bound application, you will possibly need to call this method multiple times. First to determine if the file is a PDF and if it is encrypted, then to prompt the user for a password and call the method again (until either the password is correct or the user gives up). After a successful password, the contents will be valid throughout (aside from errors in the file).
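That repeated-call flow might look roughly like the following Python sketch. It is not DotPdf code: `ExamineOutcome`, `examine_until_unlocked`, and the two callbacks are hypothetical names, and the assumption is that the examiner takes an optional password and reports whether it was accepted.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ExamineOutcome:
    """Hypothetical stand-in for the examiner's result."""
    is_pdf: bool
    is_encrypted: bool
    password_accepted: bool


def examine_until_unlocked(
    examine: Callable[[Optional[str]], ExamineOutcome],
    ask_password: Callable[[], Optional[str]],
) -> ExamineOutcome:
    # First call: no password, just to learn whether it is a PDF and encrypted.
    outcome = examine(None)
    if not outcome.is_pdf or not outcome.is_encrypted:
        return outcome
    # Keep calling until the password works or the user gives up (returns None).
    while not outcome.password_accepted:
        password = ask_password()
        if password is None:
            break
        outcome = examine(password)
    return outcome
```

Each pass through the loop produces a fresh immutable result, so the UI code never has to reason about a half-updated object.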
Now, let’s examine the time breakdown:
|Port to JoltPdf||3 hours|
If you look at this, implementation time was completely dwarfed by testing time. This was exactly what I’d expect. During testing, I discovered precisely one error, which was a logic error in the order of operations.
As for the content of the class, I’m very happy because I’ve solved not only the customer’s problem, but a set of other common problems as well. For example, given a repository of PDF documents, this code gives access to the document metadata, which could be used for simple sorting and searching. I haven’t blocked myself in with the design, either. If anything, I’ve opened up a future avenue in that the Examiner namespace could be used for PDF/A analysis, color analysis and so on. The only thing I don’t like is that a number of the low-overhead things I report are advisory rather than authoritative. For example, the PDF can advertise that it is PDF/A without actually complying fully with the PDF/A rules. Likewise, if the number of errors reported is zero, it only means that zero errors were encountered when trying to get this information; there may be others. To be authoritative, those types of checks become very expensive, which is why I chose not to include them. This is also why I chose not to include a method for accessing XMP metadata. I can easily check to see if there is XMP, but the cost of blindly retrieving it is high, and putting in an accessor to read it later creates a problem of ownership of the underlying stream, so I decided that wasn’t worth it.
From this real-world example, you can see that the process of implementing a feature for a customer takes more thought and time than simply adding a property to a class. An architect needs to consider the needs of the specific customer, whether there is a larger set of problems that can be solved at the same time, the side effects of the change on existing and future API design, the cost of implementation in terms of time, and performance. By taking all of these elements into consideration, it is possible to strike a balance that is neither a misfeature nor feature creep.
If there is a feature that you would like to see in a future release, please let us know about it.
About the Author: Steve Hawley