Benefits and how to get the job done
There’s no question that PDF is a good choice for storing documents in a document management system. Even so, for scanned documents, TIFF is still the most widely used format. That’s because TIFF is an open standard with a free implementation, supports multiple pages, and keeps files small through its choice of pixel formats and compression algorithms.
PDF, however, has many advantages over TIFF, even for imaging. It has many of the compression formats of TIFF, and it also supports JPEG2000 and JBIG2, two newer formats that offer size and quality benefits over anything in TIFF.
More importantly, though, a PDF can have a lot more than just images in it. TIFF supports a rich metadata model, but PDF has metadata too, and it can also store annotations and recognized text. In addition, PDF viewers are everywhere, including on mobile devices.
PDF for Document Images
Documents in a document management system typically go through three phases: Capture, Collaboration/Processing, and Archival. PDF has strengths over TIFF in all three phases.
In the Capture phase, we are scanning the document, cleaning it, classifying it, indexing its text and metadata, and putting it into the proper place in our repository.
In the Collaboration/Processing phase, the document is going through a workflow. It is being actively read, marked up, and processed.
Finally, in the Archival phase, the document is a record of work. It won’t be actively looked at, but it needs to be available for reference or compliance. It’s vital that the document be findable via search.
PDF in the Capture Phase
The goal of the capture phase is to get the document into the system in a way that it will be useful later. Capture is a highly automatable process, but it’s also useful to use automation to assist a human operator.
The first thing you should do is scan directly as PDF. Nearly every scanner driver and capture software can do this. At this point, you should also OCR the document to create a Searchable PDF.
A searchable PDF is simply a PDF with multiple layers. The top layer is the original image as it was scanned in. Under that layer is a layer of text accurately positioned so that each word is directly behind the pixels that represent the word. Sophisticated OCR can also identify headers in the document and create PDF bookmarks.
This has a few benefits:
- You can select the text, copy it to the clipboard, and then paste into a metadata collection form.
- You can highlight search hits.
- With bookmarks, you can easily navigate to the important parts of the document.
- Search engines will be able to index the PDF and return it later as a search result.
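To place each recognized word directly behind its pixels, the word's position in the scan has to be converted into PDF page coordinates. The following is a minimal sketch of that conversion, not any particular toolkit's implementation. It assumes a hypothetical OCR engine that reports word boxes in pixels with the origin at the top-left of the image; the `PixelBox` type and the function name are illustrative only. PDF's default user space is measured in points (1/72 inch) with the origin at the bottom-left of the page, so the box is scaled by the scan resolution and the vertical axis is flipped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PixelBox:
    """Word box from a hypothetical OCR engine (pixels, origin at top-left)."""
    left: int
    top: int
    width: int
    height: int


def pixel_box_to_pdf_rect(box, page_height_px, dpi):
    """Convert a pixel box to (x0, y0, x1, y1) in PDF points.

    PDF default user space uses 72 units per inch with the origin at the
    bottom-left corner, so y values are measured up from the page bottom.
    """
    x0 = box.left * 72.0 / dpi
    x1 = (box.left + box.width) * 72.0 / dpi
    y1 = (page_height_px - box.top) * 72.0 / dpi               # top edge
    y0 = (page_height_px - box.top - box.height) * 72.0 / dpi  # bottom edge
    return (x0, y0, x1, y1)


# A US Letter page scanned at 300 DPI is 2550 x 3300 pixels.
word = PixelBox(left=300, top=150, width=600, height=75)
rect = pixel_box_to_pdf_rect(word, page_height_px=3300, dpi=300)
print(rect)  # (72.0, 738.0, 216.0, 756.0)
```

The resulting rectangle is where the hidden text for that word would be drawn, one inch from the left edge and half an inch below the top of the page.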
If you scan batches of documents together, then during the capture phase you need to split and merge pages to build your actual documents. You might also need to rotate pages that are upside-down or landscape. You should be able to automate some of this if you can detect blank pages or recognize barcode separator pages. Pages that were not scanned correctly can be rescanned and used to replace the originals.
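As a rough illustration of automated batch splitting, the sketch below groups scanned pages into documents. It is written under several assumptions that are not from any specific product: each page is a dictionary holding grayscale `pixels` and an optional `barcode` value that a barcode reader has already decoded, the separator value is `"SEP"`, and blank pages are simply dropped. The thresholds are illustrative and would need tuning on real scans.

```python
def is_blank(pixels, dark_threshold=128, max_dark_fraction=0.005):
    """Treat a page as blank when almost none of its pixels are dark.

    pixels is a list of rows of 0-255 grayscale values.
    """
    total = sum(len(row) for row in pixels)
    if total == 0:
        return True
    dark = sum(1 for row in pixels for value in row if value < dark_threshold)
    return dark / total <= max_dark_fraction


def split_batch(pages, separator_barcode="SEP"):
    """Group pages into documents.

    A page carrying the separator barcode ends the current document and is
    itself discarded. Blank pages are discarded as well (an assumption; some
    workflows keep them).
    """
    documents = []
    current = []
    for page in pages:
        if page.get("barcode") == separator_barcode:
            if current:
                documents.append(current)
            current = []
        elif not is_blank(page["pixels"]):
            current.append(page)
    if current:
        documents.append(current)
    return documents


inked = [[255] * 10 for _ in range(9)] + [[0] * 10]  # 10% dark pixels
blank = [[255] * 10 for _ in range(10)]              # all white

batch = [
    {"id": 1, "pixels": inked},
    {"id": 2, "pixels": blank},
    {"id": 3, "pixels": inked, "barcode": "SEP"},
    {"id": 4, "pixels": inked},
    {"id": 5, "pixels": inked},
]
documents = split_batch(batch)
print([[page["id"] for page in doc] for doc in documents])  # [[1], [4, 5]]
```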
The next step is to classify the document. If you use standardized forms, a forms recognition engine can match each form against registered templates. You should also be able to use zonal OCR and Optical Mark Recognition (OMR) to extract the filled-in fields.
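Zonal extraction can be pictured as follows. This is a simplified sketch, assuming the OCR step has already produced words with pixel boxes in `(x, y, width, height)` form and that a matched template supplies named zones as `(left, top, right, bottom)` rectangles. The field names, zones, and sample words are made up for illustration. A word is assigned to a zone when its center falls inside the zone.

```python
def words_in_zone(words, zone):
    """Return the text of all words whose center lies inside the zone.

    words: list of (text, (x, y, width, height)) in pixels.
    zone:  (left, top, right, bottom) in the same pixel space.
    Words are ordered top-to-bottom, then left-to-right. That is a
    simplification which assumes words on one line share the same y value.
    """
    left, top, right, bottom = zone
    hits = []
    for text, (x, y, w, h) in words:
        cx, cy = x + w / 2, y + h / 2
        if left <= cx <= right and top <= cy <= bottom:
            hits.append((y, x, text))
    hits.sort()
    return " ".join(text for _, _, text in hits)


# Hypothetical OCR output for an invoice form.
words = [
    ("Invoice", (100, 50, 200, 40)),
    ("No.", (320, 50, 80, 40)),
    ("12345", (420, 50, 150, 40)),
    ("Total", (100, 900, 120, 40)),
    ("$98.60", (420, 900, 160, 40)),
]

# Hypothetical zones registered with the matched template.
zones = {
    "invoice_number": (400, 40, 700, 100),
    "total": (400, 880, 700, 960),
}

fields = {name: words_in_zone(words, zone) for name, zone in zones.items()}
print(fields)  # {'invoice_number': '12345', 'total': '$98.60'}
```

The extracted values can then be used to pre-fill a metadata form for an operator to confirm.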
Once you have classified the document, you can now fill in its metadata, put it into the proper place in the repository and kick off the initial workflows. A lot of this can be automated and assisted because the PDF already contains the text.
PDF in the Collaboration/Processing Phase
Document images are often used to kick off processes. Bills to be paid, forms to be tabulated, receipts to be reimbursed, and medical records to be filed are typical documents that started out as paper but are now in your document management system, ready to be worked on.
Because PDF supports a rich annotation model, you can add notes, highlights, stamps, and other markup and be sure that anyone with a PDF viewer will be able to see it. In addition to the ubiquitous Acrobat Reader, many mobile and web applications have PDF viewers integrated into them. Unlike with TIFF, you can be very sure that your PDF and its annotations will be viewable by anyone who needs to work with the document.
Typically, in order to process a document, you need to search inside of it. If you created a searchable PDF in the capture phase, then you can search within the document and see the hits highlighted on the page.
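Highlighting works because every recognized word carries its position on the page. The sketch below shows the idea under some assumptions: the text layer is available as a list of words in reading order, each with an `(x0, y0, x1, y1)` rectangle in page coordinates, and matching is a case-insensitive comparison of whole words. The sample words and function name are illustrative. A phrase that wraps across two lines would get a single oversized rectangle here; a real viewer would emit one rectangle per line.

```python
def find_hits(words, query):
    """Return one bounding rectangle per occurrence of the query.

    words: list of (text, (x0, y0, x1, y1)) in reading order.
    query: one or more words; matching ignores case and trailing punctuation.
    """
    terms = query.lower().split()
    if not terms:
        return []
    normalized = [text.lower().strip(".,;:!?") for text, _ in words]
    hits = []
    for i in range(len(words) - len(terms) + 1):
        if normalized[i:i + len(terms)] == terms:
            boxes = [box for _, box in words[i:i + len(terms)]]
            # Union of the matched word boxes is the area to highlight.
            hits.append((
                min(b[0] for b in boxes),
                min(b[1] for b in boxes),
                max(b[2] for b in boxes),
                max(b[3] for b in boxes),
            ))
    return hits


# Hypothetical text layer for two lines of a page.
words = [
    ("Payment", (72, 700, 130, 712)),
    ("due", (134, 700, 158, 712)),
    ("on", (162, 700, 176, 712)),
    ("receipt.", (180, 700, 228, 712)),
    ("Late", (72, 680, 98, 692)),
    ("payment", (102, 680, 150, 692)),
    ("incurs", (154, 680, 190, 692)),
    ("fees.", (194, 680, 222, 692)),
]

hits = find_hits(words, "payment")
print(hits)  # [(72, 700, 130, 712), (102, 680, 150, 692)]
print(find_hits(words, "Payment due"))  # [(72, 700, 158, 712)]
```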
PDF in the Archival Phase
PDF shines as an archival format, specifically the subset of PDF that is standardized as PDF/A. This format was designed for archival, and it is guaranteed to be stable. Since PDF/A is a subset of PDF, all standard viewers, indexers, and information extractors know how to read it. The important thing to remember is to make sure that anything generating PDF knows how to limit its output to PDF/A.
But the best part of PDF as an archival format is that it is searchable. While documents are in active use, you generally know how to find the ones you need, and workflows are constantly making sure that everyone knows which documents they need to work on right now.
In archival, many years may have passed, and the staff may be different. Search is a critical feature to make sure that you can find necessary documents for audits, discovery, research or any other purpose.
DotImage for PDF
DotImage is the leading imaging toolkit for .NET developers. DotImage and most of its add-on SDKs include runtime royalty-free licensing for desktop applications and are also available in an OEM model that includes unlimited server deployments.
To learn more:
- Visit atalasoft.com/products/dotimage
- Call 866-568-0129 x 1 in the US or 413-572-4443 x 1 outside of the US
DotImage has the following features for PDF:
- Convert TIFF, JPEG, PNG, and many other formats to PDF (and vice versa)
- Combine multiple images and documents into a single PDF
- OCR images and create Searchable PDF
- View PDF in WinForms, ASP.NET, Silverlight and WPF applications
- Annotate PDF in WinForms, ASP.NET, Silverlight and WPF applications
- Store annotations as standard PDF annotations inside of any PDF
- Read and write bookmarks into any PDF
- Create PDF/A
- Create PDF using JPEG2000 or JBIG2 for images
- Scan directly as PDF (with TWAIN driver support) or convert scanned images into PDF
- Read and write encrypted PDF
- Add, Insert, Remove, or Replace pages in any PDF
- Perform 150 image processing commands including advanced document cleanup commands
- Read 1D and 2D barcodes from a PDF
About the Author
This is a general account for case studies, product information, and articles about the culture of Atalasoft.