BFO PDF Library 2.27.2 - introducing the Arlington Model

BFO PDF Library 2.27.2

BFO has released version 2.27.2 of our PDF Library, which is mostly about the "Arlington Model".

Arlington Model

PDF is an old specification, dating from 1993 - almost thirty years old as I write this. It wasn't ever designed for verification, but over the years various industry initiatives have attempted to improve things in this area - specifically PDF/A, which tightened the rules for how PDF should be created, and the move from PDF 1.7 to PDF 2.0, which involved a reevaluation and cleanup of the entire specification.

The latest initiative is the "Arlington Model", a machine-readable representation of some aspects of the PDF specification. It will never cover all aspects: the specification runs to over 1000 pages and references (directly or indirectly) over 1100 secondary specifications. BFO have a long-standing interest in "correct" PDF based on our work with PDF/A, so after a presentation on the topic at PDF Days 2022 we decided to incorporate the model into our API and see what we could learn from the process.

The model formally describes the various tables defining PDF objects. For example, a PDF "Page" object is a dictionary with a Type value of /Page; it may have a list of annotations, and it must have a MediaBox describing the page rectangle. The model is evolving but currently describes 544 different types of object, described in a vendor-neutral format (TSV) and available at https://github.com/pdf-association/arlington-pdf-model.
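
To make the idea of a table-driven definition concrete, here is a minimal sketch of how one row per key can drive a check. It is not BFO's implementation and it does not read the real model files: the three-column layout (Key, Type, Required) is a simplified layout assumed for this example, the sample rows are hand-written to match the Page description above, and the class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: check a dictionary against a tiny tab-separated object definition. */
final class ModelSketch {

    /** One row of the simplified table: key name, expected type, required flag. */
    static final class Row {
        final String key;
        final String type;
        final boolean required;
        Row(String key, String type, boolean required) {
            this.key = key;
            this.type = type;
            this.required = required;
        }
    }

    // Hand-written sample rows for a "Page" object; an assumption, not copied from the model.
    static final String PAGE_TSV =
        "Key\tType\tRequired\n" +
        "Type\tname\tTRUE\n" +
        "MediaBox\trectangle\tTRUE\n" +
        "Annots\tarray\tFALSE\n";

    /** Parse the tab-separated text, skipping the header line. */
    static List<Row> parse(String tsv) {
        List<Row> rows = new ArrayList<>();
        String[] lines = tsv.split("\n");
        for (int i = 1; i < lines.length; i++) {
            String[] cols = lines[i].split("\t");
            rows.add(new Row(cols[0], cols[1], Boolean.parseBoolean(cols[2])));
        }
        return rows;
    }

    /** Return the names of required keys that are absent from the dictionary. */
    static List<String> missingRequired(List<Row> rows, Map<String, Object> dict) {
        List<String> missing = new ArrayList<>();
        for (Row row : rows) {
            if (row.required && !dict.containsKey(row.key)) {
                missing.add(row.key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<String, Object> page = new LinkedHashMap<>();
        page.put("Type", "Page");   // MediaBox deliberately left out
        System.out.println(missingRequired(parse(PAGE_TSV), page));   // prints [MediaBox]
    }
}
```

The real model records more per key than this sketch does, so a production check is considerably more involved than a simple "is the key present" test.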

The BFO implementation of the Arlington Model

We felt the two questions this model could answer for customers would be "is this PDF correct?" and "if not, how can I fix it?", so that's the focus we've taken with our implementation. The first step is retrieving a list of issues - places where the PDF deviates from the model.

   PDF pdf = new PDF(new PDFReader(new File("input.pdf")));
   OutputProfiler profiler = new OutputProfiler(pdf);
   List<ArlingtonModelIssue> list = profiler.getArlingtonModelIssues();
  

Compared to fully profiling a PDF this will be a very quick operation, and if the returned list is empty there are no issues at all with the file. But based on our testing, roughly half the time a PDF will have at least one issue in the returned list. Some of these issues are easily repaired, in which case the getRepairType() method on the issue will give a concise description. To repair all issues that can be repaired:

   List<ArlingtonModelIssue> list = profiler.getArlingtonModelIssues();
   for (ArlingtonModelIssue issue : list) {
       if (issue.getRepairType() != null) {
           issue.repair(null);
       }
   }
  

That's the API at its simplest. There are things to consider - for example, if the PDF is signed the repair process will almost certainly invalidate the signature. In rare cases the correction could change the way the PDF appears - for example, the correct range for a color component is 0..1, but some PDF creators scale it to 0..255. We can correct this by downscaling the values to the correct range, making a previously invalid color valid. The getRepairWarning() method will return non-null if there's something you need to consider before repairing, and these repairs can be skipped if necessary. The API docs for ArlingtonModelIssue give more information on the process and the data you can get for each issue.
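
The color example can be illustrated in isolation. The following is a minimal sketch of the arithmetic only, assuming components are held in a plain double array; it is not the library's repair code, and the class and method names are invented for this example.

```java
import java.util.Arrays;

/** Hypothetical sketch: rescale color components written as 0..255 into the valid 0..1 range. */
final class ColorRangeSketch {

    /** True if any component lies outside the valid 0..1 range. */
    static boolean outOfRange(double[] components) {
        for (double v : components) {
            if (v < 0 || v > 1) {
                return true;
            }
        }
        return false;
    }

    /**
     * Return a copy of the components. If any value is out of range, every value is
     * divided by 255 and clamped to 0..1; otherwise the copy is unchanged.
     */
    static double[] normalize(double[] components) {
        double[] out = components.clone();
        if (outOfRange(components)) {
            for (int i = 0; i < out.length; i++) {
                out[i] = Math.min(1.0, Math.max(0.0, out[i] / 255.0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(normalize(new double[] {255, 128, 0})));
        System.out.println(Arrays.toString(normalize(new double[] {0.2, 0.4, 0.6})));
    }
}
```

Note that the check is a guess about what the creator intended: an array such as {1, 1, 0} is already valid, so it is left alone even if it was written by a tool working in 0..255. That ambiguity is one reason a repair of this kind deserves a warning rather than being applied silently.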

Finally, the Dump.java example we ship with the PDF Library download will include a summary of Arlington Model issues found. It's a good way to get a quick overview of a PDF, and is usually the first diagnostic tool we use when you email us a problematic file.

Analysis

We've run this code over our core collection of 800 or so test documents, and found roughly half of them had problems. In most cases the problems were insignificant. For example, many files (including some we'd generated ourselves) had the wrong PDF version number. There are very few cases where the version number actually matters - PDF/A compliance is the only one we can think of - and PDF/A validators already test for that.

Other issues related to values in dictionaries which are required in the specification but not actually checked by any tools - for example, bitmap Images must have a Type value of /XObject, but this is often left out with no ill-effect. We think of these types of restrictions as the "junk DNA" of PDF; requirements that exist, but that generally don't matter. Every specification has some.

Some are more interesting. Many files with unembedded fonts leave out the Glyph Widths, making the layout unknown unless the font is available (an issue we cannot repair). Others contain invalid content in the Structure Tree used by PDF/UA - while most users of PDF would never notice, it's possible that users of assistive technology tools would miss out on some of the stored data. Finally, some issues flag behaviour that certain workflows depend on - for example, custom Annotation Types and Signature Filters are both common in those workflows and perfectly valid in practice, even though they're not defined in ISO32000 and so are technically in violation.

Overall we think the Arlington model is an important addition to PDF which will mostly benefit PDF creation software like ours. We've found a few places where the files we created didn't match the model, and this release corrects that (a specific example: we'll generally now write the correct version number for PDFs we generate).

There are many places in our API where we make allowance for these kinds of errors from other PDF creation tools. Those allowances have been found by trial and error (if you've ever emailed us a PDF that threw a NullPointerException, chances are you've been a part of this process). Widespread adoption of the Arlington Model is going to reduce these occurrences across the industry.

Should you, as an end user, think about this? For validation alone, probably not. However, the repair process is quick and safe, so if you're loading a PDF with our API, editing it, then sending it to another tool for further processing, we suggest checking the PDF for model issues and repairing them before you save the file. We'd also suggest doing this if you're generating PDF/A files for archiving, which really should be as correct as possible.
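
One conservative way to apply this advice is to repair only the issues that are repairable and carry no repair warning, and to report the rest. The sketch below shows that policy on its own. So that it compiles without the library, it uses a stand-in Issue interface: only the method names getRepairType and getRepairWarning come from the text above, the no-argument repair() is a simplification of the call shown earlier, and everything else (class names, the fake issue used for the demo) is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a "repair only what is safe" policy. */
final class RepairPolicySketch {

    /** Stand-in for the library's issue type; NOT the real class. */
    interface Issue {
        String getRepairType();     // null if the issue cannot be repaired
        String getRepairWarning();  // non-null if there is something to consider first
        void repair();              // simplified: the call shown in the article takes an argument
    }

    /** Tiny fake used only to exercise the loop. */
    static final class FakeIssue implements Issue {
        final String type;
        final String warning;
        boolean repaired;
        FakeIssue(String type, String warning) {
            this.type = type;
            this.warning = warning;
        }
        public String getRepairType() { return type; }
        public String getRepairWarning() { return warning; }
        public void repair() { repaired = true; }
    }

    /** Repair every issue that is repairable and has no warning; return the ones left alone. */
    static List<Issue> repairSafely(List<? extends Issue> issues) {
        List<Issue> skipped = new ArrayList<>();
        for (Issue issue : issues) {
            if (issue.getRepairType() != null && issue.getRepairWarning() == null) {
                issue.repair();
            } else {
                skipped.add(issue);
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        List<FakeIssue> issues = List.of(
            new FakeIssue("add missing key", null),
            new FakeIssue("rescale color", "may change appearance"),
            new FakeIssue(null, null));
        System.out.println(repairSafely(issues).size() + " issue(s) left for review");
    }
}
```

The skipped list is the useful by-product here: it is what you would log or show to a person before deciding whether the remaining repairs are acceptable for a given file.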

Other changes in 2.27.2

While 2.27.2 is mostly about the Arlington Model, there are a few other features too. Most are minor bug fixes but one feature we wanted to note is that we've added support for ISO32001 and ISO32002 to the API, which add the SHA-3 family of signature hashes and support for Edwards Curves (EdDSA) in digital signatures (specifically Ed25519 and Ed448). EdDSA requires Java 15 or later, and both ISO specifications were published in October 2022 so it's unlikely you'll be seeing these algorithms in widespread use for a while. When they're eventually added to other PDF creators this release means they'll work just like any other type of signature.
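
For readers who haven't met these algorithms yet, the sketch below exercises the underlying primitives using only the JDK's own security classes. It does not use our API and does not show how a signature is embedded in a PDF; it simply generates an Ed25519 key pair, signs and verifies some bytes, and computes a SHA3-256 digest. The class and method names are invented for this example, and it assumes Java 15 or later, as noted above for EdDSA.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.MessageDigest;
import java.security.Signature;

/** Hypothetical sketch: Ed25519 signing and SHA3-256 hashing with the standard JDK providers. */
final class EdDsaSketch {

    /** Compute a SHA3-256 digest of the data. */
    static byte[] sha3(byte[] data) throws GeneralSecurityException {
        return MessageDigest.getInstance("SHA3-256").digest(data);
    }

    /** Sign the data with a freshly generated Ed25519 key and verify the result. */
    static boolean signAndVerify(byte[] data) throws GeneralSecurityException {
        KeyPair pair = KeyPairGenerator.getInstance("Ed25519").generateKeyPair();

        Signature signer = Signature.getInstance("Ed25519");
        signer.initSign(pair.getPrivate());
        signer.update(data);
        byte[] signature = signer.sign();

        Signature verifier = Signature.getInstance("Ed25519");
        verifier.initVerify(pair.getPublic());
        verifier.update(data);
        return verifier.verify(signature);
    }

    public static void main(String[] args) throws GeneralSecurityException {
        byte[] data = "example document bytes".getBytes(StandardCharsets.UTF_8);
        System.out.println("SHA3-256 digest length: " + sha3(data).length + " bytes");
        System.out.println("Ed25519 signature verified: " + signAndVerify(data));
    }
}
```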

Download

For more information please see the changelog, and as always you can download the latest version from our website.