BFO PDF Library 2.27.2
BFO has released version 2.27.2 of our PDF Library, which is mostly about the "Arlington Model".
Arlington Model
PDF is an old specification, dating from 1993 - almost thirty years old as I write this. It wasn't ever designed for verification, but over the years various industry initiatives have attempted to improve things in this area - specifically PDF/A, which tightened the rules for how PDF should be created, and the move from PDF 1.7 to PDF 2.0, which involved a reevaluation and cleanup of the entire specification.
The latest initiative is the "Arlington Model", a machine readable representation of some aspects of the PDF specification. The specification runs to over 1000 pages and references (directly or indirectly) over 1100 secondary specifications, so the model will never cover all aspects. BFO have a long-standing interest in "correct" PDF based on our work with PDF/A, so after a presentation on the topic at PDF Days 2022 we decided to incorporate it into our API and see what we could learn from the process.
The model formally describes the various tables defining PDF objects. For example, a PDF "Page" object is a dictionary with a Type value of /Page; it may have a list of annotations, and it must have a MediaBox describing the page rectangle. The model is evolving but currently describes 544 different types of object, described in a vendor-neutral format (TSV) and available at https://github.com/pdf-association/arlington-pdf-model.
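To give a feel for what "machine readable" means here, the following is a minimal sketch of checking a dictionary against a table of required keys. The three rows are illustrative and are not copied from the Arlington Model (the real TSV files carry more columns and richer expressions), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch only: checks a dictionary against "required key" rows held in a TSV table.
class ModelCheckSketch {

    // Illustrative rows, NOT copied from the Arlington Model: key, type, required flag.
    static final String PAGE_ROWS =
        "Key\tType\tRequired\n" +
        "Type\tname\tTRUE\n" +
        "MediaBox\trectangle\tTRUE\n" +
        "Annots\tarray\tFALSE\n";

    // Returns the names of required keys that are absent from the dictionary.
    static List<String> missingRequiredKeys(String tsv, Map<String, Object> dictionary) {
        List<String> missing = new ArrayList<>();
        String[] lines = tsv.split("\n");
        for (int i = 1; i < lines.length; i++) {        // skip the header row
            String[] cols = lines[i].split("\t");
            boolean required = cols[2].equals("TRUE");
            if (required && !dictionary.containsKey(cols[0])) {
                missing.add(cols[0]);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // A "Page" dictionary that has a Type but no MediaBox.
        Map<String, Object> page = Map.of("Type", "/Page");
        System.out.println(missingRequiredKeys(PAGE_ROWS, page));
    }
}
```

Because the rules live in data rather than in code, the same loop can be pointed at any object type the model describes.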
The BFO implementation of the Arlington Model
We felt the two questions this model could answer for customers would be "is this PDF correct?" and "if not, how can I fix it?", so that's the focus we've taken with our implementation. The first step is retrieving a list of issues - places where the PDF deviates from the model.
PDF pdf = new PDF(new PDFReader(new File("input.pdf")));
OutputProfiler profiler = new OutputProfiler(pdf);
List<ArlingtonModelIssue> list = profiler.getArlingtonModelIssues();
Compared to fully profiling a PDF this will be a very quick operation, and if the returned list is empty there are no issues at all with the file. But based on our testing, roughly half the time a PDF will have at least one issue in the returned list.
Some of these issues are easily repaired, in which case the getRepairType() method on the issue will give a concise description. To repair all issues that can be repaired:
List<ArlingtonModelIssue> list = profiler.getArlingtonModelIssues();
for (ArlingtonModelIssue issue : list) {
    if (issue.getRepairType() != null) {
        issue.repair(null);
    }
}
That's the API at its simplest. There are things to consider: for example, if the PDF is signed the repair process will almost certainly invalidate the signature. In rare cases the correction could change the way the PDF appears. For example, the correct range for a color component is 0..1, but some PDF creators scale it to 0..255. We can correct this by downscaling the values to the correct range, making a previously invalid color valid, but that may change how the color is displayed. The getRepairWarning() method will return non-null if there's something you need to consider before repairing, and these repairs can be skipped if necessary. The API docs for ArlingtonModelIssue give more information on the process and the data you can get for each issue.
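If you'd rather leave anything with a warning for manual review, the loop might look something like the sketch below. To keep it compilable on its own it uses a hypothetical stand-in interface, Issue, rather than the real ArlingtonModelIssue class; only the two getter names come from the text above, and the stand-in's repair() takes no argument, unlike the issue.repair(null) call shown earlier.

```java
import java.util.ArrayList;
import java.util.List;

class SafeRepairSketch {

    // Hypothetical stand-in for ArlingtonModelIssue, so the sketch compiles without the library.
    interface Issue {
        String getRepairType();     // null if the issue cannot be repaired
        String getRepairWarning();  // non-null if the repair needs a second look
        void repair();              // assumption: the real method takes an argument
    }

    // Repairs every issue that is repairable and carries no warning.
    // Returns the issues that were left alone so they can be reviewed.
    static List<Issue> repairWithoutWarnings(List<Issue> issues) {
        List<Issue> skipped = new ArrayList<>();
        for (Issue issue : issues) {
            if (issue.getRepairType() != null && issue.getRepairWarning() == null) {
                issue.repair();
            } else {
                skipped.add(issue);
            }
        }
        return skipped;
    }

    // Minimal fake used only to exercise the loop.
    static class FakeIssue implements Issue {
        final String type;
        final String warning;
        boolean repaired;
        FakeIssue(String type, String warning) { this.type = type; this.warning = warning; }
        public String getRepairType() { return type; }
        public String getRepairWarning() { return warning; }
        public void repair() { repaired = true; }
    }

    public static void main(String[] args) {
        List<Issue> issues = List.of(
            new FakeIssue("add missing Type", null),
            new FakeIssue("rescale color", "may change appearance"),
            new FakeIssue(null, null));
        System.out.println(repairWithoutWarnings(issues).size() + " issue(s) left for review");
    }
}
```

Returning the skipped issues, rather than silently ignoring them, makes it easy to log them or prompt a user before applying the riskier repairs.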
Finally, the Dump.java example we ship with the PDF Library download will include a summary of Arlington Model issues found. It's a good way to get a quick overview of a PDF, and is usually the first diagnostic tool we use when you email us a problematic file.
Analysis
We've run this code over our core collection of 800 or so test documents, and found roughly half of them had problems. In most cases the problems were insignificant. For example, many files (including some we'd generated ourselves) had the wrong PDF version number. There are very few cases where the version number actually matters - PDF/A compliance is the only one we can think of - and PDF/A validators already test for that.
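For context, the version a file declares lives in its header, for example %PDF-1.4, and a Version entry in the document catalog can override it. The sketch below is not taken from our library and uses hypothetical names; it only reads the declared header value. Deciding whether that value is "wrong" also requires knowing which version each feature used in the file was introduced in, which is the kind of information the model records.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeaderVersionSketch {

    // Matches a header such as "%PDF-1.7" or "%PDF-2.0".
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d)\\.(\\d)");

    // Returns the version declared in the header, e.g. "1.4", or null if none is found.
    static String declaredVersion(byte[] firstBytes) {
        // ISO-8859-1 maps every byte to a character, so binary content cannot break decoding.
        String head = new String(firstBytes, StandardCharsets.ISO_8859_1);
        Matcher m = HEADER.matcher(head);
        return m.find() ? m.group(1) + "." + m.group(2) : null;
    }

    public static void main(String[] args) {
        byte[] start = "%PDF-1.4\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(declaredVersion(start));
    }
}
```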
Other issues related to values in dictionaries which are required in the specification but not actually checked by any tools. For example, bitmap Images must have a Type value of /XObject, but this is often left out with no ill effect. We think of these types of restrictions as the "junk DNA" of PDF: requirements that exist, but that generally don't matter. Every specification has some.
Some are more interesting. Many files with unembedded fonts leave out the Glyph Widths, making the layout unknown unless the font is available (an issue we cannot repair). Others create invalid content in the Structure Tree used by PDF/UA - while most users of PDF would never notice, it's possible that users of assistive technology tools would miss out on some of the stored data. Finally, some issues flag behaviour that certain workflows actually require. For example, custom Annotation Types or Signature Filters are both common in those workflows and perfectly valid, even though they're not defined in ISO 32000 and so are technically in violation.
Overall we think the Arlington model is an important addition to PDF which will mostly benefit PDF creation software like ours. We've found a few places where the files we created didn't match the model, and this release corrects that (a specific example: we'll generally now write the correct version number for PDFs we generate).
There are many places in our API where we make allowance for these kinds of errors from other PDF creation tools, which have been found by trial and error (if you've ever emailed us a PDF that threw a NullPointerException, chances are you've been a part of this process). Widespread adoption of the Arlington Model is going to reduce these occurrences across the industry.
Should you, as an end user, think about this? For validation alone, probably not. However, the repair process is quick and safe, so if you're loading a PDF with our API, editing it, then sending it to another tool for further processing, we suggest checking the PDF for model issues and repairing them before you save the file. We'd also suggest doing this if you're generating PDF/A files for archiving, which really should be as correct as possible.
Other changes in 2.27.2
While 2.27.2 is mostly about the Arlington Model, there are a few other changes too. Most are minor bug fixes, but one feature we wanted to note is that we've added support for ISO/TS 32001 and ISO/TS 32002 to the API, which add the SHA-3 family of signature hashes and support for Edwards Curves (EdDSA) in digital signatures (specifically Ed25519 and Ed448). EdDSA requires Java 15 or later, and both ISO specifications were published in October 2022, so it's unlikely you'll be seeing these algorithms in widespread use for a while. When they're eventually added to other PDF creators, this release means they'll work just like any other type of signature.
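If you're wondering where the Java 15 requirement comes from, it's the JDK itself: that's the release in which the standard security providers gained EdDSA. The sketch below uses only JDK classes to sign and verify with Ed25519 and to compute a SHA3-256 digest. It illustrates the underlying primitives only, not how our API applies them to a PDF signature, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.MessageDigest;
import java.security.Signature;

class EdDsaSketch {

    // Signs the data with a freshly generated Ed25519 key, then verifies the signature.
    static boolean signAndVerify(byte[] data) {
        try {
            // "Ed25519" is available from Java 15; earlier JDKs throw NoSuchAlgorithmException.
            KeyPair keys = KeyPairGenerator.getInstance("Ed25519").generateKeyPair();

            Signature signer = Signature.getInstance("Ed25519");
            signer.initSign(keys.getPrivate());
            signer.update(data);
            byte[] signature = signer.sign();

            Signature verifier = Signature.getInstance("Ed25519");
            verifier.initVerify(keys.getPublic());
            verifier.update(data);
            return verifier.verify(signature);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("EdDSA not available on this JVM", e);
        }
    }

    // Returns the length in bytes of the SHA3-256 digest of the data.
    static int sha3DigestLength(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA3-256").digest(data).length;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("SHA-3 not available on this JVM", e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "example".getBytes(StandardCharsets.UTF_8);
        System.out.println("Ed25519 verified: " + signAndVerify(data));
        System.out.println("SHA3-256 digest bytes: " + sha3DigestLength(data));
    }
}
```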
Download
For more information please see the changelog, and as always you can download the latest version from our website.