BFO PDF Library 2.27 - a PDF/A update

BFO PDF Library 2.27

We released version 2.27 of our PDF Library. If you're working with PDF/A this is a significant release - read on for details.

PDF/A

Many of the changes in this release come from work we've been doing with the Apache Tika Corpora, a collection of (currently) 32,580 PDF files submitted to public issue lists or bug trackers for various projects.

About a year ago we published an analysis of PDF/A conversion on about 15,000 files from another corpus, Govdocs1. If Govdocs1 contains typical PDF documents, the Apache Tika collection does not. It consists entirely of PDFs that were submitted as part of a bug report, so these files tend to be very poorly structured. Many were deliberately intended to stress-test PDF software.

Perhaps foolishly, our goal with this release was to convert as many of these as possible to PDF/A. We've largely achieved that, loading 98.8% of the files and successfully converting all but 4. Of the files we couldn't load, most are not actually PDF files, or are so truncated they have no catalog or pages.

We hope to do a proper write-up of this later, but what matters for this article is that we had to make a lot of fixes to the API to handle the various types of corruption we encountered. Most of the changes in this release are fixes to ensure PDF/A conversion can better handle malformed files.

Some of the changes go further - partly to address bugs in our code, partly new functionality and partly bringing our PDF/A implementation up to date with the latest errata from the PDF/A Technical Working Group (of which BFO is a member). Here's a summary.

  • Our API wasn't previously identifying images that were damaged or truncated as being invalid for PDF/A. It isn't explicitly listed as a requirement for PDF/A, although it is implied, and a brief survey shows this test isn't consistently implemented across PDF/A verifiers. We've made the decision to disallow this in PDF/A, so we can repair it if encountered. The new ImageDamaged feature is part of all the PDF/A profiles, but this means that files previously certified as valid by our API may now appear invalid. To revert to the old behaviour, call clearDenied() on your PDF/A target profile with this feature.

  • PDF/A-1 uses different language to PDF/A-2 and A-3 to describe how dates are formatted in the PDF metadata, with PDF/A-1 disallowing dates with times if no time zone is specified. In previous releases we didn't check for this. Like the change above, this may result in some files previously certified as valid now appearing as invalid. The feature for this change is XMPDateWithoutTimezone and the solution is the same as above.

  • Some other very obscure issues (e.g. https://github.com/pdf-association/pdf-issues/issues/168) have been decided by the working group in a way that didn't match our validator. We've now updated our validation rules, although the impact of most of these should be negligible.

Note that the first two items on this list remain open issues within the PDF/A working group - we're erring on the side of caution with these changes, and will update our validator to match when the issues are resolved.
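
To make the XMPDateWithoutTimezone rule described above more concrete, here is a minimal, self-contained sketch of the kind of check involved. It is not the library's implementation: the class XmpDateCheck, the method hasTimeWithoutZone and the regular expression are all assumptions made for illustration, and the pattern assumes XMP dates follow the usual ISO 8601 style (YYYY-MM-DDThh:mm, optional seconds and fraction, then "Z" or a +hh:mm / -hh:mm offset).

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the BFO API.
final class XmpDateCheck {

    // Matches a date that has a time part but ends without any zone designator.
    private static final Pattern TIME_WITHOUT_ZONE = Pattern.compile(
        "\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}(:\\d{2}(\\.\\d+)?)?");

    /** Returns true if the date includes a time but no time zone. */
    static boolean hasTimeWithoutZone(String xmpDate) {
        return TIME_WITHOUT_ZONE.matcher(xmpDate.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasTimeWithoutZone("2024-03-01T09:15:00"));       // true
        System.out.println(hasTimeWithoutZone("2024-03-01T09:15:00Z"));      // false
        System.out.println(hasTimeWithoutZone("2024-03-01T09:15:00+01:00")); // false
        System.out.println(hasTimeWithoutZone("2024-03-01"));                // false
    }
}
```

A date-only value is not flagged in this sketch because it carries no time; only values with a time and no zone are.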

We've also improved the functionality of the PDF/A conversion code.

  • When converting to PDF/A, the API requires a list of fonts for potential substitution. We were scanning this list and searching for a single font that was the best match and had all the required glyphs - if no font matched, it would fail and the page would be rasterized. This new release will instead search for a list of fonts that can be used for substitution, meaning that the chances of success are much higher. The API of OutputProfiler.FontAction has changed slightly as a result - see the API docs for details. Migration for any custom implementations is trivial.
  • Profiling a PDF containing thousands of pages is a multi-threaded operation, but while testing with some very large files (38000+ pages) containing text that shared a single font, we could see there was contention for some synchronized resources. We've rewritten these to use concurrent data structures, and performance has jumped significantly. How much will vary depending on the number of threads, which defaults to the number of CPU cores available to Java. When testing with 10 threads we saw a 50% increase in speed.
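
To show why searching for a set of fonts succeeds more often than searching for a single one, here is a minimal, self-contained sketch. It is not the library's algorithm and does not use OutputProfiler.FontAction: the class FontCover, the method chooseFonts and the greedy strategy (repeatedly pick the font that covers the most still-missing glyphs) are all assumptions made for illustration, with fonts modelled simply as sets of codepoints.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical illustration, not part of the BFO API.
final class FontCover {

    /**
     * Greedily picks fonts until every required codepoint is covered.
     * Returns an empty Optional if some codepoint is in none of the fonts.
     */
    static Optional<List<String>> chooseFonts(Set<Integer> required,
                                              Map<String, Set<Integer>> fonts) {
        Set<Integer> missing = new HashSet<>(required);
        List<String> chosen = new ArrayList<>();
        while (!missing.isEmpty()) {
            String best = null;
            int bestCount = 0;
            for (Map.Entry<String, Set<Integer>> font : fonts.entrySet()) {
                int count = 0;
                for (Integer codepoint : font.getValue()) {
                    if (missing.contains(codepoint)) {
                        count++;
                    }
                }
                // Strictly greater: on a tie the earlier font in the map wins.
                if (count > bestCount) {
                    best = font.getKey();
                    bestCount = count;
                }
            }
            if (best == null) {
                return Optional.empty(); // no font covers any remaining glyph
            }
            chosen.add(best);
            missing.removeAll(fonts.get(best));
        }
        return Optional.of(chosen);
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> fonts = new LinkedHashMap<>();
        fonts.put("Latin", new HashSet<>(Arrays.asList(65, 66, 67)));
        fonts.put("Greek", new HashSet<>(Arrays.asList(945, 946)));
        // "A", "B" and Greek alpha: no single font above has all three.
        Set<Integer> required = new HashSet<>(Arrays.asList(65, 66, 945));
        System.out.println(chooseFonts(required, fonts)); // Optional[[Latin, Greek]]
    }
}
```

In this example a single-font search would fail, because neither font contains all three codepoints, whereas the combined search succeeds with two fonts.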

Other changes

  • Many PDF processors incorrectly create the structure for PDF forms. We've adjusted the repair process in this release to try to handle some of the more exotic failures we encountered, and keep our results as close to Acrobat as possible. For a very small number of corrupt PDFs, the recovery process will change a result, which might mean the list of elements in the PDF Form is in a different order.
  • We've updated the bfopdf-jj2000.jar included in the download, after we fixed an error in the decoder. It's quite obscure, but we recommend upgrading. If you choose not to upgrade to version 2.27 of our API, you can always update and rebuild just this JAR from the GitHub source.
  • If you're using the redactor, we recommend upgrading. We fixed an issue where the resulting PDF could sometimes be structured incorrectly, causing images to go missing or warnings in Acrobat.
  • And finally, we dropped support for Java 6. The PDF Library now requires Java 7 or later to run.

The CHANGELOG has more details.

Summary

If you're working with PDF/A this release is a significant one. Conversion to PDF/A is more robust, with significantly less "Rebuild" processing - the step we run internally if our repair process has failed. Our PDF/A profiles are slightly stricter than in previous releases, ensuring that files we convert to PDF/A are unambiguously correct.

As always, the latest release is available at https://bfo.com/download.