BFO PDF Library 2.26 - a significant release

We're pleased to release version 2.26 of our PDF Library today. This one is a bit delayed, as we were talking to a client about PDF/A conversion and decided it was time to bring some of the boilerplate code we recommend to customers for PDF/A conversion in house, packaged into an API. So although the release was ready to ship, we decided to hold it off while we implemented this. "A couple of weeks", we said.

That was in April.

Today, at the end of July 2021, we've finally got this to a stage where we're happy with it. There's a lot going on in this revision, so over the next couple of weeks we'll be publishing more articles with detailed information on some areas. This article will serve as our introduction to 2.26. Normally the diff between successive releases is about 5,000 lines or so, going up to 10,000 for significant new functionality. This one comes in at 46,800 lines.

Speed increase

We finally managed to tune our profiler to the sweet spot between giving only a rough idea of what's happening and being so intrusive that the profiling itself warps the measurements, which let us usefully integrate it into our nightly test run. This highlighted some code paths that we hadn't previously identified as needing optimization.

In particular, validation of certifying signatures should be at least an order of magnitude faster for most cases. And the time taken to fully profile a large PDF has roughly halved. Not all operations will see such improvements, but for many use-cases it will be faster.

Portfolios

PDFs can have file attachments, and when a PDF contains very little except file attachments, it's called a Portfolio by Adobe Acrobat. We've had support for file attachments for a long time, but managing a large number of attachments, like you might find in a Portfolio PDF, needed work.

New is the ability to sort attachments into folders, add custom fields and (if the viewing application supports it) control how those fields are displayed, most of which is in the new Portfolio class. The canonical example of why you'd want to do this is email archiving; storing a digitally signed archive of emails, presented as PDF, and being able to sort by sender, subject, date and so on. This is all now possible and we hope to write an article showing how to do this over the next few weeks.

Conversion to PDF/A

The bulk of the changes for 2.26 relate to this. We're going to have a full article on this topic shortly but the headline is that we've been testing heavily with a set of 5,000 PDF documents, converting them to PDF/A-1, A-2 and A-3. Detail on this will follow, but as part of this we've been doing further comparative testing against both Acrobat and veraPDF, and our PDF/A profiles have adjusted slightly as a result (where differences remain we know why, and discussions in the PDF/A working groups are ongoing).

Stress testing

We've been bulk-testing with some of the PDF corpora that are becoming available, many of which focus on PDFs which - deliberately or accidentally - stress PDF parsers. This is an ongoing process, but a number of bugs (including some infinite loops, stack overflow and out-of-memory conditions) have already been fixed as we work through the roughly 55,000 files we now have to test with.

XMP metadata and Embedded Files

These two are lumped together because they're - I believe - the only two data structures which can appear literally anywhere within a PDF. It's possible to attach a file to a page in a PDF or an element in the StructureTree, and it's possible to set XMP metadata on a font, or even an ICC ColorSpace. We've had limited support for this for some years, with the getMetaData and setMetaData methods on a number of objects.

What's new in this release is a getXMP method alongside these, allowing metadata to be managed with our XMP class. Various objects, like PDFPage and PDFAnnotation, now have a getAssociatedFiles() method to manage the files attached to them. The OutputProfile class has new getXMPs() and getAssociatedFiles() methods to retrieve metadata and associated files from anywhere in the PDF, even places we can't set them. This is necessary because validating and (if required) removing them from the PDF is a part of PDF/A validation that we'd handled poorly in previous releases.

With these changes, we can manage these objects and repair them automatically, which we do as part of PDF/A conversion.

Digital Signatures

As more countries are rolling out digital identity infrastructure, we're adapting our API to meet their requirements. Some minor changes have been made to bring our signature process more into line with Acrobat. In particular, LTV signatures require an OCSP response embedded in the PDF. But if it's older than 5 minutes, Acrobat will revalidate it when the signature loads, which causes problems if the OCSP responder is password protected (as it currently is for Singapore's PKI). We've shortened our cache time from 24 hours to 5 minutes to match Acrobat's behaviour, and made a few other fixes for compatibility with other systems.
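
The cache rule described here boils down to a simple age check. The following is a minimal sketch in plain Java, not the library's implementation: the class and method names (OcspFreshness, isFresh) are hypothetical, and only the 5-minute limit comes from the text above.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not part of the BFO API: it illustrates the age rule only.
final class OcspFreshness {
    // The 5-minute limit described above (the previous cache time was 24 hours).
    static final Duration MAX_AGE = Duration.ofMinutes(5);

    // Returns true if a response produced at 'producedAt' may still be reused at 'now'.
    static boolean isFresh(Instant producedAt, Instant now) {
        Duration age = Duration.between(producedAt, now);
        // A response dated in the future is treated as not fresh.
        return !age.isNegative() && age.compareTo(MAX_AGE) <= 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2021-07-30T12:00:00Z");
        // A 4-minute-old response is within the limit; a 1-hour-old one is not.
        System.out.println(isFresh(now.minus(Duration.ofMinutes(4)), now));
        System.out.println(isFresh(now.minus(Duration.ofHours(1)), now));
    }
}
```

Under the old 24-hour rule the second response in main would have been reused. With the shorter limit it is refetched before signing, so the viewer should have no reason to contact the responder itself.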

Viewer customization

Our Swing PDF viewer has not had a visual refresh for some time - not necessarily a bad thing, as consistency is important in a UI. But for those wanting to customize it (and we do get requests) it hasn't been easy. While experimenting with various Look & Feel classes, we discovered FlatLAF, which we like a lot, in particular for the ability to configure most aspects of the UI using configuration files.

We've now added the same to our viewer, using a configuration file format identical to FlatLAF. This will make it easy to change icons, font sizes and other visual aspects of the viewer without wading into the code. We'll admit this feature is probably overdue by about a decade.

Viewer as a WebApp

Last but not least, we've done some testing and made a few changes to allow integration with Webswing, a very clever product which turns a Swing application into a Web application by translating Swing components running on a server to an HTML5 Canvas in a browser. This is an interesting option for anyone wanting to keep an existing Swing application built around our viewer, but with the benefits of running it on the web. We'll have more on this topic soon.

API changes

All of the above has resulted in some minor API changes, which are listed in the CHANGELOG as always. But for clarity, we'll relist them here. We don't expect any of these to be a major issue:

  • PDF.getXMP() never returns null - previously it would do so if the PDF had no XMP metadata, or the data there was invalid. To test for these conditions, use pdf.getXMP().isEmpty() and pdf.getXMP().isValid().
  • The OutputProfile.ColorAction interface has had some methods added to it - and, because we still aim for Java 5 compatibility, we can't implement these with default methods. In the unlikely event you're implementing your own ColorAction, you'll need to add these new methods, which can just return null. The OutputProfile.ProcessColorAction implementation of this interface we ship has had a method, setAllowUncalibrated(), removed with no direct replacement - as a half-hearted method of dealing with the color requirements of PDF/A, it wasn't compatible with the new approach.
  • Several methods in OutputProfile relating to OutputIntents are deprecated, but still function. We have a new OutputIntent class to encapsulate this functionality.
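
For the first change in the list, code that used to compare the result of getXMP() with null needs a small adjustment. The sketch below shows the shape of that migration. To keep it compilable without the library it uses a hypothetical stand-in interface, XmpView, which declares only the isEmpty() and isValid() methods named above. In real code the object would come from pdf.getXMP(). Treating "not empty and valid" as the replacement for the old null test is our assumption for illustration, and your own code may want to handle the two cases separately.

```java
// Sketch only: XmpView is a hypothetical stand-in for the library's XMP class,
// declaring just the two methods named in the list above.
final class XmpCheck {
    interface XmpView {
        boolean isEmpty();
        boolean isValid();
    }

    // Before 2.26 callers tested "pdf.getXMP() != null". From 2.26 the result
    // is never null, so the same question is asked of the object itself.
    static boolean hasUsableXmp(XmpView xmp) {
        return !xmp.isEmpty() && xmp.isValid();
    }

    // Builds a stand-in with fixed answers, for demonstration purposes.
    static XmpView fixed(final boolean empty, final boolean valid) {
        return new XmpView() {
            public boolean isEmpty() { return empty; }
            public boolean isValid() { return valid; }
        };
    }

    public static void main(String[] args) {
        System.out.println(hasUsableXmp(fixed(false, true)));
        System.out.println(hasUsableXmp(fixed(true, true)));
        System.out.println(hasUsableXmp(fixed(false, false)));
    }
}
```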

No more pack200

Our Pack200 compressed Jar is gone, mainly because some of our team build with Java 14, where pack200 no longer works. The applet classes remain, for now.

Conclusion

We've had a very, very busy few months. But the result is what we hope is one of our best releases yet. Available for download, as always, at https://bfo.com/download/.