BFO PDF Library 2.26.1
Exactly two months after the 2.26 release of our PDF library, we're pleased to announce the first revision to that major release. This one is all about polish.
A large proportion of the work was refining our PDF/A conversion routines, cross-testing against both veraPDF and Acrobat. A number of edge cases have been fixed, and in general the PDF/A output for this release is improved. In particular, our PDF/A-4 profile has had quite a few fixes: although we still haven't completed testing of conversion to PDF/A-4, verification of existing PDF/A-4 files is now in complete agreement with veraPDF (Acrobat does not yet support PDF/A-4).
There are two main things we want to highlight today: logging and test corpora.
Further stress testing against large corpora
The website of the PDF Association, of which BFO is a member, has many useful resources - not least the page describing the Apache Tika corpora of stressful PDFs. These are PDFs collected from the "issue" pages of various open-source projects, so - by definition - are problematic. They include some of the most malformed PDF files we've ever seen, and make a great acid test for any PDF product.
We've now run our API over all 20,576 files and fixed the issues we found. Many files are too damaged to recover, e.g. severely truncated. But where a file can be recovered, our API will load it, and where it can't, the reasons for failure are reported in a hopefully useful way. For our customers this means the API is less likely to choke on invalid input. In particular, some of the most pathological cases, designed to provoke infinite loops or out-of-memory conditions in the data structures, are now avoided.
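To illustrate the kind of defence this involves, here is a minimal sketch of walking a chain of "previous" pointers in an untrusted file without looping forever. It is not the library's internal code: the class ChainWalker, the map of offsets and the maxLength limit are all hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of the BFO API.
final class ChainWalker {

    /**
     * Follows a chain of "previous" pointers, where links maps an offset
     * to the offset that precedes it. The walk ends when an offset has no
     * predecessor. A repeated offset (a cycle) or an over-long chain is
     * reported as an error instead of being followed indefinitely.
     */
    static List<Long> walk(Map<Long, Long> links, long start, int maxLength) {
        List<Long> visited = new ArrayList<>();
        Set<Long> seen = new HashSet<>();
        Long current = start;
        while (current != null) {
            if (!seen.add(current)) {
                throw new IllegalStateException("Cycle detected at offset " + current);
            }
            if (visited.size() >= maxLength) {
                throw new IllegalStateException("Chain longer than " + maxLength + " entries");
            }
            visited.add(current);
            current = links.get(current); // null when the chain ends
        }
        return visited;
    }

    public static void main(String[] args) {
        // A well-formed chain: 300 -> 200 -> 100
        System.out.println(walk(Map.of(300L, 200L, 200L, 100L), 300L, 1000));
        try {
            // A malicious chain: 1 -> 2 -> 1 -> ...
            walk(Map.of(1L, 2L, 2L, 1L), 1L, 1000);
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

The length limit bounds memory use even when every offset is distinct, and the error message carries enough detail to explain why the file was rejected.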
I'd particularly like to highlight some of the improvements to signature handling that came from this process. The corpora contained many weird variations of PKCS#7 signatures we hadn't seen before (ASN.1 structures can be "somewhat vague"). Our API can now verify RSA PSS signatures, ECDSA signatures in the P1363 format, signatures using Russian Gost1360/1361 algorithms, and various types of OCSP response we hadn't encountered before. Some of these are used with retired or current national ID schemes. If it's a digitally signed PDF with a PKCS#7 signature, our API can almost certainly verify it.
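For readers unfamiliar with the P1363 format mentioned above, the following sketch uses only the standard Java security API (not our library) to show the difference between the two ECDSA encodings. It assumes Java 9 or later, where the "SHA256withECDSAinP1363Format" algorithm name is available; the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;
import java.security.spec.ECGenParameterSpec;

// Illustration using the JDK only; class and method names are hypothetical.
final class EcdsaFormats {

    static KeyPair newP256KeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("EC");
            generator.initialize(new ECGenParameterSpec("secp256r1"));
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static byte[] sign(String algorithm, KeyPair pair, byte[] data) {
        try {
            Signature signature = Signature.getInstance(algorithm);
            signature.initSign(pair.getPrivate());
            signature.update(data);
            return signature.sign();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    static boolean verify(String algorithm, KeyPair pair, byte[] data, byte[] sig) {
        try {
            Signature signature = Signature.getInstance(algorithm);
            signature.initVerify(pair.getPublic());
            signature.update(data);
            return signature.verify(sig);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        KeyPair pair = newP256KeyPair();
        byte[] data = "example".getBytes(StandardCharsets.UTF_8);

        // The usual encoding: an ASN.1 DER SEQUENCE of two INTEGERs (r, s).
        byte[] der = sign("SHA256withECDSA", pair, data);
        // P1363: r and s as fixed-width values, simply concatenated.
        byte[] p1363 = sign("SHA256withECDSAinP1363Format", pair, data);

        System.out.println("DER length:   " + der.length);   // varies slightly
        System.out.println("P1363 length: " + p1363.length); // 2 x 32 bytes for P-256
        System.out.println("P1363 verifies: "
            + verify("SHA256withECDSAinP1363Format", pair, data, p1363));
    }
}
```

The two encodings carry the same pair of numbers, so a verifier has to pick the right decoder for the bytes it finds in the signature.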
Logging cleanup
Analysing so many malformed files meant wading through a lot of error logs, which motivated us to clean up the logging. When we encounter a recoverable error, we warn about it and often include a stack trace with that warning. This can be confusing if you're not expecting it - we often get emails from customers that include stack traces from warnings, even though they're not fatal.
For this release we've audited every warning emitted by the library during "typical" operations. Wherever possible, we've included all the important information in the message itself and removed the stack trace from the report. Where multiple warnings were issued from the same root cause, we've tried to reduce them to one. Overall, warning logging will be less noisy and more useful, which should be good for everyone.
This is an ongoing project, of course, and some stack traces remain where they're required for diagnostics. Please continue to include them in your support emails if you see them and you're having problems.
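The same two ideas (put the detail in the message, and report each root cause once) can be sketched with a java.util.logging Filter. This is an illustration of the principle rather than the library's own code; the class name QuietWarningFilter is hypothetical, and it assumes messages are plain strings rather than MessageFormat patterns.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Filter;
import java.util.logging.Level;
import java.util.logging.LogRecord;

// Hypothetical example; not part of the BFO API.
final class QuietWarningFilter implements Filter {

    // Messages already reported, so repeats of the same warning are dropped.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    @Override
    public boolean isLoggable(LogRecord record) {
        if (record.getLevel() != Level.WARNING) {
            return true; // only warnings are rewritten; everything else passes
        }
        Throwable thrown = record.getThrown();
        if (thrown != null) {
            // Fold the essential detail into the message, then drop the trace.
            record.setMessage(record.getMessage() + " ("
                + thrown.getClass().getSimpleName() + ": " + thrown.getMessage() + ")");
            record.setThrown(null);
        }
        // Set.add returns false if this exact message has been seen before.
        return seen.add(record.getMessage());
    }
}
```

A filter like this would be attached to a Handler with setFilter. If you use something similar in your own application, keep an unfiltered log somewhere as well, since the full stack trace is still what we need for a support request.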
Refinements to PDF/A conversion
Finally, we've made some refinements to our PDF/A conversion process based on feedback. Where we have to rasterize a page to a bitmap, we'll store it as grayscale if it can be done without loss, reducing the size of the file. And it's now possible to supply an Executor to allow pages to be rasterized in parallel. Conversion to PDF/A can now be interrupted where necessary, and the PDF will be left in a coherent state.
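As a rough illustration of the "grayscale if it can be done without loss" test, here is a minimal sketch using java.awt.image only. It is not how the library does it internally: the class GrayscaleCheck and its methods are hypothetical, and it assumes the gray samples will be interpreted with the same transfer curve as the original RGB values.

```java
import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;

// Hypothetical example; not part of the BFO API.
final class GrayscaleCheck {

    /** True if every pixel is opaque and has equal red, green and blue. */
    static boolean isLosslesslyGray(BufferedImage image) {
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                int argb = image.getRGB(x, y);
                int a = argb >>> 24;
                int r = (argb >> 16) & 0xFF;
                int g = (argb >> 8) & 0xFF;
                int b = argb & 0xFF;
                if (a != 0xFF || r != g || g != b) {
                    return false;
                }
            }
        }
        return true;
    }

    /**
     * Copies the shared channel value into an 8-bit single-channel image.
     * Samples are written straight to the raster so that no colour-space
     * conversion alters them. Call only when isLosslesslyGray is true.
     */
    static BufferedImage toGray(BufferedImage image) {
        BufferedImage gray = new BufferedImage(
            image.getWidth(), image.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster raster = gray.getRaster();
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                raster.setSample(x, y, 0, image.getRGB(x, y) & 0xFF);
            }
        }
        return gray;
    }
}
```

One byte per pixel instead of three is where the reduction in file size comes from, before any compression is applied.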
Mostly, however, we've had a major drive to reduce the number of rebuilds during conversion. As described in our last post on this topic, a rebuild is what happens when our conversion to PDF/A has failed: as a last resort, we clear out the PDF and restore only known-good data. This is not something we want to do, as data is inevitably lost.
The number of PDFs requiring a rebuild during conversion will be a lot lower in this release. The main remaining cause is what we call architectural problems - a random dictionary, array, string, etc. in a data structure we don't recognise that is too large to meet the requirements of PDF/A. In fact, excluding those (and one other cause we aim to fix for the next release), we're now down to one single rebuild out of 18,400 files! We're very pleased with this number. More analysis will follow in another article.
Download
As always, you can download the latest version and upgrade any time.