New features in the PDF Library 2.16

We've finally released version 2.16 of our PDF Library. We've been working on this one for some time, and while it doesn't look too different on the outside, internally there are a lot of changes.

Parser rewrite

The changes mostly revolve around our "parser" code, which scans the PDF content and does something with it: converts it to a bitmap, extracts the text, redacts it, and so on. The previous version of the code had "grown organically", which is a euphemism that hides a lot of horror. It had become pretty unwieldy, with no separation of the various actions. Text extraction actually ran a "convert to bitmap" operation, but with a special Graphics2D object that discarded the result!

Our new version is much smarter internally. It consists of an engine which parses the page content, and one or more "cogs" driven by the engine which can run independently: text extraction, redaction and so on. The advantage for us is code clarity, and the advantage for our customers is that we can turn around fixes and add functionality in this area much more easily. For now the API remains the same, but we might consider opening up the "engine" that drives these cogs in a future release.
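To make the engine-and-cogs idea concrete, here is a minimal sketch of that kind of design. Every name in it (ContentEvent, Cog, ContentEngine and the two cogs) is hypothetical and invented for illustration. None of them are part of our API, and the real engine works on PDF content streams rather than a simple list of events.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout: none of these types exist in the PDF Library.

/** One event emitted while scanning page content. */
final class ContentEvent {
    final String kind;   // "text" or "image" in this sketch
    final String value;
    ContentEvent(String kind, String value) {
        this.kind = kind;
        this.value = value;
    }
}

/** A "cog": receives events from the engine and does one job with them. */
interface Cog {
    void handle(ContentEvent event);
}

/** Collects the text it sees and ignores everything else. */
final class TextExtractionCog implements Cog {
    private final StringBuilder text = new StringBuilder();
    @Override public void handle(ContentEvent event) {
        if (event.kind.equals("text")) {
            if (text.length() > 0) text.append(' ');
            text.append(event.value);
        }
    }
    String getText() { return text.toString(); }
}

/** Counts images, with no knowledge of any other cog. */
final class ImageCountCog implements Cog {
    private int count;
    @Override public void handle(ContentEvent event) {
        if (event.kind.equals("image")) count++;
    }
    int getCount() { return count; }
}

/** The "engine": walks the content once and drives every registered cog. */
final class ContentEngine {
    private final List<Cog> cogs = new ArrayList<>();
    ContentEngine add(Cog cog) {
        cogs.add(cog);
        return this;
    }
    void run(List<ContentEvent> content) {
        for (ContentEvent event : content) {
            for (Cog cog : cogs) cog.handle(event);
        }
    }
}
```

The point of the split is that text extraction no longer has to pretend to be a renderer: each cog only sees the events it cares about, and a new cog can be added without touching the others.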

The 2.16 release today lays the groundwork for a lot of things you'll be seeing over the coming months, including preflighting for PDF/A and PDF/X, and overprint simulation when rendering. For those of you who have been waiting for these, thank you for your patience. We're almost there.

Redaction

Redaction is one of the "cogs" for this engine and it's now working much better than in the previous release, which tended to remove way more than was needed.

It will also now refuse to redact digitally signed files. A digitally signed PDF always retains an unmodifiable copy of the document as it was at the time of signing, so that the signature can be verified. Redacting the file would leave this signed, unredacted version intact inside the PDF. This isn't necessarily obvious, so the new release will refuse to redact if a signature is in place. This, and a huge number of other improvements, mean that if you're redacting documents with our API then upgrading is definitely worthwhile.
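The "refuse rather than leak" behaviour can be sketched as a simple guard. This is an illustration only: Doc and Redactor are hypothetical stand-ins, not classes from our API, and the exception type is an assumption rather than a statement of what the library actually throws.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a document; not the library's PDF class.
final class Doc {
    final boolean signed;
    final List<String> lines;
    Doc(boolean signed, List<String> lines) {
        this.signed = signed;
        this.lines = new ArrayList<>(lines);
    }
}

final class Redactor {
    /**
     * Replaces every occurrence of 'secret' with a marker, but only if the
     * document is unsigned. A signed document keeps its original bytes for
     * verification, so editing it would leave the secret recoverable.
     */
    static void redact(Doc doc, String secret) {
        if (doc.signed) {
            // Assumed exception type, chosen for the sketch.
            throw new IllegalStateException(
                "Document is digitally signed; refusing to redact");
        }
        for (int i = 0; i < doc.lines.size(); i++) {
            doc.lines.set(i, doc.lines.get(i).replace(secret, "[REDACTED]"));
        }
    }
}
```

Failing loudly is the safer choice here: a redaction that silently succeeds while the original content is still in the file is worse than no redaction at all.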

XFA processing

It's fair to say we're not huge fans of XFA here at BFO (I've ranted about it before), and our distaste is partly due to ambiguity and gaps in the specification. For this release we've augmented our internal documentation with the results of several hundred carefully controlled tests to determine what actually happens in the areas the specification is quiet on. Our rewritten XFA parser is based on this testing, and it should remove the need for some of the hacks, warnings and workarounds that have plagued customers with XFA documents until now. If you're working heavily with XFA documents, this release will help.

Generics

Generics were added to Java in version 5, back in 2004, and a mere ten years later we've finally finished implementing them throughout our API. Although many methods were converted in 2.15.2, this release completes the job for the viewer and supplementary classes. For those of you still using Java 4 our compatibility Jar will continue to do the job without them, but for the rest this will give a little more type-safety where required.
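As a reminder of what that type-safety buys you, here is a small comparison using plain java.util collections. The class and method names (GenericsDemo, rawNames, typedNames, totalLength) are made up for this sketch and are not methods from our API.

```java
import java.util.ArrayList;
import java.util.List;

final class GenericsDemo {
    // Pre-generics style: a raw List, so callers must cast every element
    // and a wrong cast only fails at runtime.
    @SuppressWarnings({"rawtypes", "unchecked"})
    static List rawNames() {
        List names = new ArrayList();
        names.add("Title");
        names.add("Author");
        return names;
    }

    // Generic style: the element type is part of the signature, so the
    // compiler rejects anything that isn't a String.
    static List<String> typedNames() {
        List<String> names = new ArrayList<>();
        names.add("Title");
        names.add("Author");
        return names;
    }

    // No casts needed when iterating a List<String>.
    static int totalLength(List<String> names) {
        int total = 0;
        for (String name : names) total += name.length();
        return total;
    }
}
```

With the raw version a caller writes `(String) rawNames().get(0)` and hopes for the best. With the typed version the cast disappears and mistakes show up at compile time.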

Document Structure

Accessible PDFs can be created using the beginTag and endTag methods on PDFPage, PDFCanvas and LayoutBox. These were added to the API some time ago, but didn't work in all situations. They've been rewritten and we've added a new example to show how to use them called StructuredPDF.java.
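The key discipline with beginTag and endTag is that the calls must be balanced and properly nested. The sketch below shows that pairing using a hypothetical stand-in class, TagRecorder, which simply records the structure as text. It is not our API, and the parameters of the real methods are not reproduced here, so see StructuredPDF.java for the actual calls.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in, NOT the library's PDFPage, PDFCanvas or LayoutBox.
// It only demonstrates balanced, nested begin/end calls.
final class TagRecorder {
    private final Deque<String> open = new ArrayDeque<>();
    private final StringBuilder out = new StringBuilder();

    /** Opens a structure element such as "Sect", "H1" or "P". */
    void beginTag(String tag) {
        open.push(tag);
        out.append('<').append(tag).append('>');
    }

    /** Stand-in for drawing content inside the current element. */
    void drawText(String text) {
        out.append(text);
    }

    /** Closes the most recently opened element. */
    void endTag() {
        if (open.isEmpty()) {
            throw new IllegalStateException("endTag without matching beginTag");
        }
        out.append("</").append(open.pop()).append('>');
    }

    /** Returns the recorded structure, failing if any element is still open. */
    String finish() {
        if (!open.isEmpty()) {
            throw new IllegalStateException("Unclosed tag: " + open.peek());
        }
        return out.toString();
    }
}
```

Used like this, a heading and a paragraph inside a section produce a well-formed tree:

```java
TagRecorder page = new TagRecorder();
page.beginTag("Sect");
page.beginTag("H1"); page.drawText("Title"); page.endTag();
page.beginTag("P");  page.drawText("Body");  page.endTag();
page.endTag();
String tree = page.finish();  // "<Sect><H1>Title</H1><P>Body</P></Sect>"
```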

All in all this is a large release, and with 700 commits and roughly 25,000 lines changed, you'll understand why we've spent most of the last month testing. It's finally out and we hope it was worth the wait.