BFO PDF Library 2.25 - what's new?

Our first update for 2021 is fairly significant we think - the changelog has got rather long as we resolved issues, so here's a summary of the highlights.

Memory Footprint

The two things guaranteed to interest every customer are memory and speed improvements. The PDF API started life over 20 years ago, so you'd have thought any scope for across-the-board memory improvements would be long gone. We certainly did, but we were wrong.

We've managed to squeeze out about a 10% reduction in memory for all workflows. Careful examination of the basic objects in the PDF structure showed there were places we could make savings: for example, a structure that had 6 booleans and an int (10 bytes) has been reduced to 4 bytes by packing the booleans into the upper 6 bits of the int, which only ever needed 24 bits anyway. Another structure is the PDF object reference - a 32-bit object number and a 16-bit generation. For the majority of objects the generation is zero, so we've made two implementations, one with the generation hard-coded to zero. That saves two bytes per object.
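To make those two tricks concrete, here is a minimal sketch in Java. It is an illustration only, not the library's real code: the names PackedFlags, ObjectRef, Gen0Ref and GenRef are hypothetical, and so is the exact bit layout (value in the low 24 bits, flags in bits 26 to 31).

```java
/** Hypothetical structure: a 24-bit value plus six booleans in one int. */
final class PackedFlags {
    private static final int VALUE_MASK = 0x00FFFFFF; // low 24 bits hold the value
    private static final int FLAG_SHIFT = 26;         // flags live in bits 26..31
    static final int FLAG_COUNT = 6;

    private int bits; // the only field: 4 bytes instead of 6 booleans + an int

    int getValue() {
        return bits & VALUE_MASK;
    }

    void setValue(int value) {
        if ((value & ~VALUE_MASK) != 0) {
            throw new IllegalArgumentException("value does not fit in 24 bits");
        }
        bits = (bits & ~VALUE_MASK) | value; // keep the flags, replace the value
    }

    boolean getFlag(int index) {
        return (bits & mask(index)) != 0;
    }

    void setFlag(int index, boolean on) {
        if (on) {
            bits |= mask(index);
        } else {
            bits &= ~mask(index);
        }
    }

    private static int mask(int index) {
        if (index < 0 || index >= FLAG_COUNT) {
            throw new IllegalArgumentException("no such flag: " + index);
        }
        return 1 << (FLAG_SHIFT + index);
    }
}

/** Hypothetical object reference: 32-bit number, 16-bit generation. */
interface ObjectRef {
    int number();
    int generation();

    /** Picks the smaller implementation whenever the generation is zero. */
    static ObjectRef of(int number, int generation) {
        if (generation < 0 || generation > 0xFFFF) {
            throw new IllegalArgumentException("generation out of range");
        }
        return generation == 0 ? new Gen0Ref(number) : new GenRef(number, (char) generation);
    }
}

/** The common case: generation is hard-coded, so no field is needed for it. */
final class Gen0Ref implements ObjectRef {
    private final int number;
    Gen0Ref(int number) { this.number = number; }
    public int number() { return number; }
    public int generation() { return 0; }
}

/** The rare case: generation stored in an unsigned 16-bit field. */
final class GenRef implements ObjectRef {
    private final int number;
    private final char generation;
    GenRef(int number, char generation) { this.number = number; this.generation = generation; }
    public int number() { return number; }
    public int generation() { return generation; }
}
```

Whether a JVM actually turns each saved field into saved heap depends on its object header size and field alignment, so a sketch like this should be measured rather than assumed.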

None of this sounds like much, but the structure of the PDF held in memory is composed entirely of these sorts of objects - large binary streams for images and pages typically remain on disk. So even small savings add up. Of course these sorts of tricks do not make for nice code, but fortunately for our customers that's our problem to manage.

Memory Footprint during linearization

Linearization is the process of rearranging the PDF objects and including structures that allow a PDF viewer like Acrobat to download only the bits of the file it needs for the current page. Perhaps more useful in the dark ages before universal broadband, it still has a place for really large documents - think 10000 pages, or 10GB files.

It involves either knowing the size of some objects before writing them, or seeking back into the file to fill in the length later. As we write to a stream, not a file, we had to either hold the serialized objects in memory or dump them to disk temporarily with a DiskCache. After a few tweaks, testing showed we could write the data twice, counting and discarding the bytes the first time round, with no measurable loss in speed.
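The write-twice idea can be sketched in a few lines of Java. This is a simplified illustration under our own assumptions, not the library's implementation: TwoPassWriter, Serializer and CountingSink are hypothetical names, and the "Length" prefix stands in for whatever structure needs the size up front.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

final class TwoPassWriter {

    /** Hypothetical callback that serializes one object to a stream. */
    interface Serializer {
        void writeTo(OutputStream out) throws IOException;
    }

    /** An OutputStream that counts the bytes it is given and discards them. */
    static final class CountingSink extends OutputStream {
        private long count;

        @Override
        public void write(int b) {
            count++;
        }

        @Override
        public void write(byte[] buf, int off, int len) {
            count += len;
        }

        long getCount() {
            return count;
        }
    }

    /** Pass 1: serialize into the sink purely to learn the length. */
    static long measure(Serializer s) throws IOException {
        CountingSink sink = new CountingSink();
        s.writeTo(sink);
        return sink.getCount();
    }

    /** Writes the length first, then the body, without ever buffering the body. */
    static void writeWithLength(Serializer s, OutputStream out) throws IOException {
        long length = measure(s);
        out.write(("Length " + length + "\n").getBytes(StandardCharsets.US_ASCII));
        s.writeTo(out); // pass 2: the real write
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeWithLength(o -> o.write("hello world".getBytes(StandardCharsets.US_ASCII)), out);
        System.out.print(new String(out.toByteArray(), StandardCharsets.US_ASCII));
    }
}
```

The approach relies on the serializer producing exactly the same bytes on both passes. The trade is CPU time for memory: the object is serialized twice, but nothing larger than the serializer's own working buffers is ever held.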

What this means is a huge reduction in memory footprint when saving a linearized PDF - for one large test case we reduced the heap requirements from 900MB to 80MB. Of course at that size most customers would be using a DiskCache anyway, but for mid-sized files being held in memory, this reduction will really help. If you're linearizing and already using a DiskCache, that's fine - it will just be used less, if at all. No changes required.

XMP

We've been doing a lot of work with a particular customer on PDF/A conversion. Managing the metadata has always been an issue. The previous release introduced our new XMP class, and this release has seen a lot of polishing of this class. To give you an idea of what it can do:

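// "source" and "target" are XMP objects; java.util.Map must be imported.
// Reuse the extension schemas from the source wherever the target accepts them.
for (XMP.Schema schema : source.getSchemas()) {
  try {
    target.addSchema(schema);
  } catch (ProfileComplianceException e) {
    // The target rejected this schema: skip it and carry on.
  }
}
// Copy every property whose key and value are both of a known type and valid.
for (Map.Entry<XMP.Property,XMP.Value> entry : source.getValues().entrySet()) {
  XMP.Property key = entry.getKey();
  XMP.Value value = entry.getValue();
  if (!key.getType().isUndefined() && key.isValid() && !value.getType().isUndefined()) {
    target.set(key, value.clone(target));
  }
}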

That will migrate any valid content from a source XMP object to target, in such a way that the target is guaranteed to be valid for PDF/A - reusing valid extension schema from the source PDF, and adding a new one where required for 350 or so properties from dozens of common schema we know about. If you've spent as much time on XMP as we have, that in itself is remarkable. Metadata is important, and finding a way to preserve as much as possible when converting a PDF to PDF/A has always been a lot more work than it should be. PDF/A-4 recognises this and removes a lot of the validation requirements, but PDF/A-1, 2 and 3 will be with us for a long time yet.

PDF/A-4

As mentioned a couple of months back, PDF/A-4 is coming. It's now official and we've removed the "draft" label from the profile.

Color

For many, many years we've been using a not-quite-correct method of specifying the sRGB colorspace in PDFs we generate - originally to avoid bugs in the Java color classes. We wrote those classes out of our code a couple of years ago with our ICCColorSpace class, and this release finally removes our workaround from the generated PDFs. It also uses the correct value for the CIE Illuminant D50: we had previously been using the "nCIEXYZ D50 Illuminant", a 16-bit approximation. The values differ by about 0.03%, which surprisingly is still perceptible on some tests.

This will mean nothing to almost everyone, but if you notice a barely perceptible shift in the colors in PDFs we generate, or when rasterizing a PDF to bitmap with our API, that's why.

PKCS#11 signatures

A minor change was required to match the process for applying PKCS#11 signatures to the expectations of some of the native libraries that communicate with the device. This meant a change to a public interface, and because this will require code changes if you're implementing that interface (PKCSSignatureHandler.SigningEngine) we bumped the version number to 2.25. You're almost certainly not implementing it, and if you are, we probably know about it and will be sending you an email to make sure everything is working for you.

Summary

There's quite a lot going on in this release, and lots of other issues we've barely touched on. But if you're working with linearized PDF, or are converting to PDF/A and want to manage your metadata, this release is highly recommended. Details are in the CHANGELOG as always. Download from http://bfo.com/download.