BFO PDF Library 2.27
We released version 2.27 of our PDF Library. If you're working with PDF/A
this is a significant release - read on for details.
PDF/A
Many of the changes in this release come from work we've been doing with the
Apache Tika Corpora,
a collection of (currently) 32,580 PDF files submitted to public issue lists or bug
trackers for various
projects.
About a year ago we published
an analysis of PDF/A conversion on about 15,000 files
from another corpus, Govdocs1.
If Govdocs1 contains typical PDF documents, the Apache Tika collection
does not. It consists entirely of PDFs that were submitted as part of a bug report,
so these files tend to be very poorly structured. Many were deliberately intended
to stress-test PDF software.
Perhaps foolishly, our goal with this release was to convert as many of these as possible
to PDF/A.
We've largely achieved that, loading 98.8% of the files and successfully converting
all but 4.
Of the files we couldn't load, most are not actually PDF files, or are so truncated
they have no catalog or pages.
We hope to publish a proper write-up of this later, but what matters for this article is
that we had
to make a lot of fixes to the API to handle the various types of corruption we encountered.
Most of the changes in this release are fixes to ensure PDF/A conversion can better
handle malformed files.
Some of the changes go further - partly to address bugs in our code, partly to add new functionality,
and partly to bring our PDF/A implementation up to date with the latest errata from the
PDF/A Technical Working Group (of which BFO is a member). Here's a summary.
-
Our API wasn't previously identifying images that were damaged or truncated as being
invalid for PDF/A.
Rejecting such images isn't explicitly listed as a requirement for PDF/A, although it is implied, and a brief
survey shows this test isn't consistently implemented across PDF/A verifiers. We've
decided to disallow damaged images in PDF/A, so that we can repair them if encountered. The
new
ImageDamaged
feature is part of all the PDF/A profiles, which means that files previously certified
as
valid by our API may now appear invalid. To revert to the old behaviour, call clearDenied()
with this feature on your PDF/A target profile.
-
PDF/A-1 uses different language to PDF/A-2 and A-3 to describe how dates are formatted
in the PDF metadata,
with PDF/A-1 disallowing dates with times if no time-zone is specified.
In previous releases we didn't check for this.
Like the change above, this may result in some files previously certified as valid
now appearing as invalid. The feature for this change is
XMPDateWithoutTimezone
and the solution is the same as above.
-
Some other very obscure issues
(eg https://github.com/pdf-association/pdf-issues/issues/168)
have been decided by the working group in a way that didn't match our validator. We've
now updated our validation rules,
although the impact of most of these should be negligible.
Note that the first two items on this list remain open issues within the PDF/A working
group - we're erring
on the side of caution with these changes, and will update our validator to match
when the issues are resolved.
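To make the XMPDateWithoutTimezone rule above concrete, here is a minimal sketch of the kind of test involved, assuming XMP dates use the usual ISO 8601 style (for example 2024-01-15T10:30:00+01:00). The class and method names are hypothetical and this is not the code used inside the PDF Library - it only illustrates the idea that a date carrying a time should also carry a time-zone.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only - not part of the BFO API.
class XmpDateCheck {

    // A time-zone designator at the end of the value: "Z", "+hh:mm" or "-hh:mm".
    private static final Pattern TIMEZONE_SUFFIX = Pattern.compile("(Z|[+-]\\d{2}:\\d{2})$");

    /**
     * Returns true if the date has a time component ("T...") but no
     * time-zone designator, which is the case PDF/A-1 is described as
     * disallowing. Date-only values such as "2024-01-15" return false.
     */
    static boolean hasTimeWithoutTimezone(String xmpDate) {
        String value = xmpDate.trim();
        boolean hasTime = value.indexOf('T') >= 0;
        return hasTime && !TIMEZONE_SUFFIX.matcher(value).find();
    }

    public static void main(String[] args) {
        System.out.println(hasTimeWithoutTimezone("2024-01-15"));                // false
        System.out.println(hasTimeWithoutTimezone("2024-01-15T10:30:00Z"));      // false
        System.out.println(hasTimeWithoutTimezone("2024-01-15T10:30:00+01:00")); // false
        System.out.println(hasTimeWithoutTimezone("2024-01-15T10:30:00"));       // true
    }
}
```

A check like this can be run over metadata you generate yourself, to spot dates that a stricter PDF/A-1 validator might now reject.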
We've also improved the functionality of the PDF/A conversion code.
-
When converting to PDF/A, the API requires a list of fonts for potential substitution.
Previously we scanned this list
for the single font that was the best match and had all the required glyphs - if no font matched,
the substitution would fail and the page would be rasterized.
This release will instead search for a list of fonts that can be used for substitution, meaning that
the chances of success are much higher. The API of
OutputProfiler.FontAction
has changed slightly as a result - see the API docs for details. Migration
for any custom implementations
is trivial.
-
Profiling a PDF containing thousands of pages is a multi-threaded operation,
but while testing with some very large files (38000+ pages) containing text
that shared a single font, we could see there was contention for some synchronized
resources.
We've rewritten these to use concurrent data structures, and performance has jumped
significantly. How much will vary depending on the number of threads, which defaults
to the
number of CPU cores available to Java. When testing with 10 threads we saw a 50% increase
in
speed.
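The difference between the two font substitution strategies described above can be sketched in a few lines. This is a simplified illustration and not the library's implementation: it models each candidate font as a name mapped to the set of Unicode code points it supports, ignores how well each font matches the original, and every name in it (FontSubstitutionSketch, findSingleFont, findFontList, codePoints) is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only - not part of the BFO API.
class FontSubstitutionSketch {

    /** Collects the distinct code points used in a piece of text. */
    static Set<Integer> codePoints(String text) {
        Set<Integer> set = new LinkedHashSet<>();
        for (int i = 0; i < text.length(); i += Character.charCount(text.codePointAt(i))) {
            set.add(text.codePointAt(i));
        }
        return set;
    }

    /** Single-font strategy: the first font covering every required code point, or null. */
    static String findSingleFont(Map<String, Set<Integer>> candidates, Set<Integer> required) {
        for (Map.Entry<String, Set<Integer>> font : candidates.entrySet()) {
            if (font.getValue().containsAll(required)) {
                return font.getKey();
            }
        }
        return null;
    }

    /**
     * Font-list strategy: walks the candidates in order and keeps every font
     * that supplies at least one code point still missing. Returns the chosen
     * fonts, or null if the candidates together cannot cover the text.
     */
    static List<String> findFontList(Map<String, Set<Integer>> candidates, Set<Integer> required) {
        Set<Integer> missing = new LinkedHashSet<>(required);
        List<String> chosen = new ArrayList<>();
        for (Map.Entry<String, Set<Integer>> font : candidates.entrySet()) {
            if (missing.isEmpty()) {
                break;
            }
            // removeAll returns true only if this font covered something still missing
            if (missing.removeAll(font.getValue())) {
                chosen.add(font.getKey());
            }
        }
        return missing.isEmpty() ? chosen : null;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> fonts = new LinkedHashMap<>();
        fonts.put("Latin", codePoints("abc"));
        fonts.put("Greek", codePoints("\u03b1\u03b2\u03b3"));
        Set<Integer> required = codePoints("a\u03b2"); // one Latin and one Greek letter

        System.out.println(findSingleFont(fonts, required)); // null
        System.out.println(findFontList(fonts, required));   // [Latin, Greek]
    }
}
```

In this toy example no single font covers the mixed Latin and Greek text, so the single-font strategy gives up, while the list strategy succeeds by combining two fonts. That is the sense in which searching for a list raises the chances of avoiding rasterization.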
Other changes
-
Many PDF processors incorrectly create the structure for PDF forms.
We've adjusted the repair process in this release to try and handle
some of the more exotic failures we encountered, and keep our results
as close to Acrobat as possible. For a very small number
of corrupt PDFs, the recovery process will now produce a different result, which might mean the list
of elements in the PDF Form is in
a different order.
-
We've updated the
bfopdf-jj2000.jar
included in the download,
after we fixed
an error
in the decoder.
It's quite obscure, but we recommend upgrading. If you choose not to upgrade to version
2.27 of our API,
you can always update and rebuild just this JAR from the GitHub source.
-
If you're using the redactor, we recommend upgrading. We fixed an issue where the
resulting PDF could sometimes be
structured incorrectly, causing images to be missing or warnings to appear in Acrobat.
-
And finally, we dropped support for Java 6. The PDF Library now requires Java 7 or
later to run.
The
CHANGELOG
has more details.
Summary
If you're working with PDF/A this release is a significant one.
Conversion to PDF/A is more robust, with significantly less "Rebuild" processing -
the step we run internally if our repair process has failed. Our PDF/A profiles
are slightly stricter than in previous releases, ensuring that files we convert
to PDF/A are unambiguously correct.
As always, the latest release is available at https://bfo.com/download.