BFO PDF Library 2.29.5
Late last week we released 2.29.5 of our PDF Library. Despite being a "point" release it's unusually large, with a diff of 27,000 lines. We will attempt to summarise the changes slightly more concisely.
Memory optimizations
This is the first item for good reason - everyone likes an optimization. In the previous release we did a lot of work on speed optimizations, but the memory optimizations that were the focus of this release are a much harder problem, particularly in a 25-year-old API - all of the easy things have already been done. However, profiling on extremely large documents showed there were two areas where we could really make a difference:
- Paths - the most fundamental unit of any vector graphics, a path is a sequence of lines and curves. Fonts, vector graphics - these are entirely made up of paths. While the Java API has the Path2D class, we use our own implementation which has allowed us to make a few optimizations internally. These optimizations apply across the board to all graphical operations, especially rendering, text extraction and so on.
- Structured Documents - the Structure Tree that is used for tagged PDF formats like PDF/UA has seen a lot of optimizations to reduce the size as much as possible.
For workflows like validating very large PDF/UA files, we're seeing a memory reduction of about 20% as a result of these changes. Smaller documents will see less benefit, of course, but it all helps.
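To make "a sequence of lines and curves" concrete, here is a minimal sketch using the standard Java Path2D class mentioned above. It illustrates the concept only and says nothing about our internal implementation; the class name PathSegments and the countSegments helper are made up for this example.

```java
import java.awt.geom.Path2D;
import java.awt.geom.PathIterator;

public class PathSegments {

    /** Returns { number of line segments, number of curve segments } in the path. */
    static int[] countSegments(Path2D path) {
        int lines = 0;
        int curves = 0;
        double[] coords = new double[6]; // a cubic curve needs up to 6 coordinates
        for (PathIterator it = path.getPathIterator(null); !it.isDone(); it.next()) {
            switch (it.currentSegment(coords)) {
                case PathIterator.SEG_LINETO:
                    lines++;
                    break;
                case PathIterator.SEG_QUADTO:
                case PathIterator.SEG_CUBICTO:
                    curves++;
                    break;
                default:
                    break; // SEG_MOVETO and SEG_CLOSE draw nothing themselves
            }
        }
        return new int[] { lines, curves };
    }

    public static void main(String[] args) {
        // Path2D.Float stores coordinates as floats, Path2D.Double as doubles
        Path2D path = new Path2D.Float();
        path.moveTo(0, 0);
        path.lineTo(10, 0);                // one straight line
        path.curveTo(10, 5, 5, 10, 0, 10); // one cubic Bezier curve
        path.closePath();

        int[] counts = countSegments(path);
        System.out.println("lines=" + counts[0] + " curves=" + counts[1]);
    }
}
```

A glyph outline or a piece of vector artwork is just a longer version of the same thing, which is why savings per path add up quickly on very large documents.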
Right-to-left text
One of the other focus areas for this release has been improvements for right-to-left text.
This is a topic that is under active discussion within the PDF Association, and more changes may come as a result. But for now we're generally satisfied that the new approach gives better results. It has meant some changes to the way text from non-tagged documents is extracted, but these mostly just apply to whitespace.
PDF/UA rules
The PDF/UA Technical Working Group (BFO is an active member) has been discussing the exact rules around the "Scope" attribute for table headers, which has led to some changes to the PDF/UA validation profiles. These apply mostly to edge cases, and documents that were previously considered well-formed will generally remain so.
Arlington Model updates
We've updated the Arlington Model rules for the first time in two years. Although this is only ever informative in our API, it remains a useful way of looking at PDF and is run as part of our OutputProfiler class when validating a PDF. We also, finally, identified a thread race that could, on rare occasions, cause an issue to be identified twice, fixing a long-standing annoyance in our regression testing.
PDF to HTML derivation
There have been quite a few improvements to our PDF to HTML derivation routines, via the HtmlDerivation class. This is an area we're working on actively and the improvements are based on testing with a good number of documents. Improvements include proper extraction of PDF Figure objects, and where those consist of white content (i.e. white text on a colored background), we will automatically outline it to make it visible above the white background of the HTML.
Viewer event changes
Our Swing viewer classes are driven by Events, like most Swing applications, but if an AWT event (like a mouse-click on a button) resulted in a PDF Property Change event indicating a change to the PDF model - such as a form field changing value - there was no way to correlate one event with the other. This made it impossible to determine if a field changed due to a user interaction, or programmatically.
We emit a standard Java PropertyChangeEvent from our API when the model changes, and this has no way to identify the cause of the event. But it does have a Propagation ID which is "reserved for future use" according to the Java API documentation. We're now setting it to the AWT event that triggered the change in the PDF model - a slightly clunky solution to a real problem which we hope will be useful to some.
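As a rough sketch of how a listener might use this, the example below relies only on the standard java.beans and java.awt classes. The events are constructed by hand to stand in for those the viewer would emit; the class name ChangeOrigin, the "value" property name and the placeholder field object are assumptions made for illustration and are not part of our API.

```java
import java.awt.AWTEvent;
import java.awt.event.ActionEvent;
import java.beans.PropertyChangeEvent;
import java.beans.PropertyChangeListener;

public class ChangeOrigin {

    /** Returns the AWT event recorded as the cause of the change, or null if there is none. */
    static AWTEvent causeOf(PropertyChangeEvent event) {
        Object id = event.getPropagationId();
        return id instanceof AWTEvent ? (AWTEvent) id : null;
    }

    /** Classifies a change as user-driven or programmatic, based on the propagation ID. */
    static String describe(PropertyChangeEvent event) {
        return causeOf(event) == null ? "programmatic" : "user interaction";
    }

    public static void main(String[] args) {
        Object field = new Object(); // hypothetical stand-in for a form field

        PropertyChangeListener listener = e ->
            System.out.println(e.getPropertyName() + " changed: " + describe(e));

        // A change made from code: no propagation ID is set
        PropertyChangeEvent fromCode = new PropertyChangeEvent(field, "value", "a", "b");
        listener.propertyChange(fromCode);

        // A change triggered by an AWT event: the event is carried as the propagation ID
        PropertyChangeEvent fromUser = new PropertyChangeEvent(field, "value", "b", "c");
        fromUser.setPropagationId(new ActionEvent(field, ActionEvent.ACTION_PERFORMED, "click"));
        listener.propertyChange(fromUser);
    }
}
```

Checking the type with instanceof rather than casting directly keeps the listener safe if the propagation ID is null or is ever used for something else.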
Summary
More detail is in the CHANGELOG. These changes will be applied to new Report Generator and BFO Publisher releases in the very near future. Download from https://bfo.com/download.