BFO PDF Library 2.29.5

Late last week we released 2.29.5 of our PDF Library. Despite being a "point" release, it's unusually large, with a diff of 27,000 lines. We will attempt to summarise the changes slightly more concisely.

Memory optimizations

This is the first item for good reason - everyone likes an optimization. In the previous release we did a lot of work on speed optimizations, but the memory optimizations that were the focus of this release are a much harder problem, particularly in a 25-year-old API where all of the easy things have already been done. However, profiling on extremely large documents showed there were two areas where we could make a real difference:

  • Paths - the most fundamental unit of any vector graphics, a path is a sequence of lines and curves. Fonts, vector graphics - these are entirely made up of paths. While the Java API has the Path2D class, we use our own implementation which has allowed us to make a few optimizations internally. These optimizations apply across the board to all graphical operations, especially rendering, text extraction and so on.
  • Structured Documents - the Structure Tree that is used for tagged PDF formats like PDF/UA has seen a lot of optimizations to reduce the size as much as possible.
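For readers less familiar with paths, here is a minimal sketch of what "a sequence of lines and curves" looks like in code. It uses the JDK's own Path2D and PathIterator classes purely for illustration; it is not our internal path implementation, and the class and method names (PathSegments, buildPath, countSegments) are made up for this example.

```java
import java.awt.geom.Path2D;
import java.awt.geom.PathIterator;

// Illustrative only: this uses the JDK's Path2D, not BFO's internal path class.
class PathSegments {

    /** Builds a small closed shape from one straight line and one cubic curve. */
    static Path2D buildPath() {
        Path2D.Float path = new Path2D.Float();
        path.moveTo(0f, 0f);                      // start a new subpath
        path.lineTo(10f, 0f);                     // straight line
        path.curveTo(10f, 5f, 5f, 10f, 0f, 10f);  // cubic Bezier curve
        path.closePath();                         // back to the starting point
        return path;
    }

    /** Walks the path and counts its segments (move, line, curve and close operations). */
    static int countSegments(Path2D path) {
        int count = 0;
        float[] coords = new float[6];            // a cubic segment needs up to six values
        for (PathIterator it = path.getPathIterator(null); !it.isDone(); it.next()) {
            it.currentSegment(coords);
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countSegments(buildPath()) + " segments");
    }
}
```

Every glyph outline and vector shape on a page boils down to segment lists like this one, which is why savings in the path representation show up across so many operations.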

For workflows like validating very large PDF/UA files, we're seeing a memory reduction of about 20% as a result of these changes. Smaller documents will see less benefit, of course, but it all helps.

Right-to-left text

One of the other focus areas for this release has been improvements for right-to-left text.

  • When creating a PDF, right-to-left content is now always written to the page in logical order - previously it was written left-to-right. There will be no difference visually, but when it comes to extracting text, including for accessibility purposes, the fact the content is now coming out in logical order means that even when the PDF is not tagged, the text should extract correctly. We also fixed an issue where visually identical characters in Farsi and Arabic were sometimes given the wrong code. An equivalent for Latin scripts would be mis-identifying the Greek Omega symbol Ω (U+03A9) as the Ohm symbol Ω (U+2126) - visually identical, but very confusing for a screen reader.
  • When extracting text from tagged documents, we've adopted what testing has shown are some general principles that will give correct results. Arabic and Hebrew PDFs are created in many, many different ways and very few of them have accessibility in mind. Not all will give good results, but in general we're seeing improvements in files created by Microsoft Office, LibreOffice and "Print-to-PDF" from Chrome and Firefox. In particular we are now converting the shaped "display" forms for Arabic glyphs back to their nominal forms when extracting.
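To make the two character-level points above concrete, here is a small sketch using the JDK's java.text.Normalizer. This is an illustration of the underlying Unicode relationships, not a description of how the library does it internally: the class name LookalikeCharacters and its helper methods are hypothetical, and using Unicode normalization as the mapping is an assumption made for the example.

```java
import java.text.Normalizer;

// Illustrative only: the JDK Normalizer stands in for whatever mapping
// a PDF text extractor actually applies.
class LookalikeCharacters {

    static final String GREEK_OMEGA = "\u03A9";    // GREEK CAPITAL LETTER OMEGA
    static final String OHM_SIGN = "\u2126";       // OHM SIGN, looks the same
    static final String ALEF = "\u0627";           // ARABIC LETTER ALEF (nominal form)
    static final String ALEF_FINAL = "\uFE8E";     // ARABIC LETTER ALEF FINAL FORM (shaped)

    /** Canonical normalization: folds canonically equivalent characters together. */
    static String canonical(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Compatibility normalization: also maps shaped presentation forms to nominal letters. */
    static String nominal(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Identical on screen, but different code points.
        System.out.println("Raw codes equal:        " + GREEK_OMEGA.equals(OHM_SIGN));
        System.out.println("After NFC equal:        " + canonical(OHM_SIGN).equals(GREEK_OMEGA));
        // A shaped "display" form converted back to the nominal letter.
        System.out.println("Shaped Alef to nominal: " + nominal(ALEF_FINAL).equals(ALEF));
    }
}
```

Note that normalization does not merge every look-alike pair. Letters that Unicode treats as genuinely distinct stay distinct, so a producer has to choose the right code for the language in the first place.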

This is a topic that is under active discussion within the PDF Association, and more changes may come as a result. But for now we're generally satisfied that the new approach gives better results. It has meant some changes to the way text from non-tagged documents is extracted, but these mostly just apply to whitespace.

PDF/UA rules

The PDF/UA Technical Working Group (BFO is an active member) has been discussing the exact rules around the "Scope" attribute for table headers, and this has led to some changes to the PDF/UA validation profiles. These apply mostly to edge cases, and documents that were previously considered well-formed will generally remain so.

Arlington Model updates

We've updated the Arlington Model rules for the first time in two years. Although this is only ever informative in our API, it remains a useful way of looking at PDF and is run as part of our OutputProfiler class when validating a PDF. We also, finally, identified a thread race that could, on rare occasions, cause an issue to be identified twice, fixing a long-standing annoyance in our regression testing.

PDF to HTML derivation

There have been quite a few improvements to our PDF to HTML derivation routines, via the HtmlDerivation class. This is an area we're working on actively, and the changes are based on testing with a good number of documents. They include proper extraction of PDF Figure objects, and where those consist of white content (i.e. white text on a colored background), we will automatically outline it to make it visible above the white background of the HTML.

Viewer event changes

Our Swing viewer classes are driven by Events, like most Swing applications, but if an AWT event (like a mouse-click on a button) resulted in a PDF Property Change event indicating a change to the PDF model - such as a form field changing value - there was no way to correlate one event with the other. This made it impossible to determine if a field changed due to a user interaction, or programmatically.

We emit a standard Java PropertyChangeEvent from our API when the model changes, and this has no way to identify the cause of the event. But it does have a Propagation ID which is "reserved for future use" according to the Java API documentation. We're now setting it to the AWT event that triggered the change in the PDF model - a slightly clunky solution to a real problem which we hope will be useful to some.
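As a rough sketch of how a listener might use this, the example below relies only on standard JDK classes (PropertyChangeEvent, ActionEvent). The class PropagationIdDemo, its helper methods and the stand-in "field" object are hypothetical, and the events are built by hand here rather than coming from the viewer. It also assumes that a change made from code carries no AWT event in its propagation ID, which is worth checking against your own application.

```java
import java.awt.AWTEvent;
import java.awt.event.ActionEvent;
import java.beans.PropertyChangeEvent;

// Illustrative only: events are constructed manually to simulate the viewer.
class PropagationIdDemo {

    /** Returns the AWT event recorded as the cause of the change, or null if there is none. */
    static AWTEvent causeOf(PropertyChangeEvent event) {
        Object id = event.getPropagationId();
        return id instanceof AWTEvent ? (AWTEvent) id : null;
    }

    /** Assumption: no AWT event in the propagation ID means the change was made from code. */
    static String describe(PropertyChangeEvent event) {
        return causeOf(event) != null ? "user interaction" : "programmatic";
    }

    public static void main(String[] args) {
        Object field = new Object();  // hypothetical stand-in for a form field

        // Simulate a change caused by a button click.
        ActionEvent click = new ActionEvent(field, ActionEvent.ACTION_PERFORMED, "click");
        PropertyChangeEvent fromUser = new PropertyChangeEvent(field, "value", "old", "new");
        fromUser.setPropagationId(click);

        // Simulate a change made directly from code, with no propagation ID set.
        PropertyChangeEvent fromCode = new PropertyChangeEvent(field, "value", "old", "new");

        System.out.println(describe(fromUser));
        System.out.println(describe(fromCode));
    }
}
```

In a real application the same check would sit inside a PropertyChangeListener, giving you a single place to decide whether to treat a change as something the user did.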

Summary

More detail is in the CHANGELOG. These changes will be applied to new Report Generator and BFO Publisher releases in the very near future. Download from https://bfo.com/download.