What on earth is PDF/UA?
One of the more recent "sub-standards" of PDF to emerge from ISO is PDF/UA, which is also known as ISO-14289. The "UA" here stands for Universal Accessibility, and like PDF/A, PDF/X, PDF/E etc., PDF/UA imposes a particular set of rules on how the PDF is created: in this case, rules intended to make reading a PDF easier for those using assistive technology, such as screen readers for the partially sighted.
The PDF Association (of which BFO is a member) have published an excellent primer on this as an e-book, "PDF/UA in a Nutshell", which goes into a lot more detail than we can here.
Even if PDF/UA is still unfamiliar, you may have heard of some of these initiatives.
- If you're dealing with US government documents you've probably encountered Section 508, the requirement for US Government documents to be accessible for people with disabilities.
- The EU has EN 301 549, a similar standard for ICT procurement in the EU, although unlike Section 508 this is currently voluntary.
- You may also have heard of Web Content Accessibility Guidelines, an initiative to ensure that online content (which can include PDFs) is accessible.
All of the above guidelines are fairly general, and PDF/UA (described in ISO-14289) is the set of PDF-specific requirements to meet these guidelines. So a PDF/UA-compliant document is compliant with Section 508 and WCAG.
So how does PDF/UA relate to PDF/A?
The three releases of PDF/A to date have all specified a conformance level, and up until now our API has only supported conformance level "B". Conformance level "A" is stricter, and requires the PDF content to be tagged, to provide some structure to the content of the PDF. This is what PDF/A-1a, PDF/A-2a and PDF/A-3a have in common with PDF/UA, and why our 2.20 release adds support for both creating and validating PDF/A-1a, PDF/A-2a and PDF/A-3a documents.
Given the goal for both specifications is to ensure the semantic content of the PDF is known and accessible, we were surprised to find quite a few differences in how this was achieved in the two specifications: the section of the PDF/A specification referring to tagged PDF compliance insists only that a conforming file "... meets all of the requirements set forth for Tagged PDF in ISO 32000-1:2008, 14.8". It turns out that most of these requirements are more like guidelines really, and in this respect our PDF/A profiles will match Acrobat: to achieve conformance level A, a PDF/A document must have a structure in place that uses only the "standard" tags. There are no restrictions on how these tags are applied.
PDF/UA has a tighter, but not incompatible, set of restrictions, which makes it possible to create a PDF that meets the requirements of both specifications.
The StructuredPDF.java example included with our download shows how to do this by creating a PDF that is both PDF/A-3a and PDF/UA compliant. The ability to target two different conformance profiles is new in this release, and quite easy to do:
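    import java.awt.color.ColorSpace;
    import java.awt.color.ICC_Profile;
    import org.faceless.pdf2.*;

    // Start from a PDF/A-3a profile, then merge in the PDF/UA-1 requirements
    ICC_Profile icc = ICC_Profile.getInstance(ColorSpace.CS_sRGB);
    OutputProfile profile = new OutputProfile(OutputProfile.PDFA3a, "sRGB", null, "http://www.color.org", null, icc);
    profile.merge(OutputProfile.PDFUA1);
    PDF pdf = new PDF(profile);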
How to create PDF/UA documents with the BFO PDF Library
The most conspicuous requirement is for the PDF to be "Tagged" with structural content. This interleaves an XML-like tag hierarchy into the document content, assigning text and graphics to familiar elements like Paragraphs and Articles. This must be done while the PDF is being created: although it's possible to add these tags to the document after creation with tools like Acrobat, it is not something we'd expect to be done programmatically as it requires visual analysis of the document.
With our API, adding these tags is done with the beginTag and endTag methods on the PDFPage, PDFCanvas and LayoutBox classes, to inject the XML-like tag structure into the PDF content while it's being created.
The use of these methods is pretty simple, and as described above, we include a full example with our PDF API called StructuredPDF.java which shows how they work.
There are other technical requirements relating to how the tree content is structured; like HTML, a TD must be inside a TR, for instance. Unfortunately the language in this section of the specification is not as exact as we would like (an assertion that needs backing up, for which I'd direct you to the footnote at the end of this article). The result of this ambiguity is that there are points on which our tool, Acrobat and the PDF/UA-specific PDF Accessibility Checker (PAC) tool disagree. Generally these are edge-cases, and those creating regular tagged documents, rather than test cases, should have no issues.
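To make the nesting idea concrete, here is a minimal sketch of a begin/end tag recorder that rejects a TD outside a TR. The class TagNestingSketch and its parent-rule table are hypothetical and purely illustrative: they only mirror the shape of the beginTag and endTag calls, are not part of our API, and cover just a small subset of the structural rules.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

/** Hypothetical stand-in for a tagged page; NOT part of the BFO API. */
final class TagNestingSketch {
    // Child tag -> the parent it must sit directly inside (illustrative subset only).
    private static final Map<String, String> REQUIRED_PARENT =
            Map.of("TD", "TR", "TH", "TR", "LI", "L");

    // Stack of currently open tags, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    /** Opens a tag, rejecting it if its required parent is not the innermost open tag. */
    void beginTag(String tag) {
        String required = REQUIRED_PARENT.get(tag);
        if (required != null && !required.equals(open.peek())) {
            throw new IllegalStateException(tag + " must be inside " + required);
        }
        open.push(tag);
    }

    /** Closes the innermost open tag and returns its name. */
    String endTag() {
        if (open.isEmpty()) {
            throw new IllegalStateException("endTag without matching beginTag");
        }
        return open.pop();
    }

    /** True once every beginTag has been matched by an endTag. */
    boolean isBalanced() {
        return open.isEmpty();
    }

    public static void main(String[] args) {
        TagNestingSketch page = new TagNestingSketch();
        page.beginTag("Table");
        page.beginTag("TR");
        page.beginTag("TD");   // allowed: the innermost open tag is TR
        page.endTag();
        page.endTag();
        page.endTag();
        System.out.println("balanced: " + page.isBalanced());

        try {
            page.beginTag("TD");   // rejected: no TR is open
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A real validator has far more rules to consider, and as noted above, the tools don't always agree on the edge-cases.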
There are requirements for PDF/UA beyond tags, some of which are listed below:
Meet the requirements above and you'll be able to generate a PDF that is PDF/UA compliant, and validates in Acrobat (the "Accessibility Report") as well as more specific PDF/UA tools like the PAC utility.
How to verify compliance of a PDF/UA document
A key difference between PDF/UA and something like PDF/A is that not all of the requirements for PDF/UA can be verified electronically. For example, the tags added to the PDF must be in logical reading order, which is not something we can establish.
Acrobat seems to acknowledge this by putting its PDF/UA validation functionality under the Accessibility tab, rather than in Preflight with the PDF/A validators. I think this is a shame myself, and I hope that PDF/UA compliance checking will become a "first class" test as for PDF/A in future releases of Acrobat.
Our API doesn't make this distinction, and verifying a document that claims to be PDF/UA compliant is done the same way as with PDF/A. The Dump.java example we include with the PDF Library package will verify any profiles the PDF claims to meet - here's an excerpt showing how we identify this:
    if (profile.isSet(OutputProfile.Feature.InfoMeetsPDFA2a)) {
        // validate PDF against PDF/A-2a
    } else if (profile.isSet(OutputProfile.Feature.InfoMeetsPDFA2b)) {
        // validate PDF against PDF/A-2b
    }
    if (profile.isSet(OutputProfile.Feature.InfoMeetsPDFUA1)) {
        // validate PDF against PDF/UA-1
    }
In practice we expect PDF/UA validation to be done less often than validating PDF/A, but if it becomes a requirement of your workflow then our API will be able to help.
Other uses for PDF structure
PDF/UA and PDF/A are the big uses for the Structure Tree which underpins the Tags described above. We do have two new methods in this release which relate to the document structure: the first is PDFParser.getStructureTree(). This returns any structured content in the PDF (PDF/UA compliant or otherwise) as a W3C DOM Document. How useful this will be depends on how well structured the tags were when they were added. Based on our testing so far, the answer to this is very mixed.
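Because the result is a plain W3C DOM Document, the standard Java XML APIs are enough to inspect it. The sketch below counts how often each element name occurs, which is one quick way to gauge how well a document is tagged. It parses a hand-written XML string as a stand-in for a real structure tree; the class StructureTreeWalk, the sample content, and the assumption that element names correspond to tag names are all ours for illustration, so check what your own documents actually return.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper for inspecting a structure tree held as a W3C DOM Document. */
final class StructureTreeWalk {

    /** Counts element names in the order they are first encountered. */
    static Map<String, Integer> countTags(Document doc) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        walk(doc.getDocumentElement(), counts);
        return counts;
    }

    // Depth-first traversal; only element nodes are counted, text nodes are skipped.
    private static void walk(Node node, Map<String, Integer> counts) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            counts.merge(node.getNodeName(), 1, Integer::sum);
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, counts);
        }
    }

    /** Builds a sample Document; a stand-in for a tree obtained from a real PDF. */
    static Document sampleTree() {
        String xml = "<Document><H1>Title</H1><P>One</P><P>Two</P>"
                   + "<Table><TR><TD>a</TD><TD>b</TD></TR></Table></Document>";
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("sample XML failed to parse", e);
        }
    }

    public static void main(String[] args) {
        // For the sample above this reports two P elements and two TD elements.
        System.out.println(countTags(sampleTree()));
    }
}
```

With a real PDF, the Document passed to countTags would be the one returned by getStructureTree() rather than the sample.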
The second is PDF.rebuildStructureTree(), which will attempt to rebuild the internal data structures. This is particularly useful if you've been moving content from one document to another: with this release and this method, it's possible to move a page with structured content from one PDF to another and keep the structured content intact.
We don't do this automatically because it's an expensive operation, and usually it's not required; for example, if you're concatenating a number of PDFs together, chances are you don't really care if one of them happens to have an internal structure. For the few times where this does matter and structure is to be preserved, simply call this method before calling render().
Summary
PDF/UA has been around for four years, but we have been seeing a gradual increase in interest from our customers over the last year or so. It's an exciting and underused area of PDF, and we're very pleased to be able to finally support it properly.
We plan to add the creation of tags to our Report Generator in the near future. Until then we hope the features described above will be useful to those customers that have a need to create accessible documents today.
For those customers we would recommend the StructuredPDF.java example mentioned above as a starting point, along with the PDFCanvas.beginTag() API documentation.
Footnote: ambiguity in the spec
While this is likely to be of interest to very, very few people, I'm going to back up my earlier claim that the specification is a bit inexact in this area. It doesn't help that the ISO 32000, ISO 14289 and Matterhorn Protocol documents all define the same thing in slightly different ways and yet reference each other; however the definitive document is ISO 14289.
This isn't intended as a dig at any of the authors of the software or specifications below, and isn't saying anything that hasn't already been said before: the language referred to below has already been significantly improved in the upcoming PDF 2.0 specification. It's simply meant to illustrate the unintended consequences of a specification that a) references other specifications that are written with a different set of terminology - I'm looking at you, ISO 32000, b) is published after (or, worse, written to match the behaviour of) the reference implementation (Acrobat), and c) describes the test conditions with words, rather than by reference to a series of test-cases.
If nothing else, it may illustrate for our customers - and the customers of other PDF products in the marketplace - why the various products don't always agree on the specification. The PDF Association is behind active efforts to improve this situation.
Here are a few examples: