Creation of PDF/UA and PDF/A-3a documents

What on earth is PDF/UA?

One of the more recent "sub-standards" of PDF to emerge from ISO is PDF/UA, which is also known as ISO-14289. The "UA" here stands for Universal Accessibility, and like PDF/A, PDF/X, PDF/E etc., PDF/UA imposes a particular set of rules on how the PDF is created: in this case, rules intended to make reading a PDF easier for those using assistive technology, such as screen readers for the partially sighted.

The PDF Association (of which BFO is a member) have published an excellent primer on this as an e-book, PDF/UA in a Nutshell, which goes into a lot more detail than we can here.

Even if PDF/UA is still unfamiliar, you may have heard of some of these initiatives.

  • If you're dealing with US government documents you've probably encountered Section 508, the requirement for US Government documents to be accessible for people with disabilities.
  • The EU has EN 301 549, a similar standard for ICT procurement, although unlike Section 508 this is currently voluntary.
  • You may also have heard of Web Content Accessibility Guidelines, an initiative to ensure that online content (which can include PDFs) is accessible.

All of the above guidelines are fairly general, and PDF/UA (described in ISO-14289) is the set of PDF-specific requirements to meet these guidelines. So a PDF/UA-compliant document is compliant with Section 508 and WCAG.

So how does PDF/UA relate to PDF/A?

The three releases of PDF/A to date have all specified a conformance level, and up until now our API has only supported conformance level "B". Conformance level "A" is stricter, and requires the PDF content to be tagged, to provide some structure to the content of the PDF. This is what PDF/A-1a, PDF/A-2a and PDF/A-3a have in common with PDF/UA, and why our 2.20 release adds support for both creating and validating PDF/A-1a, PDF/A-2a and PDF/A-3a documents.

Given the goal for both specifications is to ensure the semantic content of the PDF is known and accessible, we were surprised to find quite a few differences in how this was achieved in the two specifications: the section of the PDF/A specification referring to tagged PDF compliance insists only that a conforming file "... meets all of the requirements set forth for Tagged PDF in ISO 32000-1:2008, 14.8". It turns out that most of these requirements are more like guidelines really, and in this respect our PDF/A profiles will match Acrobat: to achieve conformance level A, a PDF/A document must have a structure in place that uses only the "standard" tags. There are no restrictions on how these tags are applied.

PDF/UA has a tighter, but not incompatible, set of restrictions, which makes it possible to create a PDF that meets the requirements of both specifications.

The StructuredPDF.java example included with our download shows how to do this by creating a PDF that is both PDF/A-3a and PDF/UA compliant. The ability to target two different conformance profiles is new in this release, and quite easy to do:

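  // Requires java.awt.color.ColorSpace, java.awt.color.ICC_Profile and the PDF Library classes
  ICC_Profile icc = ICC_Profile.getInstance(ColorSpace.CS_sRGB);
  // Start from PDF/A-3a, with sRGB as the output intent
  OutputProfile profile = new OutputProfile(OutputProfile.PDFA3a, "sRGB", null, "http://www.color.org", null, icc);
  // Merge in the PDF/UA-1 requirements so the document targets both profiles
  profile.merge(OutputProfile.PDFUA1);
  PDF pdf = new PDF(profile);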
  

How to create PDF/UA documents with the BFO PDF Library

The most conspicuous requirement is for the PDF to be "Tagged" with structural content. This interleaves an XML-like tag hierarchy into the document content, assigning text and graphics to familiar elements like Paragraphs and Articles. This must be done while the PDF is being created: although it's possible to add these tags to the document after creation with tools like Acrobat, it is not something we'd expect to be done programmatically as it requires visual analysis of the document.

With our API, adding these tags is done with the beginTag and endTag methods on the PDFPage, PDFCanvas and LayoutBox classes, to inject the XML-like tag structure into the PDF content while it's being created.

The use of these methods is pretty simple, and as described above, we include a full example with our PDF API called StructuredPDF.java which shows how they work.
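To make the begin/end discipline concrete without depending on the library, here is a minimal sketch of the idea in plain Java. The TagOutline class and its begin(), text(), end() and finish() methods are hypothetical stand-ins written for this illustration; they are not part of our API, and the real beginTag and endTag methods write into the PDF content rather than into a string. The point is only that every tag opened must be closed, and that tags nest like XML elements.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in, NOT the PDF Library API: it records an XML-like
// outline so the nesting produced by paired begin/end calls is visible.
public class TagOutline {
    private final Deque<String> open = new ArrayDeque<>();
    private final StringBuilder out = new StringBuilder();

    // Open a tag; everything added until the matching end() belongs to it
    public TagOutline begin(String tag) {
        out.append("  ".repeat(open.size())).append('<').append(tag).append(">\n");
        open.push(tag);
        return this;
    }

    // Add some content inside the currently open tag
    public TagOutline text(String s) {
        out.append("  ".repeat(open.size())).append(s).append('\n');
        return this;
    }

    // Close the most recently opened tag
    public TagOutline end() {
        if (open.isEmpty()) {
            throw new IllegalStateException("end() without matching begin()");
        }
        String tag = open.pop();
        out.append("  ".repeat(open.size())).append("</").append(tag).append(">\n");
        return this;
    }

    // Return the outline, refusing to do so if any tag is still open
    public String finish() {
        if (!open.isEmpty()) {
            throw new IllegalStateException("unclosed tag: " + open.peek());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String outline = new TagOutline()
            .begin("Document")
              .begin("H1").text("An Accessible PDF").end()
              .begin("P").text("Some body text.").end()
            .end()
            .finish();
        System.out.print(outline);
    }
}
```

The same pattern applies when tagging real content: open the tag, draw the text or graphics that belong to it, then close it before opening a sibling.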

There are other technical requirements relating to how the tree content is structured; like HTML, a TD must be inside a TR, for instance. Unfortunately the language in this section of the specification is not as exact as we would like (an assertion that needs backing up, for which I'd direct you to the footnote at the end of this article). The result of this ambiguity is that there are points on which our tool, Acrobat and the PDF-UA specific PDF Accessibility Checker (PAC) tool disagree. Generally these are edge-cases, and those creating regular tagged documents, rather than test cases, should have no issues.
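As an illustration of the kind of nesting rule involved, the sketch below checks a small tree of tags against a table of permitted parents. Everything in it is an assumption made for the example: the StructureNesting class, its Node type and the handful of rules listed are ours, cover only a few tags, and are not the logic used by our API, Acrobat or PAC.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: checks a few parent/child rules on a tree of tags.
public class StructureNesting {
    // Minimal stand-in for a structure element: a tag name plus its children
    public static final class Node {
        final String tag;
        final List<Node> children;
        Node(String tag, Node... children) {
            this.tag = tag;
            this.children = List.of(children);
        }
    }

    public static Node node(String tag, Node... children) {
        return new Node(tag, children);
    }

    // Illustrative rules only, not the complete set from ISO 32000-1, 14.8
    static final Map<String, Set<String>> ALLOWED_PARENTS = Map.of(
        "TR", Set.of("Table", "THead", "TBody", "TFoot"),
        "TH", Set.of("TR"),
        "TD", Set.of("TR"),
        "LI", Set.of("L"));

    // Returns one message per element found under a parent its rule forbids
    public static List<String> check(Node root) {
        List<String> problems = new ArrayList<>();
        walk(root, problems);
        return problems;
    }

    private static void walk(Node parent, List<String> problems) {
        for (Node child : parent.children) {
            Set<String> allowed = ALLOWED_PARENTS.get(child.tag);
            if (allowed != null && !allowed.contains(parent.tag)) {
                problems.add(child.tag + " is not allowed inside " + parent.tag);
            }
            walk(child, problems);
        }
    }

    public static void main(String[] args) {
        Node good = node("Document", node("Table", node("TR", node("TH"), node("TD"))));
        Node bad = node("Document", node("Table", node("TD")));
        System.out.println(check(good));
        System.out.println(check(bad));
    }
}
```

A table-driven check like this handles the clear-cut cases. The ambiguous ones, such as text placed directly inside a TR, are exactly where a rule table has to pick an interpretation.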

PDF/UA has requirements beyond tags, some of which are listed below:

  1. All fonts have to be embedded. This requirement will come as no surprise to anyone that has worked with PDF/A. Unembedded fonts are very much viewer-dependent. (Aside: I'm writing this in March 2017, some 24 years after the PDF specification was first published, yet in the last month we've had at least two emails from customers concerned about characters missing from their PDFs, both caused by a sub-standard font installed on their customer's system. In one case the "minus" character was missing from an engineering document, which frankly has disaster written all over it. I still wince thinking about this one. If you're creating a PDF for general distribution, you should always embed your fonts - PDF/UA or not.)
  2. The document language must be set. This is easily done by calling PDF.setLocale() to set the overall language of the PDF. If the document is in multiple languages, text added to a LayoutBox can override this Locale as necessary.
  3. The document title must be set and displayed in the window title. The title can be easily set by calling pdf.setInfo("Title", "An Accessible PDF"), and to ask the PDF viewer to display it in the window title, pdf.setOption("view.displayDocTitle", Boolean.TRUE). Provided you set a title, the rest will be done for you if you're targeting the PDF/UA profile.
  4. All annotations that don't have text content must have a description. This is good practice even for non-accessible documents, and is done by calling PDFAnnotation.setContents(). The primary case for this is hyperlinks, but it can be done for any annotation. This is similar in function to the HTML "alt" attribute.

Meet the requirements above and you'll be able to generate a PDF that is PDF/UA compliant, and validates in Acrobat (the "Accessibility Report") as well as more specific PDF/UA tools like the PAC utility.

How to verify compliance of a PDF/UA document

A key difference between PDF/UA and something like PDF/A is that not all of the requirements for PDF/UA can be verified electronically. For example, the tags added to the PDF must be in logical reading order, which is not something we can establish.

Acrobat seems to acknowledge this by putting its PDF/UA validation functionality under the Accessibility tab, rather than in Preflight with the PDF/A validators. I think this is a shame myself, and I hope that PDF/UA compliance checking will become a "first class" test as for PDF/A in future releases of Acrobat.

Our API doesn't make this distinction, and verifying a document that claims to be PDF/UA compliant is done the same way as with PDF/A. The Dump.java example we include with the PDF Library package will verify any profiles the PDF claims to meet - here's an excerpt showing how we identify this:

      if (profile.isSet(OutputProfile.Feature.InfoMeetsPDFA2a)) {
         // validate PDF against PDF/A-2a
      } else if (profile.isSet(OutputProfile.Feature.InfoMeetsPDFA2b)) {
         // validate PDF against PDF/A-2b
      }
      if (profile.isSet(OutputProfile.Feature.InfoMeetsPDFUA1)) {
         // validate PDF against PDF/UA-1
      }
  

In practice we expect PDF/UA validation to be done less often than validating PDF/A, but if it becomes a requirement of your workflow then our API will be able to help.

Other uses for PDF structure

PDF/UA and PDF/A are the big uses for the Structure Tree which underpins the Tags described above. We do have two new methods in this release which relate to the document structure: the first is PDFParser.getStructureTree(). This returns any structured content in the PDF (PDF/UA compliant or otherwise) as a W3C DOM Document. How useful this will be depends on how well structured the tags were when they were added. Based on our testing so far, the answer to this is very mixed.
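Because the result is a standard W3C DOM Document, the usual JDK tooling applies to it. The sketch below counts how often each element name occurs, which is a quick way to judge how well a document has been tagged. It makes two assumptions: the sample XML is a hypothetical stand-in for whatever getStructureTree() returns, and we assume for the illustration that element names in that DOM correspond to tag names. The StructureTreeSummary class and its methods are ours, not part of the API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical sketch: summarises a structure tree held as a W3C DOM Document.
public class StructureTreeSummary {
    // Count the occurrences of each element name, sorted by name
    public static Map<String, Integer> countTags(Document doc) {
        Map<String, Integer> counts = new TreeMap<>();
        NodeList all = doc.getElementsByTagName("*"); // "*" matches every element
        for (int i = 0; i < all.getLength(); i++) {
            counts.merge(all.item(i).getNodeName(), 1, Integer::sum);
        }
        return counts;
    }

    // Helper for the demo only: builds a DOM from an XML string
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the Document a real PDF would provide
        String sample = "<Document><H1/><P/><P/><Table><TR><TH/><TD/></TR></Table></Document>";
        System.out.println(countTags(parse(sample)));
    }
}
```

A document where almost everything is a P, or where tables have no TH at all, usually indicates tags that were added mechanically rather than with the reader in mind.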

The second is PDF.rebuildStructureTree(), which will attempt to rebuild the internal data structures. This is particularly useful if you've been moving content from one document to another: with this release and this method, it's possible to move a page with structured content from one PDF to another and keep the structured content intact.

We don't do this automatically because it's an expensive operation, and usually it's not required; for example, if you're concatenating a number of PDFs together, chances are you don't really care if one of them happens to have an internal structure. For the few times where this does matter and structure is to be preserved, simply call this method before calling render().

Summary

PDF/UA has been around for four years, but we have been seeing a gradual increase in interest from our customers over the last year or so. It's an exciting and underused area of PDF, and we're very pleased to be able to finally support it properly.

We plan to add the creation of tags to our Report Generator in the near future. Until then we hope the features described above will be useful to those customers that have a need to create accessible documents today.

For those customers we would recommend the StructuredPDF.java example mentioned above as a starting point, along with the PDFCanvas.beginTag() API documentation.

Footnote: ambiguity in the spec

While likely to be of interest to very, very few people, I'm going to back up my earlier claim that the specification is a bit inexact in this area. It doesn't help that the ISO32000, ISO14289 and Matterhorn Protocol documents all define the same thing in slightly different ways and yet reference each other; however the definitive document is ISO14289.

This isn't intended as a dig at any of the authors of the software or specifications below, and isn't saying anything that hasn't been said before: the language referred to below has already been significantly improved in the upcoming PDF 2.0 specification. It's simply meant to illustrate the unintended consequences of a specification that a) references other specifications that are written with a different set of terminology - I'm looking at you, ISO32000, b) is published after (or, worse, written to match the behaviour of) the reference implementation (Acrobat), and c) describes the test conditions with words, rather than by reference to a series of test-cases.

If nothing else, it may illustrate for our customers - and the customers of other PDF products in the marketplace - why the various products don't always agree on the specification. The PDF Association is behind active efforts to improve this situation.

Here are a few examples:

  1. ISO14289, section 7.5, states "Tables should include headers." - should is understood here to mean a recommendation, but Acrobat DC interprets this as a requirement, although it only seems to require a single TH for each table, which matches neither interpretation. We treat it as a recommendation, and the PAC tool doesn't care.
  2. All tagged content "...shall be tagged as defined in ISO 32000-1:2008, 14.8." - here the PDF/UA specification normatively references the PDF specification. Table 337 states a TR is "...a row of headings or data in a table. It may contain table header cells and table data cells (structure types TH and TD)." So is a TR allowed to contain plain text that is not in a cell? It's not clear; it depends on the definition of "may", which is used in ISO32000 in a different sense to the usage in ISO14289. Acrobat sometimes disallows it depending on what else is in the table; the PAC tool appears to be unsure, as it issues a non-fatal warning. The language in the upcoming PDF 2.0 specification has been revised and is clearer, although still not unambiguous. We read it to mean "text content is allowed", and that's how we apply it.
  3. To assist with the process of testing the specification the PDF/UA competence center gave us the Matterhorn Protocol 1.02, describing a number of tests "...encompassing file format requirements specified in PDF/UA-1." However test 01-003, "Content marked as Artifact is present inside tagged content." doesn't match the language of the specification, which states "Artifacts shall not be tagged in the structure tree.". It is possible for an artifact to be inside tagged content but not referenced from the structure tree; here the test description is incorrect, and Acrobat, PAC and our API all agree.