<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="blog.xsl"?>
<article>
 <title>Creation of PDF/UA and PDF/A-3a documents</title>
 <subtitle/>
 <excerpt>
 Our latest 2.20 release of the PDF Library adds full support for the creation of PDF/UA compliant documents,
 and hopefully this article will shed a little light on what they are. 
 </excerpt>
 <time>2017-04-19T11:00:00</time>
 <author>mike</author>
 <category>news</category>
 <category>pdf</category>
 <tags>PDFA PDFUA accessibility outputprofile</tags>
 <body>
  <h1>What on earth is PDF/UA?</h1>
  <p>
   One of the more recent "sub-standards" of PDF to emerge from ISO is <i>PDF/UA</i>,
   which is also known as 
   <a href="https://www.iso.org/standard/64599.html">ISO-14289</a>.
   The "UA" here stands for <i>Universal Accessibility</i>, and like PDF/A, PDF/X,
   PDF/E etc., PDF/UA
   imposes a particular set of rules on how the PDF is created: in this case, rules
   intended to make reading a PDF easier for those using assistive technology, such as screen
   readers for the partially sighted.
  </p><p>
   The <a href="http://www.pdfa.org">PDF Association</a> (of which BFO is a member)
   have published an excellent primer on this as an e-book: 
   <a href="https://www.pdfa.org/publication/pdfua-in-a-nutshell/">PDF/UA in a Nutshell</a>
   which goes into a lot more detail than we can here.
  </p><p>
    Even if PDF/UA is still unfamiliar, you may have heard of some of these initiatives.
  </p>
  <ul>
   <li>
     If you're dealing with US government documents you've probably encountered
     <a href="https://www.section508.gov">Section 508</a>, the requirement for US Government
     documents to be accessible for people with disabilities.
   </li>
   <li>
     The EU has 
     <a href="http://mandate376.standards.eu/standard">EN 301 549</a>,
     a similar standard for ICT procurement in the EU, although unlike Section 508 this is currently voluntary.
   </li>
   <li>
     You may also have heard of 
     <a href="https://www.w3.org/WAI/intro/wcag">Web Content Accessibility Guidelines</a>, an initiative 
     to ensure that online content (which can include PDFs) is accessible.
   </li>
  </ul>
  <p>
    All of the above guidelines are fairly general, and PDF/UA (described in
    <a href="https://www.iso.org/standard/64599.html">ISO-14289</a>) is the set of PDF-specific requirements
    to meet these guidelines.  So a PDF/UA-compliant document is compliant with Section 508 and WCAG.
  </p>

  <h2>So how does PDF/UA relate to PDF/A?</h2>
  <p>
    The three releases of PDF/A to date have all specified a <i>conformance level</i>, and up
    until now our API has only supported conformance level "B". Conformance level "A" is stricter,
    and requires the PDF content to be <i>tagged</i>, to provide some structure to the
    content of the PDF. This is what PDF/A-1a, PDF/A-2a and PDF/A-3a have
    in common with PDF/UA, and why our 2.20 release adds support for both creating and validating
    PDF/A-1a, PDF/A-2a and PDF/A-3a documents.
  </p><p>
    Given the goal for both specifications is to ensure the <i>semantic content</i> of the PDF
    is known and accessible, we were surprised to find quite a few differences in how this was
    achieved in the two specifications: the section of the PDF/A specification referring to
    tagged PDF compliance insists only that a conforming file
    "<i>... meets all of the requirements set forth for Tagged PDF in ISO 32000-1:2008, 14.8</i>".
    It turns out that most of these requirements are more like guidelines really, and in this
    respect our PDF/A profiles will match Acrobat:
    to achieve conformance level A, a PDF/A document must have a structure in place that uses
    only the "standard" tags. There are no restrictions on how these tags are applied.
  </p><p>
    PDF/UA has a tighter, but not incompatible, set of restrictions, which makes it possible to create
    a PDF that meets the requirements of both specifications.
  </p><p>
    The
    <a href="StructuredPDF.java" viewtext="true">StructuredPDF.java</a>
    example included with our <a href="/download">download</a> shows how to do this by creating
    a PDF that is both PDF/A-3a and PDF/UA compliant. The
    ability to target two different conformance profiles is new in this release, and quite easy to do:
  </p>
  <pre class="brush:java; highlight:3">
  ICC_Profile icc = ICC_Profile.getInstance(ColorSpace.CS_sRGB);
  OutputProfile profile = new OutputProfile(OutputProfile.PDFA3a, "sRGB", null, "http://www.color.org", null, icc);
  profile.merge(OutputProfile.PDFUA1);
  PDF pdf = new PDF(profile);
  </pre>

  <h2>How to create PDF/UA documents with the BFO PDF Library</h2>
  <p>
    The most conspicuous requirement is for the PDF to be "Tagged" with structural content. This interleaves an
    XML-like tag hierarchy into the document content, assigning text and graphics to familiar elements like
    <i>Paragraphs</i> and <i>Articles</i>.
    This must be done while the PDF is being created: although it's possible to add these
    tags to the document after creation with tools like Acrobat, it's not something we'd expect to be
    done programmatically, as it requires visual analysis of the document.
  </p><p>
     With our API, adding these tags is done with the
    <a href="/products/pdf/docs/api/org/faceless/pdf2/PDFCanvas.html#beginTag-java.lang.String-java.util.Map-">beginTag</a>
    and endTag methods on the 
    <a href="/products/pdf/docs/api/index.html?org/faceless/pdf2/PDFPage.html">PDFPage</a>,
    <a href="/products/pdf/docs/api/index.html?org/faceless/pdf2/PDFCanvas.html">PDFCanvas</a> and
    <a href="/products/pdf/docs/api/index.html?org/faceless/pdf2/LayoutBox.html">LayoutBox</a>
    classes, to inject the XML-like tag structure into the PDF content while it's being created.
  </p><p>
    The use of these methods is pretty simple, and as described above, we include a full example
    with our PDF API called <a href="StructuredPDF.java" viewtext="true">StructuredPDF.java</a>
    which shows how they work.
  </p><p>
    There are other technical requirements relating to how the tree content is structured; like HTML,
    a <b>TD</b> must be inside a <b>TR</b>, for instance. Unfortunately the language in this section of
    the specification is not as exact as we would like
    <i>(an assertion that needs backing up, for which I'd direct you to the footnote at the end of this article).</i>
    The result of this ambiguity is that there are points on which our tool, Acrobat and the PDF/UA-specific
    <a href="http://www.access-for-all.ch/en/pdf-lab/pdf-accessibility-checker-pac.html">PDF Accessibility Checker (PAC)</a>
    tool disagree. Generally these are edge-cases, and those creating regular
    tagged documents, rather than test cases, should have no issues.
  </p><p>
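    To make the nesting rule concrete, here is a minimal sketch of the kind of check involved.
    It uses only the JDK rather than our API: the <code>TagNestingChecker</code> class and its rule table are
    hypothetical, invented for this illustration, and cover only a handful of the standard structure types.
    Calling <code>begin()</code> and <code>end()</code> alongside your own beginTag and endTag calls is one
    way to catch the most obvious mistakes early.
  </p>
  <pre class="brush:java"><![CDATA[
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.Set;

/** Illustrative only: mirrors begin/end tag calls and checks a few parent/child rules. */
final class TagNestingChecker {
    // Hypothetical rule table: child tag -> parents it may sit directly inside.
    // Tags not listed here are accepted anywhere.
    private static final Map<String, Set<String>> ALLOWED_PARENTS = Map.of(
        "TR", Set.of("Table", "THead", "TBody", "TFoot"),
        "TH", Set.of("TR"),
        "TD", Set.of("TR"),
        "LI", Set.of("L"));

    // Tags that have been opened but not yet closed, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    /** Record an opening tag; throws if it is misplaced according to the table above. */
    void begin(String tag) {
        Set<String> parents = ALLOWED_PARENTS.get(tag);
        String parent = open.peek(); // null when nothing is open yet
        if (parents != null && (parent == null || !parents.contains(parent))) {
            throw new IllegalStateException(tag + " is not allowed inside "
                + (parent == null ? "the document root" : parent));
        }
        open.push(tag);
    }

    /** Record a closing tag; throws if nothing is open. */
    void end() {
        if (open.isEmpty()) {
            throw new IllegalStateException("end() called with no open tag");
        }
        open.pop();
    }

    /** True once every opened tag has been closed. */
    boolean isBalanced() {
        return open.isEmpty();
    }

    public static void main(String[] args) {
        TagNestingChecker checker = new TagNestingChecker();
        checker.begin("Table");
        checker.begin("TR");
        checker.begin("TD");
        checker.end();
        checker.end();
        checker.end();
        System.out.println("balanced: " + checker.isBalanced());
    }
}
]]></pre>
  <p>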
    PDF/UA has requirements beyond tags too, some of which are listed below:
  </p>
  <ol class="expanded">
    <li>
      <b>All fonts have to be embedded</b>.
      This requirement will come as no surprise to anyone that has worked with PDF/A. Unembedded
      fonts are very much viewer-dependent. <i>(Aside: I'm writing this in March 2017, some 24 years after the PDF
      specification was first published, yet in the last month we've had at least two emails from customers
      concerned about characters missing from their PDFs, both caused by a sub-standard font installed
      on their customer's system. In one case the "minus" character was missing from an engineering document,
      which frankly has disaster written all over it. I still wince thinking about this one.
      If you're creating a PDF for general distribution, <b>you should always embed your fonts</b> - PDF/UA or not.)</i>
    </li>
    <li>
      <b>The document language must be set</b>.
      This is easily done by calling
      <a href="/products/pdf/docs/api/org/faceless/pdf2/PDF.html#setLocale-java.util.Locale-">PDF.setLocale()</a>
      to set the overall language of the PDF. If the document is in multiple languages, text added to a
      <a href="/products/pdf/docs/api/index.html?org/faceless/pdf2/LayoutBox.html">LayoutBox</a>
      can override this Locale as necessary.
    </li>
    <li>
      <b>The document title must be set and displayed in the window title</b>.
      The title can be easily set by calling <code class="brush:java">pdf.setInfo("Title", "An Accessible PDF")</code>.
      To ask the PDF viewer to display it in the window title, call
      <code class="brush:java">pdf.setOption("view.displayDocTitle", Boolean.TRUE)</code>. Provided you set a
      title, the rest will be done for you if you're targeting the PDF/UA profile.
    </li>
    <li>
      <b>All annotations that don't have text content must have a description</b>.
      This is good practice even for non-accessible documents, and is done by calling
      <a href="/products/pdf/docs/api/org/faceless/pdf2/PDFAnnotation.html#setContents-java.lang.String-">PDFAnnotation.setContents()</a>.
      The primary case for this is hyperlinks, but it can be done for any annotation. This is similar
      in function to the HTML "alt" attribute.
    </li>
  </ol>
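  <p>
    On the document language: the value passed to PDF.setLocale() is an ordinary <code>java.util.Locale</code>,
    so if your language arrives as a string (from a configuration file, say) it may be worth checking it
    before use. The sketch below uses only the JDK. The <code>DocumentLanguage</code> helper is hypothetical
    and not part of our API, and it treats any tag with no language part as an error, which is our assumption
    about what you would want.
  </p>
  <pre class="brush:java"><![CDATA[
import java.util.Locale;

/** Illustrative only: turns a BCP 47 language tag into a Locale, rejecting unusable input. */
final class DocumentLanguage {
    static Locale parse(String tag) {
        // forLanguageTag never throws on bad input; it quietly drops what it cannot parse.
        Locale locale = Locale.forLanguageTag(tag);
        if (locale.getLanguage().isEmpty()) {
            throw new IllegalArgumentException("No language found in tag: \"" + tag + "\"");
        }
        return locale;
    }

    public static void main(String[] args) {
        Locale locale = parse("en-GB");
        // With the PDF Library on the classpath, this is the value to pass to PDF.setLocale().
        System.out.println(locale.toLanguageTag());
    }
}
]]></pre>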
  <p>
    Meet the requirements above and you'll be able to generate a PDF that is PDF/UA compliant, and validates
    in Acrobat (the "Accessibility Report") as well as more specific PDF/UA tools like the
    <a href="http://www.access-for-all.ch/en/pdf-lab/pdf-accessibility-checker-pac.html">PAC</a> utility.
  </p>

  <h2>How to verify compliance of a PDF/UA document</h2>
  <p>
    A key difference between PDF/UA and something like PDF/A is that not all of the requirements for PDF/UA can
    be verified electronically. For example, the tags added to the PDF must be in logical reading
    order, which is not something we can establish.
  </p><p>
    Acrobat seems to acknowledge this by putting its PDF/UA validation functionality under the
    <i>Accessibility</i> tab, rather than in <i>Preflight</i> with the PDF/A validators. I think this is a
    shame, and I hope that PDF/UA compliance checking will become a "first class" test, as it is
    for PDF/A, in future releases of Acrobat.
  </p><p>
    Our API doesn't make this distinction, and verifying a document that claims to be PDF/UA compliant
    is done the same way as with PDF/A. The <i>Dump.java</i> example we include with the PDF Library
    package will verify any profiles the PDF claims to meet - here's an excerpt showing how we identify
    this:
  </p>
  <pre class="brush:java">
      if (profile.isSet(OutputProfile.Feature.InfoMeetsPDFA2a)) {
         // validate PDF against PDF/A-2a
      } else if (profile.isSet(OutputProfile.Feature.InfoMeetsPDFA2b)) {
         // validate PDF against PDF/A-2b
      }
      if (profile.isSet(OutputProfile.Feature.InfoMeetsPDFUA1)) {
         // validate PDF against PDF/UA-1
      }
  </pre>
  <p>
    In practice we expect PDF/UA validation to be done less often than validating PDF/A,
    but if it becomes a requirement of your workflow then our API will be able to help.
  </p>

  <h2>Other uses for PDF structure</h2>
  <p>
    PDF/UA and PDF/A are the main uses for the <i>Structure Tree</i> which underpins the <i>Tags</i>
    described above, but they are not the only ones.
    This release adds two new methods which relate to the document structure: the first is
    <a href="/products/pdf/docs/api/org/faceless/pdf2/PDFParser.html#getStructureTree--">PDFParser.getStructureTree()</a>.
    This returns any structured content in the PDF (PDF/UA compliant or otherwise) as a W3C DOM
    <a href="http://docs.oracle.com/javase/8/docs/api/org/w3c/dom/Document.html?is-external=true">Document</a>.
    How useful this will be depends on how well structured the tags were when they were added. Based on
    our testing so far, the quality varies widely from one document to the next.
  </p><p>
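    Because the result is a plain W3C DOM, the usual JDK tools apply to it. As a rough sketch, the code
    below counts elements by tag name, which is a quick way to see how much structure a document really has.
    It does not call our API: the <code>StructureTreeSummary</code> class is hypothetical, and
    <code>sample()</code> builds an invented stand-in document, so the element names in a tree returned
    by the parser may well differ.
  </p>
  <pre class="brush:java"><![CDATA[
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Illustrative only: summarises any W3C DOM document by counting its elements. */
final class StructureTreeSummary {
    /** Returns a sorted map of element name -> number of occurrences. */
    static Map<String, Integer> countTags(Document doc) {
        Map<String, Integer> counts = new TreeMap<>();
        count(doc.getDocumentElement(), counts);
        return counts;
    }

    // Depth-first walk over the node and all of its descendants.
    private static void count(Node node, Map<String, Integer> counts) {
        if (node == null) {
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            counts.merge(node.getNodeName(), 1, Integer::sum);
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            count(child, counts);
        }
    }

    /** Invented sample standing in for a structure tree read from a real PDF. */
    static Document sample() {
        String xml = "<Document><H1>Title</H1><P>One</P><P>Two</P>"
                   + "<Table><TR><TH>A</TH><TD>1</TD></TR></Table></Document>";
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Sample XML failed to parse", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countTags(sample()));
    }
}
]]></pre>
  <p>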
    The second is 
    <a href="/products/pdf/docs/api/org/faceless/pdf2/PDF.html#rebuildStructureTree--">PDF.rebuildStructureTree()</a>,
    which will <i>attempt</i> to rebuild the internal data structures. This is particularly useful
    if you've been moving content from one document to another: with this release and this method,
    it's possible to move a page with structured content from one PDF to another and keep the structured
    content intact.
  </p><p>
    We don't do this automatically because it's an expensive operation, and usually it's not
    required; for example, if you're concatenating a number of PDFs together, chances are
    you don't really care if one of them happens to have an internal structure. For the few times where
    this does matter and structure is to be preserved, simply call this method before calling <code>render()</code>.
  </p>

  <h2>Summary</h2>
  <p>
    PDF/UA has been around for four years, and we have seen a gradual increase in interest from our
    customers over the last year or so.
    It's an exciting and underused area of PDF, and we're very pleased to be able
    to finally support it properly.
  </p><p>
    We plan to add the creation of tags to our <a href="/products/report/">Report Generator</a> in the near future.
    Until then we
    hope the features described above will be useful to those customers that have a need to create accessible
    documents today.
  </p><p>
    For those customers we would recommend the 
    <a href="StructuredPDF.java" viewtext="true">StructuredPDF.java</a> example linked to above as a starting point,
    along with the 
    <a href="/products/pdf/docs/api/org/faceless/pdf2/PDFCanvas.html#beginTag-java.lang.String-java.util.Map-">PDFCanvas.beginTag()</a>
    API documentation.
  </p>

  <blockquote style="font-size:smaller">
  <h2>Footnote: ambiguity in the spec</h2>
  <p>
    While likely to be of interest to very, very few people, I'm going to back up my claim above that the specification
    is a bit inexact in this area. It doesn't help that the ISO32000, ISO14289 and <i>Matterhorn Protocol</i> documents
    all define the same thing in slightly different ways and yet reference each other; however the definitive document is
    ISO14289.
  </p><p>
    This isn't intended as a dig at any of the authors of the software or specifications below, and 
    isn't saying anything that hasn't already been said before: the
    language referred to below has already been significantly improved in the upcoming PDF 2.0 specification.
    It's simply meant to illustrate the unintended consequences of a specification that a) references other
    specifications written with a different set of terminology - I'm looking at you,
    <a href="/blog/2013/03/14/the_firefox_pdf_js_viewer/">ISO32000</a>, b) is published <i>after</i> (or, worse,
    written to match the behaviour of) the reference implementation (Acrobat), and c) describes the test conditions
    with words, rather than by reference to a series of test-cases.
  </p><p>
    If nothing else, it may illustrate for our customers - and the customers of other PDF products in the
    marketplace - why the various products don't always agree on the specification. The PDF Association is
    behind active efforts to improve this situation.
  </p><p>
    Here are a few examples:
  </p>
  <ol class="expanded">
  <li>
    ISO14289, section 7.5, states <i>"Tables should include headers."</i> - <b>should</b> is
    <a href="https://www.iso.org/foreword-supplementary-information.html">understood</a>
    here to mean a <i>recommendation</i>, but Acrobat DC interprets this as a <i>requirement</i>,
    although it only seems to require a single TH for each table, which matches neither
    interpretation. We treat it as a recommendation, and the PAC tool doesn't care.
  </li>
  <!--
  <li>
    The same section states that <i>"...if the table’s structure is not determinable via Headers and IDs, then
    structure elements of type TH shall have a Scope attribute."</i>. Here <b>shall</b> is a normative requirement, but there
    is no mention of this "determination" process in ISO32000, to which this section refers; the best we get is that the
    Scope attribute "<i>...shall reflect whether the header cell applies to the rest of the cells in the row that contains
    it, the column that contains it, or both the row and the column that contain it.</i>" (table 349), along with the note
    that <i>"...the association of headers with rows and columns of data is typically determined heuristically by
    applications."</i> (14.8.4.3.4).
    Acrobat doesn't seem to require a Scope attribute, even in cases where it would be required to remove ambiguity, and
    neither does PAC. In the absence of a description of this algorithm, neither can we.
  </li>
  -->
  <li>
  All tagged content <i>"...shall be tagged as defined in ISO 32000-1:2008, 14.8."</i> - here the PDF/UA
  specification normatively references the PDF specification. Table 337 states a TR is <i>"...a row of
  headings or data in a table. It may contain table header cells and table data cells (structure types TH and
  TD)."</i> So is a TR allowed to contain plain text that is not in a cell? It's not clear; it depends
  on the definition of "<i>may</i>", which is used in ISO32000 in a different sense to the usage in ISO14289.
  Acrobat sometimes disallows it depending on what else is in the table; the PAC tool appears to be unsure,
  as it issues a non-fatal warning. The language in the upcoming PDF 2.0 specification has been revised and is
  clearer, although still not unambiguous. We read it to mean "text content is allowed", and that's how we
  apply it.
  </li>
  <li>
  To assist with the process of testing the specification the PDF/UA competence center gave us the Matterhorn
  Protocol 1.02, describing a number of tests "<i>...encompassing file format requirements specified in
  PDF/UA-1."</i> However test 01-003, "<i>Content marked as Artifact is present inside tagged content</i>."
  doesn't match the language of the specification, which states "<i>Artifacts shall not be tagged in the
  structure tree.</i>". It is possible for an artifact to be <b>inside</b> tagged content but not referenced
  from the structure tree; here the test description is incorrect, and Acrobat, PAC and our API all agree.
  </li>
  </ol>
  </blockquote>


 </body>
</article>