BFO PDF Library 2.24, with improved PDF/UA
We released the PDF Library 2.24 today with vastly improved support for editing and verifying the PDF Structure Tree. This is the structure that is used to make a PDF accessible (among other uses), and is required to achieve both PDF/UA and PDF/A-1a, 2a and 3a compliance.
We've had some support for this structure for quite a while now. So what's new?
The Structure Tree: an overview
PDF without a Structure Tree is just a sequence of operations on a page: move here, set this font, draw this text, draw the line, add this image. So imposing a structure onto this needs a little bit of lateral thinking.
Map<String,Object> atts = new HashMap<String,Object>();
atts.put("id", "p1");
page.beginTag("Document", null);
  page.beginTag("P", atts);           // first paragraph, id "p1"
  page.setStyle(style);
  page.drawText("Hello");
  page.endTag();
  atts.put("id", "s1");
  page.beginTag("Span", atts);        // span, id "s1"
  page.drawRectangle(0, 0, 20, 40);
  page.endTag();
  atts.put("id", "p2");
  page.beginTag("P", atts);           // second paragraph, id "p2"
  page.drawImage(img, 0, 0, 1, 1);
  page.endTag();
page.endTag();                        // closes "Document"
1 0 0 rg                % set color to red
/P <</MCID 0>> BDC      % begin section 0
BT                      % begin text
/R1 24 Tf               % set font R1, 24pt
(Hello) Tj              % draw text "Hello"
ET                      % end text
EMC                     % end section 0
/Span <</MCID 1>> BDC   % begin section 1
0 0 20 40 re f          % draw rectangle
EMC                     % end section 1
/P <</MCID 2>> BDC      % begin section 2
/R2 Do                  % draw image R2
EMC                     % end section 2
The way Adobe decided to do this was to add "markers" into the stream. Each page (or canvas) can be divided into marked sections, each with a unique number. A tree is then constructed separately that points to those numbered sections. We represent each of these sequences in the Document returned from getStructureTree as elements in a special namespace. Here is the tree you might get back from this method with the above code:
<StructTreeRoot xmlns="http://iso.org/pdf/ssn" xmlns:bfo="urn:bfopdf">
  <Document>
    <P id="p1">
      <bfo:content mcid="0" page="0">Hello</bfo:content>
    </P>
    <Span id="s1">
      <bfo:content mcid="1" page="0" />
    </Span>
    <P id="p2">
      <bfo:content mcid="2" page="0" />
    </P>
  </Document>
</StructTreeRoot>
This tree is live, so if you want to swap the order of the two paragraphs, or put the Span inside a Paragraph, this is easily done. For example:
Document doc = pdf.getStructureTree();
Element p1 = doc.getElementById("p1");
Element s1 = doc.getElementById("s1");
p1.appendChild(s1);    // moves the Span inside the first paragraph
<StructTreeRoot xmlns="http://iso.org/pdf/ssn" xmlns:bfo="urn:bfopdf">
  <Document>
    <P id="p1">
      <bfo:content mcid="0" page="0">Hello</bfo:content>
      <Span id="s1">
        <bfo:content mcid="1" page="0" />
      </Span>
    </P>
    <P id="p2">
      <bfo:content mcid="2" page="0" />
    </P>
  </Document>
</StructTreeRoot>
Why would you want to do this? One example is when combining pages from multiple documents. When moving pages from one PDF to another, the destination PDF will import just enough of the structure from the source PDF to include all the content on the imported pages. But it's likely this resulting structure won't accurately represent the desired result. Being able to edit the Document using the standard DOM package means any changes can be made to the DOM quickly and easily. Or at least as quickly and easily as you can do anything with the DOM package.
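Swapping the two paragraphs uses nothing beyond the standard DOM calls. The sketch below is only an illustration: it builds a plain org.w3c.dom Document shaped like the example tree instead of loading a PDF, so the class name, the helper methods and the setIdAttribute call are assumptions made for this stand-in and are not part of the PDF Library API.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class SwapParagraphs {
    static final String NS = "http://iso.org/pdf/ssn";

    // Build a plain DOM with the same shape as the example tree.
    // This is a stand-in: a real tree would come from pdf.getStructureTree().
    static Document buildTree() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();
            Element root = doc.createElementNS(NS, "StructTreeRoot");
            doc.appendChild(root);
            Element document = doc.createElementNS(NS, "Document");
            root.appendChild(document);
            String[][] children = { {"P", "p1"}, {"Span", "s1"}, {"P", "p2"} };
            for (String[] child : children) {
                Element e = doc.createElementNS(NS, child[0]);
                e.setAttribute("id", child[1]);
                e.setIdAttribute("id", true);   // a plain DOM needs this for getElementById
                document.appendChild(e);
            }
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Swap two siblings; assumes "first" precedes "second" under the same parent.
    static void swap(Element first, Element second) {
        Node parent = first.getParentNode();
        Node afterSecond = second.getNextSibling();   // null if second is the last child
        parent.insertBefore(second, first);
        parent.insertBefore(first, afterSecond);      // a null reference node means "append"
    }

    // List the ids of the children of <Document>, in tree order.
    static String order(Document doc) {
        StringBuilder sb = new StringBuilder();
        Node document = doc.getDocumentElement().getFirstChild();
        for (Node n = document.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(((Element) n).getAttribute("id"));
        }
        return sb.toString();
    }

    // Swap p1 and p2 and report the resulting order.
    static String demo() {
        Document doc = buildTree();
        swap(doc.getElementById("p1"), doc.getElementById("p2"));
        return order(doc);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With a Document obtained from a PDF, only the swap method would be needed; the rest exists to make the example run on its own.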
Quirks
The Document returned from the PDF is not a regular XML document, although we try to present it as one by using the DOM interface. There are some key differences you should be aware of if you're planning on working with this Document.
Call Document.normalizeDocument() to incorporate changes to the PDF
As well as editing the Document tree directly via the DOM interface, it's possible to add content into the tree by calling the beginTag/endTag methods, or by migrating pages into or out of the PDF. These changes will not immediately be reflected in the Document, and neither will the automatic creation of namespace prefixes for PDF 2, extracted text and so on (see below).
To ensure the Document you are looking at is complete, call Document.normalizeDocument() after any of these changes and before you plan to analyse or edit the Document via the DOM interface.
page.beginTag("P", null);
page.drawText("Hello", 100, 800);
page.endTag();
Document doc = pdf.getStructureTree();
dump(doc);    // Where's my content?
// <StructTreeRoot xmlns="http://iso.org/pdf/ssn" xmlns:bfo="urn:bfopdf"/>

doc.normalizeDocument();
dump(doc);    // It's added in the call to normalizeDocument()
// <StructTreeRoot xmlns="http://iso.org/pdf/ssn" xmlns:bfo="urn:bfopdf">
//   <P>
//     <bfo:content mcid="0" page="0">Hello</bfo:content>
//   </P>
// </StructTreeRoot>
You can create Elements but not Text, and some Elements are read-only
With a regular DOM Document you could create a new Text node by calling Document.createTextNode. This can't be done with the Document returned from PDF.getStructureTree(). Content in this tree must exist on the page, and the only way to do that is to mark a section of the page with beginTag/endTag as shown above.
Likewise, it's not possible to create or change attributes on an Element in the special "bfo" namespace or on the root element, and it's not possible to create processing instructions, entities and so on.
Pseudo-namespaces are used for certain attributes
PDF defines several standard attributes which can be applied to elements, and groups them into sets by "Owner". For example, to set the number of rows a <TH> element spans, you would set the RowSpan attribute with the Table owner. We represent this in the tree as an attribute with the "Table" prefix in the "urn:Table" namespace.
Map<String,Object> atts = new HashMap<String,Object>();
atts.put("Table:RowSpan", "2");    // RowSpan attribute, owned by "Table"
page.beginTag("TH", atts);
<StructTreeRoot xmlns:bfo="urn:bfopdf" xmlns="http://iso.org/pdf/ssn" xmlns:Table="urn:Table">
  <TH Table:RowSpan="2" />
</StructTreeRoot>
The prefix "Table" as well as "Layout", "List", "PrintField" and "Artifact" are bound to special namespaces in this way, and cannot be reset.
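Because owner attributes appear as ordinary namespaced attributes, they can be read with the usual namespace-aware DOM calls. The following is a minimal sketch on a plain DOM parsed from a string rather than on a tree loaded from a PDF; the class and method names are hypothetical, and falling back to 1 when the attribute is absent is an assumption of this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class OwnerAttributes {
    static final String SSN = "http://iso.org/pdf/ssn";
    static final String TABLE_NS = "urn:Table";   // namespace bound to the "Table" prefix

    // Parse an XML string into a namespace-aware DOM (stand-in for a structure tree).
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Read Table:RowSpan from a cell; returns 1 if the attribute is not set (assumed default).
    static int rowSpan(Element cell) {
        String value = cell.getAttributeNS(TABLE_NS, "RowSpan");
        return value.isEmpty() ? 1 : Integer.parseInt(value);
    }

    // Return the row spans of the first TH and the first TD in the sample, as "th,td".
    static String demo() {
        String xml = "<StructTreeRoot xmlns='" + SSN + "' xmlns:Table='" + TABLE_NS + "'>"
                   + "<Table><TR><TH Table:RowSpan='2'/><TD/></TR></Table>"
                   + "</StructTreeRoot>";
        Document doc = parse(xml);
        Element th = (Element) doc.getElementsByTagNameNS(SSN, "TH").item(0);
        Element td = (Element) doc.getElementsByTagNameNS(SSN, "TD").item(0);
        return rowSpan(th) + "," + rowSpan(td);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Looking the attribute up by namespace URI rather than by the literal string "Table:RowSpan" keeps the code independent of how a prefix happens to be written.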
PDF 1.x Documents may have no namespace or a fixed namespace.
Other than the pseudo-namespaces above and the magical "bfo" namespace we use for content, a StructureTree in PDF 1.x has no namespace. The concept isn't part of ISO 32000-1. However, that specification makes a distinction between a Structure Tree which meets a set of specific requirements (the same requirements used by PDF/UA and PDF/A) and one which does not.
Documents which claim to meet these requirements set the "Marked" property under the Document Catalog to true. When we load a PDF that makes this claim, we set the namespace on the root element to http://iso.org/pdf/ssn (a value first defined in PDF 2.0, but specified to apply to documents that match the requirements in ISO 32000-1). Documents that don't make this claim have no namespace.
PDF 2.x allows namespaces, but no namespace prefixes
PDF 2.x introduced a few changes in this area. The set of approved tags was changed (some were added, some removed), and namespaces are introduced for both elements and attributes. So when we open a PDF 2.0 document that claims to meet the requirements outlined above, we set the namespace to the value http://iso.org/pdf2/ssn, as defined in ISO 32000-2.
While namespaces are allowed, the concept of a prefix is not part of the specification. We will assign prefixes automatically to nodes in the tree to make the XML look correct, but they are not stored in the PDF. This has several consequences:
- All prefixes are defined on the root element.
- There is no need to set a prefix with an "xmlns" attribute (if you do, we'll migrate it to the root)
When manipulating the Document with the DOM package, namespaced elements and attributes can be created in the normal way. When creating tags with the beginTag/endTag methods, the namespace URI is specified as a prefix to the element or attribute name, separated with a newline.
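Map<String,Object> atts = new HashMap<String,Object>();
atts.put("http://a.com\nFoo", "val");       // attribute "Foo" in namespace http://a.com
page.beginTag("http://b.com\nP", atts);     // element "P" in namespace http://b.com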
<StructTreeRoot xmlns:bfo="urn:bfopdf" xmlns="http://iso.org/pdf/ssn" xmlns:ns0="http://a.com" xmlns:ns1="http://b.com">
  <ns1:P ns0:Foo="val" />
</StructTreeRoot>
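If you build many namespaced names for beginTag, a small helper keeps the newline convention in one place. This is a hypothetical utility written for illustration, not something provided by the library; the class and method names are assumptions.

```java
public class TagNames {

    // Join a namespace URI and a local name with a newline, the form
    // described above for beginTag element and attribute names.
    // A null namespace yields the plain local name.
    static String qualified(String namespaceUri, String localName) {
        return namespaceUri == null ? localName : namespaceUri + "\n" + localName;
    }

    // Split a name back into { namespaceUri, localName }.
    // The namespace part is null when the name contains no newline.
    static String[] split(String name) {
        int i = name.indexOf('\n');
        if (i < 0) {
            return new String[] { null, name };
        }
        return new String[] { name.substring(0, i), name.substring(i + 1) };
    }

    public static void main(String[] args) {
        String name = qualified("http://b.com", "P");
        String[] parts = split(name);
        System.out.println(parts[0] + " / " + parts[1]);
    }
}
```

A call such as page.beginTag(TagNames.qualified("http://b.com", "P"), atts) would then be equivalent to writing the string by hand.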
Characters are allowed that are invalid in XML
It's possible to create an element or attribute with a name that is invalid in XML - containing spaces, punctuation and so on. We've actually seen quite a few documents constructed this way while testing; it seems to be something that's done by Adobe InDesign.
This won't cause a problem unless you are trying to import an element or attribute from the Structure Tree into a regular DOM, in which case illegal characters will throw an Exception. The solution is to set a parameter on the DomConfig object, as shown below. The "fix-invalid-xml" parameter will not change the values internally, but will change the way they are presented in the DOM interface so that they appear as legal XML values.
Element e = document.getElementById("id1");
System.out.println(e.getTagName());                 // Output is "Tag name" - space is invalid!
Element copy = (Element)dom.importNode(e, true);    // Throws an exception.

document.getDomConfig().setParameter("fix-invalid-xml", true);
System.out.println(e.getTagName());                 // Output is "Tag_name" - now it's valid.
copy = (Element)dom.importNode(e, true);            // Node is imported (deep copy), all is well
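To see why the import fails, it helps to try the same name against a regular DOM. The sketch below checks whether the JDK's DOM implementation accepts a name, and shows one simple way a name could be made legal. The sanitize method is an ASCII-only approximation written for this example; it is not the algorithm used by "fix-invalid-xml", and the class and method names are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;

public class XmlNames {

    // True if a regular DOM accepts the name for a new element.
    static boolean isLegalName(String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            doc.createElement(name);   // throws DOMException for names that are not valid XML
            return true;
        } catch (DOMException e) {
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Simplified, ASCII-only clean-up: replace anything outside letters, digits,
    // '_', '.' and '-' with '_', and make sure the name starts with a letter or '_'.
    static String sanitize(String name) {
        String s = name.replaceAll("[^A-Za-z0-9_.-]", "_");
        if (s.isEmpty() || !(Character.isLetter(s.charAt(0)) || s.charAt(0) == '_')) {
            s = "_" + s;
        }
        return s;
    }

    public static void main(String[] args) {
        String original = "Tag name";
        System.out.println(isLegalName(original));
        System.out.println(sanitize(original));
        System.out.println(isLegalName(sanitize(original)));
    }
}
```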
Text content is not always included
Extracting the text from a newly loaded PDF is quite a slow operation, and it requires an "extended edition plus viewer" license. For that reason we don't always populate the <bfo:content> elements with their text content (we do if you've created the content yourself, of course - this only applies to PDFs that have been loaded).
To complete the tree you need to either set the "extract-text" parameter on the DomConfig to true, or call PDFParser.getStructureTree instead of PDF.getStructureTree (this approach exists for legacy reasons; they do the same thing).
PDF pdf = new PDF(new PDFReader(new File("HelloWorld.pdf")));
Document doc = pdf.getStructureTree();
dump(doc);    // No text within the bfo:content element
// <StructTreeRoot xmlns="http://iso.org/pdf/ssn" xmlns:bfo="urn:bfopdf">
//   <Document>
//     <P id="p1">
//       <bfo:content mcid="0" page="0" />
//     </P>
//   </Document>
// </StructTreeRoot>

doc.getDomConfig().setParameter("extract-text", true);
doc.normalizeDocument();
dump(doc);    // Text content is there
// <StructTreeRoot xmlns="http://iso.org/pdf/ssn" xmlns:bfo="urn:bfopdf">
//   <Document>
//     <P id="p1">
//       <bfo:content mcid="0" page="0">Hello, World</bfo:content>
//     </P>
//   </Document>
// </StructTreeRoot>
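One way to tell whether text has been populated is to count the <bfo:content> elements that are still empty. The sketch below does this on plain DOM documents parsed from strings that mirror the two dumps above; the class and method names are hypothetical, and it treats "no text" as "not extracted", which is only an assumption (an image section, for instance, would also be empty).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ContentText {
    static final String BFO_NS = "urn:bfopdf";   // namespace of the bfo:content elements

    // Parse an XML string into a namespace-aware DOM (stand-in for a structure tree).
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Count bfo:content elements that have no text content.
    static int countEmptyContent(Document doc) {
        NodeList list = doc.getElementsByTagNameNS(BFO_NS, "content");
        int empty = 0;
        for (int i = 0; i < list.getLength(); i++) {
            if (list.item(i).getTextContent().isEmpty()) {
                empty++;
            }
        }
        return empty;
    }

    // Wrap the given bfo:content markup in the surrounding sample tree.
    static String sample(String content) {
        return "<StructTreeRoot xmlns='http://iso.org/pdf/ssn' xmlns:bfo='" + BFO_NS + "'>"
             + "<Document><P id='p1'>" + content + "</P></Document></StructTreeRoot>";
    }

    public static void main(String[] args) {
        Document before = parse(sample("<bfo:content mcid='0' page='0'/>"));
        Document after = parse(sample("<bfo:content mcid='0' page='0'>Hello, World</bfo:content>"));
        System.out.println(countEmptyContent(before));
        System.out.println(countEmptyContent(after));
    }
}
```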
When saving, simple repairs will be made unless you say otherwise
There are various requirements placed on the Structure Tree by profiles like PDF/UA. For example, the <THead> must always be before any <TBody> in a <Table>. If these restrictions are not met we will try to repair them, notifying you of this by emitting a warning code beginning with "SD".
If for some reason you don't want this to happen (for example, automatic repairs may not be helpful when you're trying to verify that your input is correct), this can be turned off with a parameter to the DomConfig, as shown below. Any errors will then throw an exception when saving instead.
Document doc = pdf.getStructureTree();
doc.getDomConfig().setParameter("fix-structure", false);    // no automatic repairs on save
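With repairs turned off, you may want to test a rule yourself before saving. As an illustration, the THead ordering requirement mentioned above can be checked with a short DOM walk. This is a sketch on a plain DOM parsed from a string; the class and method names are hypothetical and it covers only this one rule, not the checks the library performs.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class TableOrder {

    // True if no <THead> child appears after a <TBody> child of the table.
    static boolean theadBeforeTbody(Element table) {
        boolean seenBody = false;
        for (Node n = table.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue;                       // skip anything that is not an element
            }
            String name = n.getLocalName();
            if ("TBody".equals(name)) {
                seenBody = true;
            } else if ("THead".equals(name) && seenBody) {
                return false;                   // a THead found after a TBody
            }
        }
        return true;
    }

    // Parse the XML (root element must be the <Table>) and run the check.
    static boolean check(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);    // needed for getLocalName()
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            return theadBeforeTbody(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String ns = "xmlns='http://iso.org/pdf/ssn'";
        System.out.println(check("<Table " + ns + "><THead/><TBody/></Table>"));
        System.out.println(check("<Table " + ns + "><TBody/><THead/></Table>"));
    }
}
```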
Sorting content added via the beginTag method
It's quite common to want content in the tree in a different order to the way the same content is placed on the page - for example, if you draw the backgrounds of various objects first, then the text content on top. While it's possible to just dump everything onto the page and then move the content around later in the tree, another approach makes use of two special attributes that can be passed to beginTag: "bfo:sort" and "bfo:uuid". Both are optional.
The "bfo:uuid" attribute can be any String, and is used to uniquely identify an element. Sibling elements with the same UUID in the tree are merged when normalizeDocument is called; the content is all moved to the first element of the set.
Map<String,Object> atts = new HashMap<String,Object>();
atts.put("bfo:uuid", 1);
page.beginTag("P", atts);       // first element with uuid 1
page.drawText("abc", 100, 100);
page.endTag();
atts.put("bfo:uuid", 2);
page.beginTag("Span", atts);    // uuid 2
page.drawText("def", 100, 100);
page.endTag();
atts.put("bfo:uuid", 1);
page.beginTag("P", atts);       // uuid 1 again: merged into the first P
page.drawText("ghi", 100, 100);
page.endTag();
<StructTreeRoot xmlns:bfo="urn:bfopdf" xmlns="http://iso.org/pdf/ssn">
  <P>
    <bfo:content mcid="0">abc</bfo:content>
    <bfo:content mcid="2">ghi</bfo:content>
  </P>
  <Span>
    <bfo:content mcid="1">def</bfo:content>
  </Span>
</StructTreeRoot>
Further control is available with the "bfo:sort" attribute, which should be an instance of java.lang.Comparable (a java.lang.Integer is a good choice). Sibling elements will be sorted on this key, before they are merged on their uuid.
A very common case is trying to convert an existing XML document to a PDF structure. The easiest way to do this is to ensure that the "bfo:uuid" and "bfo:sort" attributes are both set to an Integer which is the index in document order of the original node. This will allow you to add content to the page in any order you like; so long as the beginTag/endTag calls are nested properly and the "bfo:sort" and "bfo:uuid" attributes are set, the resulting tree will be in the same order as the input tree.
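The "sort first, then merge on uuid" behaviour can be modelled with plain collections. The sketch below is a simplified model of the description above, not the library's implementation: the Section class, its fields and the arrange method are all hypothetical, every section is treated as a sibling, and the sort is assumed to be stable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SortAndMerge {

    // One beginTag/endTag section: its sort key, its uuid and what was drawn.
    static final class Section {
        final int sort;
        final int uuid;
        final String content;

        Section(int sort, int uuid, String content) {
            this.sort = sort;
            this.uuid = uuid;
            this.content = content;
        }
    }

    // Sort sections on the sort key (List.sort is stable, so draw order breaks ties),
    // then merge sections with the same uuid into the first one encountered.
    static Map<Integer, List<String>> arrange(List<Section> drawn) {
        List<Section> sorted = new ArrayList<>(drawn);
        sorted.sort(Comparator.comparingInt(s -> s.sort));
        Map<Integer, List<String>> merged = new LinkedHashMap<>();
        for (Section s : sorted) {
            merged.computeIfAbsent(s.uuid, k -> new ArrayList<>()).add(s.content);
        }
        return merged;
    }

    // Sections in the order they were drawn: the body background first, then the
    // title, then the body text. Source node 0 is the title, node 1 is the body.
    static String demo() {
        List<Section> drawn = Arrays.asList(
            new Section(1, 1, "body-background"),
            new Section(0, 0, "title"),
            new Section(1, 1, "body-text"));
        return arrange(drawn).toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Using the source node's document-order index for both keys, as suggested above, means the draw order has no effect on which element comes first, only on the order of content inside a merged element.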
One last tip: the "bfo:location" attribute can be set to any String, and will be included in any warning or error messages printed about the Structure Tree. Set it to the location of the original element to aid debugging.
Accessing the Document "Role Map"
An aspect of the Structure Tree that is not part of XML is the ability to remap tags. For example, if you wished to represent both <pre> and <p> in PDF/UA you have a problem, as only <P> is a recognised tag. You can solve this by mapping the <pre> tag to <P> by way of the document role-map. This is retrieved from the DomConfig as before.
Document doc = pdf.getStructureTree();
Map<String,String> rolemap = (Map<String,String>)doc.getDomConfig().getParameter("role-map");
rolemap.put("pre", "P");
page.beginTag("pre", null);    // Now valid in PDF/UA, as pre is mapped to P
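Conceptually, a role map is just a lookup from a custom tag to a standard one. The sketch below shows how such a lookup might be resolved with an ordinary java.util.Map; it is a model for illustration only. The class and method names are hypothetical, and following chains of mappings (with a guard against loops) is an assumption of this example rather than documented library behaviour.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RoleMapLookup {

    // Resolve a tag through the role map. Tags with no entry are returned unchanged.
    // Mappings are followed transitively; the "seen" set stops a cyclic map from looping.
    static String resolve(String tag, Map<String, String> roleMap) {
        Set<String> seen = new HashSet<>();
        while (roleMap.containsKey(tag) && seen.add(tag)) {
            tag = roleMap.get(tag);
        }
        return tag;
    }

    public static void main(String[] args) {
        Map<String, String> roleMap = new HashMap<>();
        roleMap.put("pre", "P");
        System.out.println(resolve("pre", roleMap));   // custom tag, mapped
        System.out.println(resolve("P", roleMap));     // standard tag, no entry
    }
}
```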
Element names retrieved via the DOM interface are always the original values before remapping; for the beginTag("pre", null) call above, the Element.getTagName() method would return "pre".
Differences from previous releases
Finally, some small changes were made to the beginTag method which are incompatible with previous releases.
- "ID", "C", "T", "text" and "E" were aliases for "id", "class", "title", "ActualText" and "abbr" respectively. These aliases have been removed.
- Standard attributes could be specified without their Owners; for example you could specify "RowSpan" as an alias for "Table:RowSpan". These aliases have been removed.
- We've added a lot of new features to OutputProfile to better profile PDF/UA, and a few of the older ones have been removed. There's no real reason to reference those individual features, so this is unlikely to affect anyone. A quick recompile against 2.24 will identify if that's the case - if it is, drop us a line at support@bfo.com.
Conclusion
For the most part it will be easiest to create a Structure Tree with a larger project built on top of the PDF API, such as our Report Generator. Most of the changes in this release are designed to facilitate that, but there are others:
- The improvements to PDF/UA validation.
- The ability to merge PDFs with a Structure Tree and get a valid (and useful) result.
- The fix to ensure content is not considered damaged by Acrobat.
We hope that these features will make 2.24 a useful upgrade for many.