Converting PDF to PDF/A

with the BFO Java PDF Library v2.26

We first added PDF/A support to our API many years ago, but our focus was on validation - you load a PDF, and the API tells you if it's valid or not.

If a file isn't PDF/A and you need it to be, we've got a method to fix that too: OutputProfiler.apply(). Give this method a target OutputProfile, such as PDF/A-1b:2005, and it will adjust the PDF to match.

But in many cases the PDF was beyond what we could repair, and apply() threw an Exception indicating conversion failed. At that point your only real option was to rasterize each page to a bitmap image to replace the original, perhaps even copying those pages to a new PDF if the original had content we couldn't repair.

This is how we've done things for years, and while it works, there were two main problems with it:

You, the developer, had to write quite a lot of code to manage the process - checking for failures, choosing the correct ColorSpace, deciding what to do if conversion failed. These decisions meant that a "one click" solution was hard for us to offer.
Conversion to bitmap happened more than it needed to, and it's slow. What's more, when it happens something is usually lost in the process. Most obviously the text is no longer searchable, but potentially more, depending on how much work the developer has put in to "salvage" content from the original PDF: metadata, bookmarks and so on.

We've helped many customers doing variations on this same task, and it seemed the right time to bring it in-house and package it up to make it easier. A couple of weeks' work, we figured. Well, four months later, it turns out we figured wrong. But the new approach is quite an improvement.

So what follows below is a step-by-step guide to the process. These steps are also wrapped up in a new example, ConvertToPDFA.java (download), which we now ship with the API. It's a standalone class designed to be incorporated into your code, with a main method so you can test from the command line.

We've run this over 17,000 test files here, converting all to PDF/A-1, A-2 and A-3, comparing the results to both Acrobat and veraPDF. We're confident it will handle whatever PDF you feed it.

Here's how it works.

The basics: Setup and Fonts

PDF/A requires all visible text to be rendered in an embedded font, so the first thing you're going to need for conversion to PDF/A is a set of fonts for substitution. But which ones? Based on a survey of our test files here, the most common unembedded fonts are, in order:

Times
Helvetica
Arial
Courier
Symbol
ZapfDingbats
Arial Narrow
Helvetica Narrow
Helvetica Black
Palatino
Helvetica Black
Letter Gothic
New Century Schoolbook
Tahoma
Verdana

Of course this is very western-centric, as we've been bulk testing with the GovDocs Corpora. Chinese/Japanese/Korean documents have a much higher proportion of unembedded fonts due to their size and the fact their glyphs have regular metrics, whereas Cyrillic, Arabic, Hebrew, Hindi etc. fonts will almost always be embedded.

So we don't recommend you include all of those. Many uses of Times, Helvetica, Courier and almost all of Symbol or ZapfDingbats refer to the Standard 14 PDF fonts, which have a predefined set of glyphs. We include free substitutions for those with the API.

For the rest, if we can't find a match based on the font name, we'll ensure we choose a font that has all the required glyphs and the most similar metrics to the font it's replacing.

What we recommend is including the Times, Arial, and Courier fonts that ship with Windows (assuming you're running Windows). Don't forget the bold and italic variations. We also recommend at least one of the Noto CJK fonts, typically NotoSansCJKsc-Regular.otf, and ideally a narrow sans-serif font, such as Arial-Narrow, which will serve as a match for any narrow fonts. In total that's about 15 fonts. Here's how we'd get started - loading a PDF, creating an OutputProfiler and giving it some fonts to substitute.

    import org.faceless.pdf2.*;
    import java.util.*;
    import java.io.*;
    import java.awt.color.ColorSpace;

    PDF pdf = new PDF(new PDFReader(new File("input.pdf")));
    OutputProfiler profiler = new OutputProfiler(new PDFParser(pdf));
    OutputProfile profile = profiler.getProfile();

    OutputProfile target = OutputProfile.PDFA1b_2005;
    if (profile.isCompatibleWith(target) == null) {
        return "PDF is already compatible with PDF/A-1b";
    }

    // We need to convert the PDF. Load some fonts.
    OutputProfiler.AutoEmbeddingFontAction fa = new OutputProfiler.AutoEmbeddingFontAction();
    fa.add(new OpenTypeFont(new File("path/to/NotoSansCJKsc-Regular.otf"), null));
    fa.add(new OpenTypeFont(new File("path/to/times.ttf"), null));
    fa.add(new OpenTypeFont(new File("path/to/timesi.ttf"), null));
    fa.add(new OpenTypeFont(new File("path/to/timesbd.ttf"), null));
    fa.add(new OpenTypeFont(new File("path/to/timesbi.ttf"), null));
    fa.add(new OpenTypeFont(new File("path/to/arial.ttf"), null));
    fa.add(new OpenTypeFont(new File("path/to/ariali.ttf"), null));
    // etc
    profiler.setFontAction(fa);
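
Listing around 15 font paths by hand gets tedious. As a minimal sketch, assuming your substitution fonts all sit in one directory, the helper below collects every .ttf and .otf file so that each path can then be handed to the font-loading code above. FontFiles, isFontFile and list are hypothetical names used for illustration; they are not part of the BFO API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of the BFO API.
final class FontFiles {

    // True for file names ending in .ttf or .otf, ignoring case.
    static boolean isFontFile(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        return lower.endsWith(".ttf") || lower.endsWith(".otf");
    }

    // Returns the font files directly inside dir (no recursion), in sorted order.
    static List<Path> list(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                .filter(Files::isRegularFile)
                .filter(p -> isFontFile(p.getFileName().toString()))
                .sorted()
                .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Directory to scan: first argument, or the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path font : list(dir)) {
            System.out.println(font);
        }
    }
}
```

Scanning a directory also makes it harder to forget the bold and italic variations, since they are picked up along with the regular weights.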

Color

Conversion also requires that all device-dependent color is made device-independent/calibrated, and choosing the best way to do this is the most complicated aspect of any PDF/A conversion. The example below shows the basics, but we'll expand on this later (and in the ConvertToPDFA.java example too).

The best approach is to assign the PDF an Output Intent, which describes the ICC Color Profile of the device the document is intended for. But this is just a single ICC profile; if your PDF has both device-dependent RGB and device-dependent CMYK then previously your only option was to rasterize.

This release expands the existing ProcessColorAction object which you supply to the OutputProfiler to convert color. You can now supply it with a number of ColorSpace objects - some RGB, some CMYK - and it will choose the appropriate ones to anchor device-dependent colors that don't match the Output Intent to an ICC profile. Which profiles should you supply? In Europe, we recommend FOGRA39 ("Coated FOGRA 39 300" is a good choice), and in the Americas we recommend SWOP2013. In Japan, "Japan Color 2011". ICC profiles for all of these are available for download at https://www.color.org/registry/index.xalter.

Java color is entirely based on sRGB, so that's usually the best choice for RGB.

Always supply at least one RGB and one CMYK to ensure conversion can succeed.

    // We usually want our OutputProfile to have an "Output Intent", so choose 
    // one. We'll go with the FOGRA CMYK profile for now, but see below
    // for some real-world advice
    target = new OutputProfile(target);
    ColorSpace fogra39 = new ICCColorSpace(new File("Coated_Fogra39L_VIGC_300.icc"));
    ColorSpace srgb = ColorSpace.getInstance(ColorSpace.CS_sRGB);
    OutputIntent intent = new OutputIntent("GTS_PDFA1", null, fogra39);
    target.getOutputIntents().add(intent);

    List<ColorSpace> cslist = new ArrayList<>();
    cslist.add(srgb);
    cslist.add(fogra39);
    OutputProfiler.ProcessColorAction action = new OutputProfiler.ProcessColorAction(target, cslist);
    profiler.setColorAction(action);
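
Since conversion relies on having at least one RGB and one CMYK profile available, it can be worth checking the list before handing it over. The sketch below uses only the standard java.awt.color.ColorSpace type, so it should apply to any ColorSpace object; ColorSpaceCheck and coversRgbAndCmyk are hypothetical names, not part of the BFO API.

```java
import java.awt.color.ColorSpace;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the BFO API.
final class ColorSpaceCheck {

    // True if the list contains at least one RGB and at least one CMYK space.
    static boolean coversRgbAndCmyk(List<? extends ColorSpace> spaces) {
        boolean rgb = false;
        boolean cmyk = false;
        for (ColorSpace cs : spaces) {
            rgb |= cs.getType() == ColorSpace.TYPE_RGB;
            cmyk |= cs.getType() == ColorSpace.TYPE_CMYK;
        }
        return rgb && cmyk;
    }

    public static void main(String[] args) {
        ColorSpace srgb = ColorSpace.getInstance(ColorSpace.CS_sRGB);
        // sRGB alone is not enough: no CMYK space has been supplied.
        System.out.println(coversRgbAndCmyk(Arrays.asList(srgb)));   // prints false
    }
}
```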

Strategy: what to do with the rest

Any conversion from PDF to PDF/A potentially involves data loss: for example, PDF/A-1 doesn't allow embedded files, so if they exist they need to be deleted. But an API can't just delete content from your document without your instruction! We need to give you, the developer, some control over this process.

For that we have a new setStrategy() method. The Default strategy will not delete content from the PDF, and conversion may fail as a result - you can deal with it in your code.

We have other strategies for conversion, and we suspect the most useful will be JustFixIt - it does whatever it takes to make the file compliant with your target profile. If we need to delete attached files, or remove digital signatures to do this, choose this strategy and we can.

    profiler.setStrategy(OutputProfiler.Strategy.JustFixIt);

JustFixIt is a shorthand for a combination of several other strategies, so it's possible to customize the process. See the API docs for more detail.

Rasterizing where required, Rebuilding as a last resort

With PDF/A-1 in particular, we may have to rasterize the document due to features we just can't work around, such as transparency. With PDF/A-2 or later it's much less common (required on only 316 of our 100,561 test pages), but it can still happen: for example, if the PDF nests save/restore operations deeper than the recommended maximum of 28.
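
To make the save/restore limit concrete: in a page's content stream, q saves the graphics state and Q restores it, so the nesting depth is simply how many q operators are open at once. The sketch below measures that depth. It assumes the content stream has already been split into operator tokens, which is the hard part in practice; SaveRestoreDepth and maxDepth are hypothetical names, and this is not how the library itself performs the check.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the BFO API.
final class SaveRestoreDepth {

    // Returns the deepest q/Q nesting reached in a list of content-stream operators.
    static int maxDepth(List<String> operators) {
        int depth = 0;
        int max = 0;
        for (String op : operators) {
            if (op.equals("q")) {
                depth++;
                max = Math.max(max, depth);
            } else if (op.equals("Q") && depth > 0) {
                depth--;   // an unbalanced Q is ignored rather than going negative
            }
        }
        return max;
    }

    public static void main(String[] args) {
        List<String> ops = Arrays.asList("q", "q", "re", "f", "Q", "q", "Q", "Q");
        System.out.println(maxDepth(ops));   // prints 2
    }
}
```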

Where previously we would fail with an Exception and let you sort out the rasterization yourself, we now have a new Rasterizing Action which will do this for you. This approach means we'll only rasterize the pages that are causing problems, and we'll overlay the rasterized page with invisible text, retaining any structure on the page for the PDF structure tree. Text will remain selectable and searchable, and the PDF can continue to meet the requirements of PDF/A-1 or PDF/UA.

(There will be cases where we can't overlay the text - for example, where the text on the page uses an undefined encoding. If this is the case, you'll just get the plain bitmap with no invisible text.)

    OutputProfiler.RasterizingAction ra = new OutputProfiler.RasterizingAction();
    ra.setDPI(200); // the default
    profiler.setRasterizingAction(ra);
 
    // Setup all done! Convert the PDF to PDF/A
    profiler.apply(target);

Even after all of this, there are still cases where the resulting PDF is not PDF/A. Most likely this is due to fundamental architectural limits (such as arrays with more than 8191 entries) which are just not allowed. All we can do here is "rebuild" - clear out the PDF entirely, then put back only data which we know is OK. This is very much a last resort, and not common - of 5000 tests this occurred in just 36, and 31 of those were test files designed to provoke this situation.

However it will happen if no suitable font can be found for substitution, for example, or if our code failed to convert for any other reason.

Think of the "Rebuild" step as the insurance step. If you absolutely, positively need a PDF/A at the other end of this process, enable Rebuild. In our example, it's already enabled - it's part of the JustFixIt strategy we set above.

Real world experience

What we've got above, in about 40 lines of code, is a very simple example to convert a PDF to PDF/A-1b with a CMYK output intent. But in the real world there are a lot of other things to consider.

What if the PDF already meets PDF/A-1a, or A-2b? Or even PDF/UA-1 or ZUGFeRD?

Well, the above code specifies PDF/A-1b so that's all you're going to get. A better approach would be to define a set of allowed targets which you'll accept - PDF/A-1a, PDF/A-1b, PDF/A-2a, etc. If the PDF is already compliant with one of these, great. And if it claims to be compliant with one of them but isn't, that's the one we'll choose to target. Otherwise we fall back to our default.

    OutputProfile target = OutputProfile.PDFA1a_2005;
    Collection<OutputProfile> allowed = Arrays.asList(OutputProfile.PDFA1b_2005, OutputProfile.PDFA1a_2005, OutputProfile.PDFA2a /* etc */);
    Collection<OutputProfile> claimed = profile.getClaimedOutputProfiles();
    for (OutputProfile p : claimed) {
      if (allowed.contains(p)) {
        target = p;
        break;
      }
    }

Notice we've chosen PDF/A-1a, not A-1b - the difference between A and B is the conformance level, which in most cases shouldn't be a distinction that matters: it's a technical statement of what the PDF is capable of, and most workflows creating PDF/A files should have a policy similar to "use PDF/A-2a if you can, PDF/A-2u as the next choice, and PDF/A-2b as the last resort".

The AutoConformance strategy (part of the JustFixIt strategy) lets us adjust conformance to match the PDF, essentially following the above policy.

With this Strategy, use the "A" conformance level (the strictest one) as a target.
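
The policy itself is just an ordered list of preferences with a test applied to each. Purely as an illustration of that idea, and not of how AutoConformance is implemented, here is a sketch in plain Java. ConformancePolicy and firstAchievable are hypothetical names, and the predicate stands in for whatever compliance check you would really apply.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper for illustration; not part of the BFO API.
final class ConformancePolicy {

    // Returns the first candidate, in preference order, that passes the test, or null if none does.
    static <T> T firstAchievable(List<T> preferred, Predicate<T> achievable) {
        for (T candidate : preferred) {
            if (achievable.test(candidate)) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> policy = Arrays.asList("PDF/A-2a", "PDF/A-2u", "PDF/A-2b");
        // Pretend the document cannot meet level "a", so that choice is skipped.
        String chosen = firstAchievable(policy, level -> !level.endsWith("a"));
        System.out.println(chosen);   // prints PDF/A-2u
    }
}
```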

PDF/UA, ZUGFeRD and PDF/X are a little different - we don't want to actively change the document to match these targets, but if the PDF already complies with one of these then we don't want to lose that. Another slight adjustment:

    OutputProfile target = OutputProfile.PDFA1a_2005;
    Collection<OutputProfile> allowed = Arrays.asList(OutputProfile.PDFA1b_2005, OutputProfile.PDFA1a_2005, OutputProfile.PDFA2a);
    Collection<OutputProfile> retained = Arrays.asList(OutputProfile.PDFUA1, OutputProfile.PDFX4, OutputProfile.ZUGFeRD1_Basic, OutputProfile.ZUGFeRD1_Comfort, OutputProfile.ZUGFeRD1_Extended);
    Collection<OutputProfile> claimed = profile.getClaimedOutputProfiles();
    for (OutputProfile p : claimed) {
      if (allowed.contains(p)) {
        target = p;
        break;
      }
    }
    target = new OutputProfile(target);
    for (OutputProfile p : claimed) {
      if (retained.contains(p) && profile.isCompatibleWith(p) == null) {
        try {
          target.merge(p, profile);
        } catch (ProfileComplianceException e) {
          // Combination is not possible.
        }
      }
    }

To keep things simple, we haven't shown how to remove a claim of PDF/UA etc. if it can't be met. It's shown in the attached example.

What do I use for an OutputIntent? CMYK or RGB?

Your first choice will typically be to reuse any OutputIntent in the original document; the original author knows best. We can only reuse it if it's valid for PDF/A, but this test is fairly easy - we extend the above code that sets our "target" like so:

   for (OutputIntent intent : profile.getOutputIntents()) {
     if (intent.isCompatibleWith(target) == null) {
       target.getOutputIntents().add(new OutputIntent("GTS_PDFA1", intent));
     }
   }

But if after that you still don't have a GTS_PDFA1 OutputIntent, you'll need to choose either CMYK or RGB. Unfortunately this is a slightly complicated choice to make - Acrobat, and possibly other tools, make a decision on whether to display a PDF with simulated overprint or not based on a few factors, one of which seems to be the Output Intent of the document. So this choice is significant.

The ad-hoc algorithm we're using is subject to revision, but is currently: if the PDF makes use of Overprint, CMYK blending, has a Cyan, Magenta or Yellow separation, or if it doesn't make use of device-dependent RGB color, it's probably best to use CMYK. Otherwise, we use RGB.

To implement this, we add this block after the block above.

   if (target.getOutputIntents().isEmpty()) {
     boolean cmyk = false;
     for (OutputProfile.Separation s : profile.getColorSeparations()) {
       String n = s.getName();
       cmyk |= n.equals("Cyan") || n.equals("Magenta") || n.equals("Yellow");
     }
     cmyk |= profile.isSet(OutputProfile.Feature.TransparencyGroupCMYK);
     cmyk |= profile.isSet(OutputProfile.Feature.Overprint);
     cmyk |= !profile.isSet(OutputProfile.Feature.ColorSpaceDeviceRGB);
     ColorSpace cs = cmyk ? fogra39 : srgb;
     OutputIntent intent = new OutputIntent("GTS_PDFA1", null, cs);
     if (intent.isCompatibleWith(target) == null) {
       target.getOutputIntents().add(intent);
     }
   }

Other options you might want to consider are reusing an ICC profile from the incoming PDF, and keeping any existing non-PDF/A Output Intents - a PDF can be both PDF/A and PDF/X-4, with two Output Intents, so long as they both refer to the same color space (although earlier versions of PDF/X disallow this). The attached example shows how to do both.
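
If you do keep two Output Intents, one quick pre-check is whether their ICC profiles are byte-for-byte identical. The sketch below does that with the standard java.awt.color.ICC_Profile class. Treating identical bytes as "the same color space" is an assumption made for this illustration: it is a sufficient test rather than the exact rule a validator applies. IccProfileCompare and sameBytes are hypothetical names.

```java
import java.awt.color.ColorSpace;
import java.awt.color.ICC_Profile;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the BFO API.
final class IccProfileCompare {

    // True if the two profiles serialize to exactly the same bytes.
    static boolean sameBytes(ICC_Profile a, ICC_Profile b) {
        return Arrays.equals(a.getData(), b.getData());
    }

    public static void main(String[] args) {
        // Built-in JDK profiles are used here only so the example runs without external files.
        ICC_Profile srgb = ICC_Profile.getInstance(ColorSpace.CS_sRGB);
        ICC_Profile gray = ICC_Profile.getInstance(ColorSpace.CS_GRAY);
        System.out.println(sameBytes(srgb, srgb));
        System.out.println(sameBytes(srgb, gray));
    }
}
```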

Conclusion

The ConvertToPDFA.java example (download) encapsulates everything described above. It's a reusable class which you can incorporate in your own project, and if you're in the business of converting PDF to PDF/A, we hope it's going to make your life a lot easier. The example is also included in the examples folder of the PDF library download.

In the event the process needs revising, we'll keep this article (and the example) up to date with footnotes. In particular conversion to PDF/A-4 has only had minimal testing at this point, so watch this space.

Tags: PDFA convert

Posted by Mike Bremford on 04 Aug 2021 at 17:00
