PDF Library 2.18 and the OutputProfiler class

Release 2.18 of our PDF Library includes our new preflighting functionality. Previously, if you wanted to identify which features are present in a PDF and optionally modify those features to bring the PDF into line with an output profile, like PDF/A or PDF/X, you'd use two of the methods on the PDF class: getFullOutputProfile and setOutputProfile.

While this worked for the basic case described above, it had a few problems.

  1. Determining the existing OutputProfile on a PDF can take quite a long time, which is a problem as it was typically done in the calling thread. There was no way to check the progress of or cancel the operation.
  2. Setting a new OutputProfile on the PDF would try to adapt the PDF to the new profile's requirements, but that process was limited to items we could repair trivially. More complex operations, like replacing fonts or colors, couldn't be specified.
  3. Once the profile on a PDF was determined, there was no obvious way to manage how that result was cached. Should a subsequent call to PDF.getFullOutputProfile return the same object, or should it re-run the profiling?

Fixing those problems was more than we could do with the existing API, which is why those two methods have been deprecated. The OutputProfiler class is the replacement, and here we'll go into how it works.

For those of you using the deprecated methods, they will continue to work. You do not need to change your code unless you want to remove the deprecation warnings.

If you want to upgrade your code to avoid the deprecation warning, that's pretty simple. You can replace a call to pdf.getFullOutputProfile() with this:

OutputProfiler profiler = new OutputProfiler(new PDFParser(pdf));
OutputProfile profile = profiler.getProfile();
  
If you then call pdf.setOutputProfile(target), you can replace that call with one more line:

OutputProfiler profiler = new OutputProfiler(new PDFParser(pdf));
OutputProfile profile = profiler.getProfile();
profiler.apply(target);
  

Background-thread friendly

The OutputProfiler class can be run in a background thread while another thread reads from the PDF. Modifying the PDF while it's being profiled will lead to errors, so don't do that, but reading the content (for example turning the page into a bitmap, as we do in the viewer) is fine. The run and isRunning methods will get you started if you want to background-thread the process, and the API docs go into more detail on how.
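
As a rough illustration of the general pattern (run the slow job on a worker thread, let the caller check progress or cancel), here's a minimal sketch using only java.util.concurrent. The SlowTask class is a hypothetical stand-in for a long-running profiling job and is not part of the PDF Library; for the real thing, use the run and isRunning methods as described in the API docs.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class BackgroundTaskSketch {

    /** Hypothetical stand-in for a long-running job such as profiling a PDF. */
    static class SlowTask implements Runnable {
        private final int steps;
        private final AtomicInteger done = new AtomicInteger();

        SlowTask(int steps) {
            this.steps = steps;
        }

        /** Fraction of the work completed so far, from 0 to 1. */
        float getProgress() {
            return done.get() / (float) steps;
        }

        @Override
        public void run() {
            for (int i = 0; i < steps; i++) {
                if (Thread.currentThread().isInterrupted()) {
                    return;             // cancelled: stop early
                }
                done.incrementAndGet(); // one unit of "work"
            }
        }
    }

    /** Runs the task on a worker thread, waits for it, and returns the final progress. */
    static float runToCompletion(SlowTask task) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<?> future = pool.submit(task);
            // A UI thread would poll task.getProgress() here instead of blocking,
            // and could call future.cancel(true) to interrupt the worker.
            future.get(10, TimeUnit.SECONDS);
            return task.getProgress();
        } catch (Exception e) {
            throw new IllegalStateException("task did not complete", e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(runToCompletion(new SlowTask(1000)));
    }
}
```

The point of the design is that the calling thread never does the slow work itself, which is exactly what the old getFullOutputProfile method couldn't offer.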

Replacing fonts in the PDF with a "FontAction"

The really big benefit of the OutputProfiler class is the new actions that can be run on the PDF. These allow you to make fairly sweeping changes to the document content, in particular to some of the areas that typically cause problems during preflighting: fonts, colors and images.

PDF/A and PDF/X both require all the fonts in a PDF to be embedded, so if that's not the case you have two options: turn the page into a bitmap, or replace the fonts. We've covered the conversion to bitmap approach before and that's still a good option, but if you want to convert the fonts you can now set a FontAction on the OutputProfiler before calling apply. Its getFont method will be called every time a font is specified in the PDF, and may return a replacement font.

We provide an implementation of this interface called AutoEmbeddingFontAction, which will replace any unembedded fonts with embedded ones. It will attempt to identify the correct font to use with heuristics, a word which sounds much better than guesswork but boils down to the same thing:

Give this class a set of embedded fonts, and we will try to find the best match based on the font's name, the glyph metrics (how wide each character is), and whether the fonts share the same basic properties - Serif, Bold, Italic and so on. Here's a complete example which will process a PDF and replace any unembedded fonts in the file with their "best" match from the Windows "Fonts" directory.

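import java.io.*;
import org.faceless.pdf2.*;

PDF pdf = new PDF(new PDFReader(file));
OutputProfiler profiler = new OutputProfiler(new PDFParser(pdf));
OutputProfiler.AutoEmbeddingFontAction fontaction = new OutputProfiler.AutoEmbeddingFontAction();
// Offer every TrueType font in the Windows "Fonts" directory as a candidate
File[] fontfiles = new File("C:\\Windows\\Fonts").listFiles();
for (int i=0;i<fontfiles.length;i++) {
  // Windows font files are often named in upper case, so compare case-insensitively
  if (fontfiles[i].getName().toLowerCase().endsWith(".ttf")) {
    OpenTypeFont font = new OpenTypeFont(new FileInputStream(fontfiles[i]), 2);
    fontaction.add(font);
  }
}
profiler.setFontAction(fontaction);
// OutputProfile.Default requires no other changes, so only the fonts are replaced
profiler.apply(OutputProfile.Default);
pdf.render(new FileOutputStream(outfile));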
  

This example will only replace the fonts, and will not modify the PDF in any other way - the OutputProfile.Default profile doesn't require any changes to be made. Typically you'd replace the fonts as part of a larger conversion to PDF/A, and we'll show this below.
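
To give a feel for what "best match" can mean, here's an illustrative sketch that scores candidates on the three criteria mentioned above: name, glyph metrics and basic properties. It is not how AutoEmbeddingFontAction is actually implemented; the FontInfo class, the weights and the sample values are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class FontMatchSketch {

    /** Hypothetical summary of a font; not a PDF Library type. */
    static class FontInfo {
        final String name;
        final boolean serif, bold, italic;
        final int[] advances;   // glyph advances for a shared set of characters

        FontInfo(String name, boolean serif, boolean bold, boolean italic, int... advances) {
            this.name = name;
            this.serif = serif;
            this.bold = bold;
            this.italic = italic;
            this.advances = advances;
        }
    }

    /** Lower-cases the name and strips punctuation so "Arial-Bold" and "arial bold" compare equal. */
    static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    /** Higher is better. The weights are arbitrary and chosen only for illustration. */
    static double score(FontInfo wanted, FontInfo candidate) {
        double score = 0;
        if (normalize(wanted.name).equals(normalize(candidate.name))) score += 10;
        if (wanted.serif == candidate.serif) score += 1;
        if (wanted.bold == candidate.bold) score += 1;
        if (wanted.italic == candidate.italic) score += 1;
        // Penalise the mean relative difference in glyph advance
        int n = Math.min(wanted.advances.length, candidate.advances.length);
        double diff = 0;
        for (int i = 0; i < n; i++) {
            diff += Math.abs(wanted.advances[i] - candidate.advances[i])
                    / (double) Math.max(1, wanted.advances[i]);
        }
        if (n > 0) score -= 5 * diff / n;
        return score;
    }

    /** Returns the highest-scoring candidate, or null if the list is empty. */
    static FontInfo bestMatch(FontInfo wanted, List<FontInfo> candidates) {
        FontInfo best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (FontInfo c : candidates) {
            double s = score(wanted, c);
            if (s > bestScore) {
                bestScore = s;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        FontInfo wanted = new FontInfo("Helvetica", false, false, false, 278, 556, 556);
        FontInfo times = new FontInfo("Times-Roman", true, false, false, 250, 500, 500);
        FontInfo arial = new FontInfo("Arial", false, false, false, 278, 556, 556);
        System.out.println(bestMatch(wanted, Arrays.asList(times, arial)).name);
    }
}
```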

Replacing unembedded fonts is not the only possibility. Say you wanted to ensure that a PDF didn't embed any fonts that had restrictions on embedding. Provided you have a list of those fonts, this is easily done - replace the FontAction in the above example with something like this:

OutputProfiler.FontAction fontaction = new OutputProfiler.FontAction() {
  public PDFFont getFont(OutputProfiler profiler, String name, boolean embedded, PDFFont font) {
    if (embedded && disallowedfonts.contains(name)) {
      return appropriateSubstituteFont;
    }
    return null;
  }
};
  
One important caveat: replacing a font will not reflow the document. PDF is not a reflowable format, and any glyphs in the new font should ideally be roughly the same size (specifically, the advance should be the same) as glyphs in the font being replaced. The new glyphs will be stretched or squeezed to match the original metrics, and for extreme cases (such as replacing a monospaced font with a proportional one) this will lead to glyph distortion.
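
The amount of stretching or squeezing is simply the ratio of the two advances. Here's a small sketch of that arithmetic; the class and method names are ours for illustration and are not part of the library, and the sample values in the usage note are made up.

```java
class GlyphFitSketch {

    /**
     * Horizontal scale needed for a replacement glyph to occupy the
     * advance of the glyph it replaces. 1.0 means no distortion.
     */
    static double horizontalScale(double originalAdvance, double replacementAdvance) {
        if (replacementAdvance <= 0) {
            throw new IllegalArgumentException("replacement advance must be positive");
        }
        return originalAdvance / replacementAdvance;
    }

    public static void main(String[] args) {
        // Same advance: the glyph is drawn as designed
        System.out.println(horizontalScale(500, 500));
        // A narrow glyph replacing a wide monospaced one is stretched
        System.out.println(horizontalScale(600, 300));
    }
}
```

With these made-up numbers, a 300-unit glyph standing in for a 600-unit monospaced glyph has to be drawn at twice its natural width, which is why swapping a monospaced font for a proportional one looks so bad.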

Replacing Colors with a "ColorAction"

PDF/A and PDF/X also place restrictions on which colors (more accurately, which ColorSpaces) can be in the PDF. All colors must be calibrated, which is to say they must include details on how to convert to the CIE XYZ ColorSpace. PDF/X additionally requires that the colors are subtractive, i.e. they're not RGB.

This means that any Color specified in the PDF must either be explicitly part of a calibrated ColorSpace, or it must be able to be interpreted as part of the output intent of the PDF.

The output intent is the device the PDF is intended to be displayed on, and must be specified for PDF/A and PDF/X documents. For PDF/X it's usually the ICC profile of the intended printer; for PDF/A, any ICC profile will do (a slight oversimplification) and the sRGB space is commonly used.

For a device-dependent color to be allowed, it must be convertible to this ColorSpace - this means RGB for an RGB profile, and CMYK or gray for a CMYK profile. Any color that doesn't meet these requirements must be converted.
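
That rule is small enough to write down as code. The sketch below encodes it exactly as stated in the previous paragraph and nothing more; the enum and method names are hypothetical and don't correspond to anything in the library.

```java
class OutputIntentSketch {

    /** The three uncalibrated, device-dependent color types. */
    enum DeviceSpace { GRAY, RGB, CMYK }

    /** The kind of ICC profile used as the output intent. */
    enum IntentType { RGB, CMYK }

    /**
     * Returns true if a device-dependent color has to be converted:
     * an RGB intent accepts only RGB, a CMYK intent accepts CMYK or gray.
     */
    static boolean needsConversion(DeviceSpace color, IntentType intent) {
        switch (intent) {
            case RGB:
                return color != DeviceSpace.RGB;
            case CMYK:
                return color == DeviceSpace.RGB;
            default:
                throw new IllegalArgumentException("unknown intent: " + intent);
        }
    }

    public static void main(String[] args) {
        for (IntentType intent : IntentType.values()) {
            for (DeviceSpace color : DeviceSpace.values()) {
                System.out.println(color + " with " + intent + " intent: convert="
                        + needsConversion(color, intent));
            }
        }
    }
}
```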

Color Conversion is a complex matter and we're not going to go into the details too much, but for the case described above we supply a standard ColorAction: the ProcessColorAction will convert uncalibrated RGB, CMYK or grayscale colors to the specified ColorSpace. If any Spot colors are defined against an uncalibrated ColorSpace, they'll be redefined to map to the new ColorSpace. Here's how to use it:

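import java.awt.color.ColorSpace;
import java.awt.color.ICC_Profile;
import java.io.*;
import org.faceless.pdf2.*;

PDF pdf = new PDF(new PDFReader(file));
// The standard sRGB profile that ships with Java
ICC_Profile icc = ICC_Profile.getInstance(ColorSpace.CS_sRGB);
OutputProfiler profiler = new OutputProfiler(new PDFParser(pdf));
OutputProfiler.ColorAction coloraction = new OutputProfiler.ProcessColorAction(icc);
profiler.setColorAction(coloraction);
// OutputProfile.Default requires no other changes, so only the colors are converted
profiler.apply(OutputProfile.Default);
pdf.render(new FileOutputStream(outfile));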
  

As with our FontAction example above, this does nothing more than convert all uncalibrated colors in the PDF to sRGB. Typically this would be done as part of a larger conversion to PDF/A or PDF/X, which we'll demonstrate below.

With your own implementation of ColorAction there are many other possibilities. Replace the ColorAction in the above example with this one to convert all colors in the PDF to grayscale:

coloraction = new OutputProfiler.ColorAction() {
  final ColorSpace target = ColorSpace.getInstance(ColorSpace.CS_GRAY);
  public ColorSpace changeColor(OutputProfiler profiler, ColorSpace cs, float[] src, float[] dst, boolean fill, int type) {
    if (dst != null) {
      // Convert to XYZ, then use Y value with gamma of 2.2
      src = cs.toCIEXYZ(src);
      float g = src[1];
      dst[0] = (float)Math.pow(g, 1 / 2.2);
    }
    return target;
  }
};
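
If you want to sanity-check that arithmetic without a PDF, here's a stand-alone sketch using only java.awt.color. The class and method names are ours, and the clamp to the 0..1 range is an extra safety step that isn't in the ColorAction above.

```java
import java.awt.color.ColorSpace;

class GrayFromXYZSketch {

    /**
     * Converts a color to a single gray level: take the Y (luminance)
     * component of its CIE XYZ value and apply a gamma of 2.2.
     */
    static float toGray(ColorSpace cs, float[] components) {
        float[] xyz = cs.toCIEXYZ(components);
        // Clamp before the power function to guard against rounding outside 0..1
        float y = Math.max(0f, Math.min(1f, xyz[1]));
        return (float) Math.pow(y, 1 / 2.2);
    }

    public static void main(String[] args) {
        ColorSpace srgb = ColorSpace.getInstance(ColorSpace.CS_sRGB);
        System.out.println("white: " + toGray(srgb, new float[] {1f, 1f, 1f}));
        System.out.println("red:   " + toGray(srgb, new float[] {1f, 0f, 0f}));
        System.out.println("green: " + toGray(srgb, new float[] {0f, 1f, 0f}));
        System.out.println("blue:  " + toGray(srgb, new float[] {0f, 0f, 1f}));
    }
}
```

Green should come out lighter than red, and red lighter than blue, which matches how bright those primaries appear to the eye.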
  

Resampling images

The OutputProfiler can also downsample images. This is probably of less importance in modern workflows, but can still be useful for documents intended for download, where file size is more important than fidelity. All images in the PDF are categorised as 1-bit, grayscale or color, and these may be downsampled with the setMaxImageDPI method.
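
For reference, the DPI of an image in a PDF depends on how large it's drawn on the page, not just on its pixel dimensions. The sketch below shows that arithmetic, using the fact that one PDF point is 1/72 of an inch. It's an illustration only: the names are hypothetical, and it makes no claim about how setMaxImageDPI chooses the new size.

```java
class ImageDpiSketch {

    /** Effective resolution of an image that is `pixels` wide and drawn `points` wide on the page. */
    static double effectiveDpi(int pixels, double points) {
        return pixels * 72.0 / points;
    }

    /** Pixel width after downsampling to at most maxDpi; unchanged if already at or below the limit. */
    static int targetPixels(int pixels, double points, double maxDpi) {
        if (effectiveDpi(pixels, points) <= maxDpi) {
            return pixels;
        }
        return (int) Math.ceil(points / 72.0 * maxDpi);
    }

    public static void main(String[] args) {
        // A 1200 pixel wide image drawn 2 inches (144pt) wide
        System.out.println(effectiveDpi(1200, 144));
        System.out.println(targetPixels(1200, 144, 300));
    }
}
```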

Conclusion

These new operations should mean many more PDF documents can be converted to PDF/A without having to convert them to bitmap. There are still exceptions: transparency is disallowed in PDF/X-3 and PDF/A-1, so a PDF with transparency will still need to be rasterized. However, we hope that for those of you tasked with converting large numbers of PDFs to PDF/A for storage, this will both reduce the size of your archives and make them more useful: a bitmap PDF cannot be searched for text.

We've included a new example with the download package which implements much of the code described above. See the Preflight.java example in the examples directory of the download.

Finally, if you come up with any interesting use cases for the Font or Color actions described above, or find the functionality doesn't quite meet your requirements, drop us a line at support@bfo.com, as we'd like to hear how this functionality is being used and how it can be improved.