Converting PDFs to bitmap PDFs

When only a raster will do, how to do it efficiently

Converting between PDF and bitmaps is a topic that we just keep coming back to. In this article we're going to look at a pretty common scenario - converting a PDF to a rasterized PDF, which is nothing more complicated than a PDF that contains nothing but images.

Why would you want to do this? Well, typically it's when you're converting a PDF to PDF/A or PDF/X, but the PDF has something about it that you can't fix - transparency perhaps, or unembedded fonts. Converting the page to a bitmap image of the page is the least-bad option here (the best option, of course, is to start with a PDF constructed with this in mind, which we will leave as an exercise for the reader).

The trivially correct approach

When faced with this problem, most people start by converting the PDF to TIFF, then convert the TIFF back to PDF. You can do this in about 10 lines of code and it will work, but it has a few problems - most obviously, it performs needless work in writing the TIFF structure, and potentially requires compression then recompression of the image (although this usually won't be the case if you use our PDFParser.writeAsTIFF method).

You can work around this by doing something like the following:

// oldpdf is the PDF being converted
PDFParser parser = new PDFParser(oldpdf);
PDF newpdf = new PDF();
for (int i=0;i<oldpdf.getNumberOfPages();i++) {
    PDFPage oldpage = oldpdf.getPage(i);
    PagePainter painter = parser.getPagePainter(oldpage);
    // Render the page at 200dpi in RGB
    BufferedImage image = painter.getImage(200, PDFParser.RGB);
    PDFImage pdfimage = new PDFImage(image);
    // The new page is the same size as the old one, and the bitmap fills it
    PDFPage newpage = newpdf.newPage((int)oldpage.getWidth(), (int)oldpage.getHeight());
    newpage.drawImage(pdfimage, 0, 0, oldpage.getWidth(), oldpage.getHeight());
}
This iterates over the list of pages in the PDF, converts each one to a BufferedImage, then creates a PDFImage from that and draws it onto a new page in a new PDF.

The parameters 200 and PDFParser.RGB to the getImage method are the resolution (dots per inch) and Color Model to use respectively, and this leads us to the problem with this approach.

Why is my file so much larger?

If you're not sure what's in your PDF you'll probably render in color to be sure you don't lose any content, and at 200dpi (which is usually enough in color) that's a lot of pixel data! If your original file contained just black text on a white background, you could easily see an increase in filesize from a few KB per page to several hundred KB.

Multiply this by several thousand documents and you have a problem.
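To put rough numbers on that, the uncompressed pixel data for a single page can be worked out from the page size, resolution and bit depth. The sketch below assumes a US Letter page (8.5 x 11 inches) and rows padded to a whole byte; the class and method names are purely illustrative and are not part of the PDF Library.

```java
final class RasterSize {
    /**
     * Bytes of uncompressed pixel data for one page.
     * Each row is padded up to a whole number of bytes.
     */
    static long rawBytes(double widthInches, double heightInches, int dpi, int bitsPerPixel) {
        long w = Math.round(widthInches * dpi);
        long h = Math.round(heightInches * dpi);
        long bytesPerRow = (w * bitsPerPixel + 7) / 8;
        return bytesPerRow * h;
    }

    public static void main(String[] args) {
        // Assumed page size: US Letter, 8.5 x 11 inches
        System.out.println("200dpi RGB:       " + rawBytes(8.5, 11, 200, 24)); // 11220000
        System.out.println("200dpi grayscale: " + rawBytes(8.5, 11, 200, 8));  // 3740000
        System.out.println("300dpi 1-bit:     " + rawBytes(8.5, 11, 300, 1));  // 1052700
    }
}
```

These are raw sizes before any compression, so the files written will be much smaller, but the ratio is the point: even at the higher resolution, the 1-bit page starts out with less than a tenth of the data of the 200dpi color one.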

Ideally you'd save only pages with color content in color, and everything else in black and white, but how do you know which page has what?

The 2.13.2 release of the PDF Library adds support for this via some new OutputProfile features that identify color, grayscale or black and white content on a page. You can find out which features apply to a particular page by rendering just that page, and then checking the OutputProfile after rendering.

Here's a modified version of the above code with this functionality added:

PDFParser parser = new PDFParser(oldpdf);
PDF newpdf = new PDF();
for (int i=0;i<oldpdf.getNumberOfPages();i++) {
    PDFPage oldpage = oldpdf.getPage(i);
    // Use a fresh profile for each page, so it reflects only that page
    OutputProfile profile = new OutputProfile(OutputProfile.Default);
    parser.setOutputProfile(profile);
    PagePainter painter = parser.getPagePainter(oldpage);

    // The profile's features are only set once the page has been rendered
    BufferedImage bufimage = painter.getImage(200, PDFParser.RGB);
    if (profile.isSet(OutputProfile.Feature.ColorImage) || profile.isSet(OutputProfile.Feature.ColorContent)) {
        // Page had color content, do something
    } else if (profile.isSet(OutputProfile.Feature.GrayscaleImage) || profile.isSet(OutputProfile.Feature.GrayscaleContent)) {
        // Page had grayscale content, do something
    } else {
        // Page had only pure black and white content, do something.
    }

    PDFImage pdfimage = new PDFImage(bufimage);
    PDFPage newpage = newpdf.newPage((int)oldpage.getWidth(), (int)oldpage.getHeight());
    newpage.drawImage(pdfimage, 0, 0, oldpage.getWidth(), oldpage.getHeight());
}

Processing the images

This shows how to differentiate a page with color content from one with grayscale or black & white. From there, one option is simply to convert the color bitmap to 1-bit for black & white, but this leads to fidelity problems: what is an acceptable resolution for color might not be good enough for 1-bit. For example, 200dpi gives a very good quality color image, but at 1-bit it's fax resolution, which can be a bit blocky.

Let's assume this is a concern, and that your chosen resolution for color and grayscale is 200dpi and for black & white is 300dpi. This gives the following options:

  1. Rasterize at 300dpi in black & white; if it's color or grayscale, rasterize a second time at 200dpi with the correct color model.
  2. Rasterize at 200dpi in color; if it's grayscale remove the color information; if it's black & white, rasterize a second time at 300dpi in black & white.
  3. Rasterize at 300dpi in color; if it's color or grayscale downsample to 200dpi before removing color information; if it's black & white just convert to black & white.

Which of these is best is hard to say: the first option will rasterize twice if it's color or grayscale, and the second option will rasterize twice if it's black & white. The last option requires a 300dpi color image, which will require the most memory; however it only rasterizes the PDF once.
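The trade-off described above can be written down as a small cost model. This is only a sketch of the reasoning, not code from the PDF Library: the RasterPlan class, the PageKind enum and the method names are all made up for illustration, and the 200/300dpi targets are the ones assumed earlier in this article.

```java
final class RasterPlan {
    enum PageKind { COLOR, GRAYSCALE, BLACK_AND_WHITE }

    /** Resolution the finished bitmap should have for this kind of page. */
    static int targetDpi(PageKind kind) {
        return kind == PageKind.BLACK_AND_WHITE ? 300 : 200;
    }

    /** Bits per pixel the finished bitmap should have for this kind of page. */
    static int targetBitsPerPixel(PageKind kind) {
        switch (kind) {
            case COLOR:     return 24;
            case GRAYSCALE: return 8;
            default:        return 1;
        }
    }

    /** How many times a page of this kind is rasterized under option 1, 2 or 3. */
    static int rasterPasses(int option, PageKind kind) {
        switch (option) {
            case 1: return kind == PageKind.BLACK_AND_WHITE ? 1 : 2; // first guess is B&W
            case 2: return kind == PageKind.BLACK_AND_WHITE ? 2 : 1; // first guess is color
            case 3: return 1;                                        // always one big render
            default: throw new IllegalArgumentException("option must be 1, 2 or 3");
        }
    }

    /** Total rasterization passes for a document with the given mix of pages. */
    static int totalPasses(int option, int colorPages, int grayPages, int bwPages) {
        return colorPages * rasterPasses(option, PageKind.COLOR)
             + grayPages * rasterPasses(option, PageKind.GRAYSCALE)
             + bwPages * rasterPasses(option, PageKind.BLACK_AND_WHITE);
    }
}
```

For a hypothetical 50-page document with 2 color pages and 48 black & white ones, totalPasses gives 52 for option 1, 98 for option 2 and 50 for option 3. Pass counts alone don't settle it though, because the passes are not equally expensive.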

Which is quicker is clearly going to depend on how long it takes to rasterize, and that is entirely dependent on the source document: if your PDF is mostly text, rasterizing will be quick, but if it contains large images or color-spaces other than RGB then it will be much slower.

We need some numbers. Time to benchmark.

Implementations

The complete code for this article can be downloaded at this link.

For the first option, we'll replace the part of the above example that renders the page and tests the profile (from the getImage call to the end of the if/else chain) with this:
// Option 1
BufferedImage image = painter.getImage(300, PDFParser.BLACKANDWHITE);
if (profile.isSet(OutputProfile.Feature.ColorImage) || profile.isSet(OutputProfile.Feature.ColorContent)) {
    image = painter.getImage(200, PDFParser.RGB);
} else if (profile.isSet(OutputProfile.Feature.GrayscaleImage) || profile.isSet(OutputProfile.Feature.GrayscaleContent)) {
    image = painter.getImage(200, PDFParser.GRAYSCALE);
} else {
    // No change required
}
For option two, here's the block:
// Option 2
BufferedImage image = painter.getImage(200, PDFParser.RGB);
if (profile.isSet(OutputProfile.Feature.ColorImage) || profile.isSet(OutputProfile.Feature.ColorContent)) {
    // No change required
} else if (profile.isSet(OutputProfile.Feature.GrayscaleImage) || profile.isSet(OutputProfile.Feature.GrayscaleContent)) {
    // Keep 200dpi, reduce to 8-bit grayscale (fixImage is listed below)
    image = fixImage(image, 200, 200, 8);
} else {
    // Rasterize a second time, at 300dpi in black & white
    image = painter.getImage(300, PDFParser.BLACKANDWHITE);
}
And here's the code for option 3. The resampling code is detailed below:
// Option 3
BufferedImage image = painter.getImage(300, PDFParser.RGB);
if (profile.isSet(OutputProfile.Feature.ColorImage) || profile.isSet(OutputProfile.Feature.ColorContent)) {
    // Downsample to 200dpi, keep 24-bit color
    image = fixImage(image, 300, 200, 24);
} else if (profile.isSet(OutputProfile.Feature.GrayscaleImage) || profile.isSet(OutputProfile.Feature.GrayscaleContent)) {
    // Downsample to 200dpi, reduce to 8-bit grayscale
    image = fixImage(image, 300, 200, 8);
} else {
    // Keep 300dpi, reduce to 1-bit black & white
    image = fixImage(image, 300, 300, 1);
}

Option 3 requires code to downsample the image from 300 to 200dpi, and both options 2 and 3 need code to convert from 24-bit color to 8-bit grayscale or (for option 3 only) 1-bit black & white.
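Stripped of the image plumbing, the color-removal step is a weighted sum of the three channels, followed for 1-bit output by a threshold and some bit packing. Here's a minimal sketch of just that arithmetic. It assumes the common 0.3/0.59/0.11 luminance weights (approximated as 77, 151 and 28 out of 256), a caller-supplied threshold, and that a set bit means white; the MonoPack class and its methods are illustrative names, not part of any library.

```java
final class MonoPack {
    /**
     * Integer approximation of 0.3*R + 0.59*G + 0.11*B.
     * The weights 77 + 151 + 28 add up to 256, so the result stays in 0..255.
     */
    static int gray(int r, int g, int b) {
        return (77 * r + 151 * g + 28 * b) >> 8;
    }

    /**
     * Pack one row of 8-bit gray values into 1-bit pixels, 8 pixels per byte,
     * most significant bit first. A row whose width is not a multiple of 8
     * is padded with zero bits in its final byte.
     */
    static byte[] packRow(int[] grayRow, int threshold) {
        byte[] out = new byte[(grayRow.length + 7) / 8];
        for (int x = 0; x < grayRow.length; x++) {
            if (grayRow[x] > threshold) {
                out[x >> 3] |= (byte) (0x80 >> (x & 7)); // set bit = white (assumed)
            }
        }
        return out;
    }
}
```

A 10 pixel wide row packs into 2 bytes, with the last 6 bits of the second byte unused. That per-row padding is the "byte align each row" detail that any hand-written 1-bit conversion has to get right.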

The exact implementation isn't important for our purposes - the java.awt.image package has various capable but convoluted methods for this, but our experience has led us to believe it's quicker and easier to write your own code, especially if you know the input format of your image. For completeness, here's our method for downsampling and converting from 24 to 8 or 1-bit.
// Imports required by this method (plus PDFParser from the PDF Library)
import java.awt.geom.AffineTransform;
import java.awt.image.AffineTransformOp;
import java.awt.image.BufferedImage;
import java.awt.image.ColorModel;
import java.awt.image.DataBufferByte;
import java.awt.image.WritableRaster;

/**
 * Given a 24-bit RGB image, optionally resize it and/or reduce the number of bits per pixel to 8 or 1
 * @param image the image to be resized - must be 24-bit RGB with byte-based Raster (DataBufferByte)
 * @param indpi the DPI the image is currently in
 * @param outdpi the DPI to convert the image to
 * @param bpp the number of bits per pixel - 24 for RGB, 8 for Grayscale or 1 for B&W
 * @return the modified image
 */
private static BufferedImage fixImage(BufferedImage image, int indpi, int outdpi, int bpp) {
    // Assuming a DataBufferByte for both input and output image - this will always be
    // the case in this example, but some ColorModels (eg ColorModel.getRGBdefault) use
    // a DataBufferInt. So this is not a general purpose routine, but as used here this
    // assumption makes the code a lot simpler and more efficient.
    ColorModel cm = image.getColorModel();
    WritableRaster raster = image.getRaster();
    int w = raster.getWidth();
    int h = raster.getHeight();

    if (indpi != outdpi) {
        double scale = (double)outdpi / indpi;
        w *= scale;
        h *= scale;
        WritableRaster outraster = cm.createCompatibleWritableRaster(w, h);
        AffineTransform tran = AffineTransform.getScaleInstance(scale, scale);
        // This is typically hardware accelerated (or at least native code)
        // so can't be beat for performance.
        AffineTransformOp scaler = new AffineTransformOp(tran, AffineTransformOp.TYPE_BILINEAR);
        scaler.filter(raster, outraster);
        raster = outraster;
    }

    if (bpp == 8) {
        // Remove color information but keep 1 pixel per byte - easy.
        cm = PDFParser.GRAYSCALE;
        WritableRaster outraster = cm.createCompatibleWritableRaster(w, h);
        byte[] in = ((DataBufferByte)raster.getDataBuffer()).getData();
        byte[] out = ((DataBufferByte)outraster.getDataBuffer()).getData();
        int i = 0, j = 0;
        while (i < in.length) {
            int rgb = ((in[i++]&0xFF) << 16) | ((in[i++]&0xFF) << 8) | (in[i++]&0xFF);
            // This is fast way of converting RGB to Grayscale.  It's an integer
            // based version of the standard PAL/NTSC grayscale formula:
            // gray = 0.3red + 0.59green + 0.11blue
            int gray = rgb==0xFFFFFF ? 255 : ((((rgb&0xFF0000)/850) + (((rgb<<8)&0xFF0000)/432) + ((rgb<<16)&0xFF0000)/2318)) >> 8;
            out[j++] = (byte)gray;
        }
        raster = outraster;
    } else if (bpp == 1) {
        // Remove color information and reduce to 1 bit per pixel, or 8 pixels per byte.
        // Still fairly simple, we just need to byte align each row.
        cm = PDFParser.BLACKANDWHITE;
        WritableRaster outraster = cm.createCompatibleWritableRaster(w, h);
        byte[] in = ((DataBufferByte)raster.getDataBuffer()).getData();
        byte[] out = ((DataBufferByte)outraster.getDataBuffer()).getData();
        int i = 0, j = 0;
        for (int y=0;y<h;y++) {
            int x = 0, n = 0;
            for (x=0;x<w;x++) {
                int rgb = ((in[i++]&0xFF) << 16) | ((in[i++]&0xFF) << 8) | (in[i++]&0xFF);
                int gray = rgb==0xFFFFFF ? 255 : ((((rgb&0xFF0000)/850) + (((rgb<<8)&0xFF0000)/432) + ((rgb<<16)&0xFF0000)/2318)) >> 8;
                n <<= 1;
                if (gray > 128) {   // 128 is normal threshold, but you can adjust
                    n |= 1;
                }
                if ((x&7) == 7) {   // Finished 8 pixel block - push it to output
                    out[j++] = (byte)n;
                    n = 0;
                }
            }
            if ((x&7) != 0) {    // Width isn't a multiple of 8 - shift the partial byte and push it to output
                n <<= 8 - (x&7);
                out[j++] = (byte)n;
            }
        }
        raster = outraster;
    } else if (bpp != 24) {
        throw new IllegalArgumentException("bpp must be 24, 8 or 1");
    }
    if (raster != image.getRaster()) {
        image = new BufferedImage(cm, raster, false, null);
    }
    return image;
}
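
For comparison, the java.awt.image route can be sketched with nothing but the JDK. This is not the method used for the timings in this article, just a minimal sketch under some assumptions: the AwtConvert class and convert method are made-up names, and TYPE_3BYTE_BGR, TYPE_BYTE_GRAY and TYPE_BYTE_BINARY are assumed to be acceptable output types.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

final class AwtConvert {
    /**
     * Scale an image by outdpi/indpi and redraw it into a standard
     * BufferedImage type with 24, 8 or 1 bits per pixel.
     */
    static BufferedImage convert(BufferedImage src, int indpi, int outdpi, int bpp) {
        int type;
        switch (bpp) {
            case 24: type = BufferedImage.TYPE_3BYTE_BGR;   break;
            case 8:  type = BufferedImage.TYPE_BYTE_GRAY;   break;
            case 1:  type = BufferedImage.TYPE_BYTE_BINARY; break; // 1-bit black/white palette
            default: throw new IllegalArgumentException("bpp must be 24, 8 or 1");
        }
        int w = (int) Math.max(1, (long) src.getWidth() * outdpi / indpi);
        int h = (int) Math.max(1, (long) src.getHeight() * outdpi / indpi);
        BufferedImage dst = new BufferedImage(w, h, type);
        Graphics2D g = dst.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                               RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            // Drawing into the destination does the scaling and the color
            // conversion in one step; how colors are mapped is up to Java 2D.
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return dst;
    }
}
```

Calling convert(image, 300, 200, 8) corresponds roughly to fixImage(image, 300, 200, 8). The pixel values won't necessarily match, since Java 2D chooses its own grayscale conversion and there is no control over the black & white threshold, which is one reason to prefer hand-written code when the input format is known.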

Results

As we said above, the exact results are going to depend on your content, so we're going to run a few tests on various different documents:

  1. The first 50 pages of Adobe JavaScript Reference: almost all text, with a mix of pages in color, grayscale and black & white
  2. The first edition of The MagPi magazine: 32 pages, all color, lots of images, expensive to render
  3. The first 50 pages of the TIFF specification: black and white text except for a color annotation on page 1 and a grayscale image on page 14

We're also going to compare the results to the "default" approach, which is 200dpi RGB as described at the top of this article.

File                    Option 1          Option 2          Option 3          Default
                        time   filesize   time   filesize   time   filesize   time   filesize
JavaScript Reference    44s    5.7MB      50s    5.7MB      99s    8MB        70s    10MB
MagPi magazine          196s   53MB       105s   53MB       170s   57MB       104s   53MB
TIFF specification      20s    2.6MB      50s    2.6MB      77s    2.7MB      70s    10MB

Conclusions

There are a few things we can learn from this:

  • Rendering at high-resolution color is an expensive operation! Quite apart from the extra memory allocation, the jump from 200 to 300dpi means 2.25 times as many pixels to manipulate (1.5 times as many in each direction).
  • For some documents, it's actually quicker to render the PDF to a bitmap twice than it is to render it once. This only makes sense if you consider that the color image is a much larger object to compress, and that compression is slow. So reducing your filesize may also speed things up, which is a definite bonus.
  • If you're expecting lots of color documents, option 2 is a good middle ground. But if the bulk of your documents are simply black & white then option 1 is the winner.

The full source code we used to develop this article is available to download here.