When only a raster will do, how to do it efficiently
Converting between PDF and bitmaps is a topic that we just keep coming back to. In this article we're going to look at a pretty common scenario - converting a PDF to a rasterized PDF, which is nothing more complicated than a PDF that contains nothing but images.
Why would you want to do this? Well, typically it's when you're converting a PDF to PDF/A or PDF/X, but the PDF has something about it that you can't fix - transparency perhaps, or unembedded fonts. Converting the page to a bitmap image of the page is the least-bad option here (the best option, of course, is to start with a PDF constructed with this in mind, which we will leave as an exercise for the reader).
The trivially correct approach
When faced with this problem, most people start by converting the PDF to TIFF, then
convert the TIFF back to PDF. You can do this in about 10 lines of code and it will
work, but it has a few problems - most obviously, it performs needless work in writing
the TIFF structure, and potentially requires compression then recompression of the
image (although this usually won't be the case if you use our PDFParser.writeAsTIFF
method).
You can work around this by doing something like the following:
    PDFParser parser = new PDFParser(oldpdf);
    PDF newpdf = new PDF();
    for (int i=0;i<oldpdf.getNumberOfPages();i++) {
        PDFPage oldpage = oldpdf.getPage(i);
        // Rasterize the page at 200dpi in RGB
        PagePainter painter = parser.getPagePainter(oldpage);
        BufferedImage image = painter.getImage(200, PDFParser.RGB);
        // Draw the bitmap so it fills a new page of the same size
        PDFImage pdfimage = new PDFImage(image);
        PDFPage newpage = newpdf.newPage((int)oldpage.getWidth(), (int)oldpage.getHeight());
        newpage.drawImage(pdfimage, 0, 0, oldpage.getWidth(), oldpage.getHeight());
    }

which will iterate over the list of pages in the PDF, convert each one to a BufferedImage, then create a PDFImage from that and draw that onto a new page in a new PDF.
The parameters 200 and PDFParser.RGB to the getImage method are the resolution (dots per inch) and Color Model to use respectively, and this leads us to the problem with this approach.
Why is my file so much larger?
If you're not sure what's in your PDF you'll probably render in color to be sure you don't lose any content, and at 200dpi (which is usually enough in color) that's a lot of pixel data! If your original file contained just black text on a white background, you could easily see an increase in filesize from a few KB per page to several hundred KB.
Multiply this by several thousand documents and you have a problem.
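To put rough numbers on the raw pixel data involved, here is a small sketch that computes the uncompressed size of a page raster. The class and method names are hypothetical and the US Letter page size is an assumption for illustration; the figures are before any compression, so the size actually stored in the PDF will be smaller.

```java
final class RasterSize {
    /** Uncompressed size in bytes of a page raster, with each row padded to a whole byte. */
    static long rasterBytes(double widthInches, double heightInches, int dpi, int bitsPerPixel) {
        long w = Math.round(widthInches * dpi);
        long h = Math.round(heightInches * dpi);
        long bytesPerRow = (w * bitsPerPixel + 7) / 8;
        return bytesPerRow * h;
    }

    public static void main(String[] args) {
        // Assumed page size: US Letter, 8.5 x 11 inches
        System.out.println("200dpi RGB:   " + rasterBytes(8.5, 11, 200, 24)); // 11220000
        System.out.println("200dpi gray:  " + rasterBytes(8.5, 11, 200, 8));  // 3740000
        System.out.println("300dpi 1-bit: " + rasterBytes(8.5, 11, 300, 1));  // 1052700
    }
}
```

On those assumptions a 200dpi RGB page is about ten times the raw size of the same page in 1-bit at 300dpi, which is why it's worth knowing which pages really need color.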
Ideally you'd save only pages with color content in color, and everything else in black and white, but how do you know which page has what?
The 2.13.2 release of the PDF Library adds support for this via some new OutputProfile features that identify color, grayscale or black and white content on a page. You can find out which features apply to a particular page by rendering just that page, and then checking the OutputProfile after rendering.
Here's a modified version of the above code with this functionality added:
    PDFParser parser = new PDFParser(oldpdf);
    PDF newpdf = new PDF();
    for (int i=0;i<oldpdf.getNumberOfPages();i++) {
        PDFPage oldpage = oldpdf.getPage(i);
        // Use a fresh profile for each page, so the features set after rendering apply to this page only
        OutputProfile profile = new OutputProfile(OutputProfile.Default);
        parser.setOutputProfile(profile);
        PagePainter painter = parser.getPagePainter(oldpage);
        BufferedImage image = painter.getImage(200, PDFParser.RGB);
        if (profile.isSet(OutputProfile.Feature.ColorImage) || profile.isSet(OutputProfile.Feature.ColorContent)) {
            // Page had color content, do something
        } else if (profile.isSet(OutputProfile.Feature.GrayscaleImage) || profile.isSet(OutputProfile.Feature.GrayscaleContent)) {
            // Page had grayscale content, do something
        } else {
            // Page had only pure black and white content, do something.
        }
        PDFImage pdfimage = new PDFImage(image);
        PDFPage newpage = newpdf.newPage((int)oldpage.getWidth(), (int)oldpage.getHeight());
        newpage.drawImage(pdfimage, 0, 0, oldpage.getWidth(), oldpage.getHeight());
    }
Processing the images
This shows how to differentiate a page with color content from one with grayscale or black & white. From there, one option is simply to convert the color bitmap to 1-bit for black & white, but this leads to fidelity problems: what is an acceptable resolution for color might not be good enough for 1-bit. For example, 200dpi gives a very good quality color image, but at 1-bit it's fax resolution, which can be a bit blocky.
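Let's assume this is a concern, and that your chosen resolution for color and grayscale is 200dpi and for black & white is 300dpi. This gives the following options:
- Option 1: rasterize at 300dpi in black & white; if the page turns out to contain color or grayscale content, rasterize it again at 200dpi in the appropriate color model.
- Option 2: rasterize at 200dpi in color; if the page is grayscale, convert the bitmap to 8-bit, and if it is black & white, rasterize it again at 300dpi in 1-bit.
- Option 3: rasterize at 300dpi in color, then downsample to 200dpi for color or grayscale pages, or convert to 1-bit at 300dpi for black & white pages.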
Which of these is best is hard to say: the first option will rasterize twice if it's color or grayscale, and the second option will rasterize twice if it's black & white. The last option requires a 300dpi color image, which will require the most memory; however it only rasterizes the PDF once.
Which is quicker is clearly going to depend on how long it takes to rasterize, and that is entirely dependent on the source document: if your PDF is mostly text, rasterizing will be quick, but if it contains large images or color-spaces other than RGB then it will be much slower.
We need some numbers. Time to benchmark.
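A minimal sketch of the kind of timing wrapper that can be used for this sort of comparison follows. The Stopwatch and Timed names are hypothetical and not part of the PDF Library; in a real run the task passed in would be the whole conversion for one document, and it's worth running each option once beforehand so JIT warm-up doesn't skew the first measurement.

```java
import java.util.function.Supplier;

final class Stopwatch {
    /** Holds the value a task returned and how long it took, in milliseconds. */
    static final class Timed<T> {
        final T result;
        final long millis;
        Timed(T result, long millis) {
            this.result = result;
            this.millis = millis;
        }
    }

    /** Run the task once and measure wall-clock time with System.nanoTime. */
    static <T> Timed<T> time(Supplier<T> task) {
        long start = System.nanoTime();
        T result = task.get();
        long elapsed = System.nanoTime() - start;
        return new Timed<>(result, elapsed / 1_000_000);
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for "convert one document"
        Timed<Integer> t = time(() -> {
            int sum = 0;
            for (int i = 0; i < 1000; i++) {
                sum += i;
            }
            return sum;
        });
        System.out.println("result " + t.result + " in " + t.millis + "ms");
    }
}
```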
Implementations
The complete code for this article can be downloaded at this link.
For the first option, we'll change the block in the above example that renders the image and tests the profile to this:

    // Option 1: render in 1-bit at 300dpi first, re-render only if the page needs more
    BufferedImage image = painter.getImage(300, PDFParser.BLACKANDWHITE);
    if (profile.isSet(OutputProfile.Feature.ColorImage) || profile.isSet(OutputProfile.Feature.ColorContent)) {
        image = painter.getImage(200, PDFParser.RGB);
    } else if (profile.isSet(OutputProfile.Feature.GrayscaleImage) || profile.isSet(OutputProfile.Feature.GrayscaleContent)) {
        image = painter.getImage(200, PDFParser.GRAYSCALE);
    } else {
        // No change required
    }

For option two, here's the block:

    // Option 2: render in color at 200dpi first, re-render only for black & white
    BufferedImage image = painter.getImage(200, PDFParser.RGB);
    if (profile.isSet(OutputProfile.Feature.ColorImage) || profile.isSet(OutputProfile.Feature.ColorContent)) {
        // No change required
    } else if (profile.isSet(OutputProfile.Feature.GrayscaleImage) || profile.isSet(OutputProfile.Feature.GrayscaleContent)) {
        image = fixImage(image, 200, 200, 8);     // same resolution, reduce to 8-bit
    } else {
        image = painter.getImage(300, PDFParser.BLACKANDWHITE);
    }

And here's the code for option 3. The resampling code is detailed below:

    // Option 3: render once in color at 300dpi, then downsample and/or reduce the bit depth
    BufferedImage image = painter.getImage(300, PDFParser.RGB);
    if (profile.isSet(OutputProfile.Feature.ColorImage) || profile.isSet(OutputProfile.Feature.ColorContent)) {
        image = fixImage(image, 300, 200, 24);    // downsample only
    } else if (profile.isSet(OutputProfile.Feature.GrayscaleImage) || profile.isSet(OutputProfile.Feature.GrayscaleContent)) {
        image = fixImage(image, 300, 200, 8);     // downsample and reduce to 8-bit
    } else {
        image = fixImage(image, 300, 300, 1);     // keep 300dpi, reduce to 1-bit
    }
Options 2 and 3 require code to downsample the image from 300 to 200dpi, and possibly convert from 24-bit color to 8-bit grayscale or 1-bit black & white.
The exact implementation isn't important for our purposes - the java.awt.image package has various capable but convoluted methods for this, but our experience has led us to believe it's quicker and easier to write your own code, especially if you know the input format of your image. For completeness, here's our method for downsampling and converting from 24 to 8 or 1-bit.
    import java.awt.geom.AffineTransform;
    import java.awt.image.AffineTransformOp;
    import java.awt.image.BufferedImage;
    import java.awt.image.ColorModel;
    import java.awt.image.DataBufferByte;
    import java.awt.image.WritableRaster;

    /**
     * Given a 24-bit RGB image, optionally resize it and/or reduce the number of bits per pixel to 8 or 1
     * @param image the image to be resized - must be 24-bit RGB with byte-based Raster (DataBufferByte)
     * @param indpi the DPI the image is currently in
     * @param outdpi the DPI to convert the image to
     * @param bpp the number of bits per pixel - 24 for RGB, 8 for Grayscale or 1 for B&W
     * @return the modified image
     */
    private static BufferedImage fixImage(BufferedImage image, int indpi, int outdpi, int bpp) {
        // Assuming a DataBufferByte for both input and output image - this will always be
        // the case in this example, but some ColorModels (eg ColorModel.getRGBdefault) use
        // a DataBufferInt. So this is not a general purpose routine, but as used here this
        // assumption makes the code a lot simpler and more efficient.
        ColorModel cm = image.getColorModel();
        WritableRaster raster = image.getRaster();
        int w = raster.getWidth();
        int h = raster.getHeight();

        if (indpi != outdpi) {
            double scale = (double)outdpi / indpi;
            w *= scale;
            h *= scale;
            WritableRaster outraster = cm.createCompatibleWritableRaster(w, h);
            AffineTransform tran = AffineTransform.getScaleInstance(scale, scale);
            // This is typically hardware accelerated (or at least native code)
            // so can't be beat for performance.
            AffineTransformOp scaler = new AffineTransformOp(tran, AffineTransformOp.TYPE_BILINEAR);
            scaler.filter(raster, outraster);
            raster = outraster;
        }

        if (bpp == 8) {
            // Remove color information but keep 1 pixel per byte - easy.
            cm = PDFParser.GRAYSCALE;
            WritableRaster outraster = cm.createCompatibleWritableRaster(w, h);
            byte[] in = ((DataBufferByte)raster.getDataBuffer()).getData();
            byte[] out = ((DataBufferByte)outraster.getDataBuffer()).getData();
            int i = 0, j = 0;
            while (i < in.length) {
                int rgb = ((in[i++]&0xFF) << 16) | ((in[i++]&0xFF) << 8) | (in[i++]&0xFF);
                // This is a fast way of converting RGB to Grayscale. It's an integer
                // based version of the standard PAL/NTSC grayscale formula:
                // gray = 0.3red + 0.59green + 0.11blue
                int gray = rgb==0xFFFFFF ? 255 : ((((rgb&0xFF0000)/850) + (((rgb<<8)&0xFF0000)/432) + ((rgb<<16)&0xFF0000)/2318)) >> 8;
                out[j++] = (byte)gray;
            }
            raster = outraster;
        } else if (bpp == 1) {
            // Remove color information and reduce to 1 bit per pixel, or 8 pixels per byte.
            // Still fairly simple, we just need to byte align each row.
            cm = PDFParser.BLACKANDWHITE;
            WritableRaster outraster = cm.createCompatibleWritableRaster(w, h);
            byte[] in = ((DataBufferByte)raster.getDataBuffer()).getData();
            byte[] out = ((DataBufferByte)outraster.getDataBuffer()).getData();
            int i = 0, j = 0;
            for (int y=0;y<h;y++) {
                int x = 0, n = 0;
                for (x=0;x<w;x++) {
                    int rgb = ((in[i++]&0xFF) << 16) | ((in[i++]&0xFF) << 8) | (in[i++]&0xFF);
                    int gray = rgb==0xFFFFFF ? 255 : ((((rgb&0xFF0000)/850) + (((rgb<<8)&0xFF0000)/432) + ((rgb<<16)&0xFF0000)/2318)) >> 8;
                    n <<= 1;
                    if (gray > 128) {       // 128 is normal threshold, but you can adjust
                        n |= 1;
                    }
                    if ((x&7) == 7) {       // Finished 8 pixel block - push it to output
                        out[j++] = (byte)n;
                        n = 0;
                    }
                }
                // After the loop x == w, so (x&7) is the number of pixels left in a partial byte
                if ((x&7) != 0) {           // Image isn't multiple of 8 wide - shift and push it to output
                    n <<= 8 - (x&7);
                    out[j++] = (byte)n;
                }
            }
            raster = outraster;
        } else if (bpp != 24) {
            throw new IllegalArgumentException("bpp must be 24, 8 or 1");
        }

        if (raster != image.getRaster()) {
            image = new BufferedImage(cm, raster, false, null);
        }
        return image;
    }
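For comparison, here is a rough sketch of one route through the standard library alone, using Graphics2D to scale and convert in a single step. This is not the method used for the benchmarks in this article: the AwtConvert class and its convert method are hypothetical names, and the output won't necessarily be pixel-identical to fixImage, since Java2D applies its own color conversion and its own mapping to 1-bit.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

final class AwtConvert {
    /**
     * Scale an image from indpi to outdpi and convert it to the given BufferedImage type,
     * eg TYPE_3BYTE_BGR (24-bit), TYPE_BYTE_GRAY (8-bit) or TYPE_BYTE_BINARY (1-bit).
     */
    static BufferedImage convert(BufferedImage in, int indpi, int outdpi, int type) {
        int w = (int)((long)in.getWidth() * outdpi / indpi);
        int h = (int)((long)in.getHeight() * outdpi / indpi);
        BufferedImage out = new BufferedImage(w, h, type);
        Graphics2D g = out.createGraphics();
        try {
            // Bilinear interpolation, to match the AffineTransformOp used in fixImage
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                               RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            // Drawing into the target image does the scaling and the color conversion together
            g.drawImage(in, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return out;
    }

    public static void main(String[] args) {
        BufferedImage src = new BufferedImage(300, 150, BufferedImage.TYPE_3BYTE_BGR);
        BufferedImage gray = convert(src, 300, 200, BufferedImage.TYPE_BYTE_GRAY);
        System.out.println(gray.getWidth() + "x" + gray.getHeight());
    }
}
```

The trade-off is the one described above: this is short, but you give up control over the grayscale formula and the black & white threshold.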
Results
As we said above, the exact results are going to depend on your content, so we're going to run a few tests on various different documents:
- The first 50 pages of Adobe JavaScript Reference: almost all text, with a mix of pages in color, grayscale and black & white
- The first edition of The MagPi magazine - 32 pages, all color, lots of images, expensive to render
- The first 50 pages of the TIFF specification: black and white text except for a color annotation on page 1 and a grayscale image on page 14
We're also going to compare the results to the "default" approach, which is 200dpi RGB as described at the top of this article.
File | Option 1 time | Option 1 filesize | Option 2 time | Option 2 filesize | Option 3 time | Option 3 filesize | Default time | Default filesize |
JavaScript Reference | 44s | 5.7MB | 50s | 5.7MB | 99s | 8MB | 70s | 10MB |
MagPi magazine | 196s | 53MB | 105s | 53MB | 170s | 57MB | 104s | 53MB |
TIFF specification | 20s | 2.6MB | 50s | 2.6MB | 77s | 2.7MB | 70s | 10MB |
Conclusions
There's a few things we can learn from this:
The full source code we used to develop this article is available to download here.