Converting PDF to TIFF, PNG and JPEG

Smaller, faster, better: The importance of choosing correctly

We're asked some questions more than others at BFO, and one of the most common concerns conversion of a PDF to a bitmap image format - typically TIFF, but sometimes JPEG or another format. This process is called "rasterization" and while it's very easy to do with the "extended plus viewer" version of our PDF Library, it's worth going over in more detail.

Summary for the impatient

How fast your rasterization runs depends on the resolution of the output bitmap, the compression used and the complexity of the source document, in decreasing order of significance. For text, typically a 200dpi 1-bit TIFF image is the way to go. JPEG is the wrong answer.

In order to get small files and fast, acceptable results you need to decide three things: bit depth, format and resolution. File size is dependent on these factors, and speed is dependent on these factors and the contents of the page you're rasterizing: a page with large images in unusual ColorSpaces will slow things down, even if you're rendering to a thumbnail size output.

1. Bit Depth

This is the number of bits required to represent each pixel. Values typically range from 1 up to 8, then 24 and 32. 1bit gives you black and white only, with values up to 8bit giving up to 256 unique colors. 24bit is RGB and 32bit is CMYK (and you can add an optional alpha (transparency) channel to both of these for 32bit or 40bit images, although we won't cover that here). In Java, the bit-depth is determined by the java.awt.image.ColorModel.
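
If you're not sure what bit depth a particular ColorModel implies, you can ask it directly with getPixelSize(). The sketch below uses only the standard BufferedImage types from the JDK rather than the PDF Library's own ColorModel constants, and the class and method names are ours, purely for illustration.

```java
import java.awt.image.BufferedImage;

class BitDepthDemo {
    /** Returns the bits per pixel reported by the ColorModel of a standard BufferedImage type. */
    static int bitsPerPixel(int bufferedImageType) {
        // A 1x1 image is enough: we only want to inspect its ColorModel
        BufferedImage img = new BufferedImage(1, 1, bufferedImageType);
        return img.getColorModel().getPixelSize();
    }

    public static void main(String[] args) {
        System.out.println("TYPE_BYTE_BINARY: " + bitsPerPixel(BufferedImage.TYPE_BYTE_BINARY));
        System.out.println("TYPE_BYTE_GRAY:   " + bitsPerPixel(BufferedImage.TYPE_BYTE_GRAY));
        System.out.println("TYPE_INT_RGB:     " + bitsPerPixel(BufferedImage.TYPE_INT_RGB));
        System.out.println("TYPE_INT_ARGB:    " + bitsPerPixel(BufferedImage.TYPE_INT_ARGB));
    }
}
```

The same call works on any ColorModel, including ones you build yourself.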

2. Format

The format affects the compression used. Broadly your choices are:

  • JPEG - a lossy format designed for compressing photographs. It's typically 24bit RGB (8bit and 32bit are possible but poorly supported by Java). JPEG is a bad choice for anything other than photographs as it adds noise to the image. Here's a JPEG conversion of a PDF containing text, and a blown-up version of the same image (with the contrast increased a bit to better illustrate the point):

    Fig 1. JPEG Image (7.1KB)

    Fig 2. Zoom of JPEG
  • PNG - for text and lineart destined for the web, PNG is an excellent choice. Images can be saved as 24bit RGB or as indexed images of 8bit or less.
    Fig 3. 24bit PNG (7.5KB)
    24bit PNG is the best choice for lossless reproduction of RGB content, but the file sizes can be large.
    Fig 4. 8bit PNG (4.5KB)
    8bit PNG can reproduce 256 colors and is very similar to GIF but with better compression. For grayscale images it's lossless.
    Fig 5. 4bit PNG (2.5KB)
    4bit PNG can reproduce 16 colors, which is still plenty for a Grayscale image as you can see here. Other bit depths are possible too.
  • TIFF - almost always the best choice unless you're viewing your images on the web, where it's poorly supported. TIFF is a container format and can contain multiple images with different compression types. The BFO PDF Library can create 1bit images compressed with CCITT Group 4 compression, 24bit RGB with LZW compression (similar to PNG) or 32bit CMYK with LZW compression. These are the best options available in baseline TIFF for those bit-depths.
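
To see the difference between lossy and lossless formats for yourself, one rough approach is to encode a piece of sharp-edged line art, decode it again and count how many pixels changed. This is only a sketch using the standard javax.imageio package; the class name, the test pattern and the helper methods are our own invention and have nothing to do with the PDF Library.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

class LossyVersusLossless {
    /** Draws thin black lines on white: hard edges, similar in character to text. */
    static BufferedImage lineArt(int size) {
        BufferedImage img = new BufferedImage(size, size, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = img.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, size, size);
        g.setColor(Color.BLACK);
        for (int x = 0; x < size; x += 7) {
            g.drawLine(x, 0, size - x, size);
        }
        g.dispose();
        return img;
    }

    /** Encodes and decodes the image in the given format, then counts pixels whose RGB value changed. */
    static int changedPixels(BufferedImage src, String format) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            if (!ImageIO.write(src, format, out)) {
                throw new IOException("No ImageIO writer for " + format);
            }
            BufferedImage back = ImageIO.read(new ByteArrayInputStream(out.toByteArray()));
            int changed = 0;
            for (int y = 0; y < src.getHeight(); y++) {
                for (int x = 0; x < src.getWidth(); x++) {
                    // Compare the color channels only, ignoring alpha
                    if ((src.getRGB(x, y) & 0xFFFFFF) != (back.getRGB(x, y) & 0xFFFFFF)) {
                        changed++;
                    }
                }
            }
            return changed;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        BufferedImage img = lineArt(64);
        System.out.println("PNG changed pixels:  " + changedPixels(img, "png"));
        System.out.println("JPEG changed pixels: " + changedPixels(img, "jpeg"));
    }
}
```

The PNG round trip should leave every pixel untouched. With the default JPEG settings you would normally expect a non-zero count for an image like this, which is the noise visible in Fig 2.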

3. Resolution

The number of dots-per-inch of the output image is arguably the single most important factor that determines the size of the bitmap and the time required to create it. For 1bit images, 200dpi (fax resolution) is typical, and 300dpi gives good results when printed and is usually enough. Lower resolutions may be appropriate for grayscale or color documents, but while they may look OK on screen they'll look blocky when printed.
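
To get a feel for how resolution and bit depth drive the size of the bitmap, here's a small back-of-the-envelope calculation. It assumes page dimensions given in PDF points (72 to the inch) and rows padded to a whole number of bytes; the class and method names are ours, and the figures are for the uncompressed bitmap, before any TIFF, PNG or JPEG compression is applied.

```java
class RasterSize {
    /** Converts a length in PDF points (1/72 inch) to pixels at the given resolution. */
    static int pixels(double points, int dpi) {
        return (int) Math.round(points * dpi / 72.0);
    }

    /** Uncompressed size in bytes, with each row padded to a whole number of bytes. */
    static long rawBytes(int widthPx, int heightPx, int bitsPerPixel) {
        long bytesPerRow = ((long) widthPx * bitsPerPixel + 7) / 8;
        return bytesPerRow * heightPx;
    }

    public static void main(String[] args) {
        // US Letter is 612 x 792 points
        int w = pixels(612, 200);
        int h = pixels(792, 200);
        System.out.println(w + " x " + h + " pixels");
        System.out.println("1bit:  " + rawBytes(w, h, 1) + " bytes");
        System.out.println("24bit: " + rawBytes(w, h, 24) + " bytes");
    }
}
```

A US Letter page at 200dpi is 1700 x 2200 pixels, which is 468,600 bytes at 1bit and 11,220,000 bytes at 24bit before compression. Doubling the resolution roughly quadruples both numbers.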

How to convert a PDF to a TIFF image

If you chose TIFF for your image format, creating the TIFF file is easy:

PDFParser parser = new PDFParser(pdf);
ColorModel cm = PDFParser.BLACKANDWHITE;
int dpi = 200;
OutputStream out = new FileOutputStream("out.tif");
parser.writeAsTIFF(out, cm, dpi);
out.close();
Example 1. Creating a multi-page TIFF image

This will create a 200dpi 1bit CCITT TIFF image, which is pretty typical for black and white documents. A multi-page TIFF will be created if the PDF has multiple pages, but if you'd prefer to create a single TIFF image for each page:

PDFParser parser = new PDFParser(pdf);
ColorModel cm = PDFParser.BLACKANDWHITE;
int dpi = 200;
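List copy = new ArrayList(pdf.getPages());   // remember the original pages
for (int i=0;i<copy.size();i++) {
    // Leave only page i in the PDF, then write it out as its own TIFF
    pdf.getPages().clear();
    pdf.getPages().add(copy.get(i));
    OutputStream out = new FileOutputStream("page"+i+".tif");
    parser.writeAsTIFF(out, cm, dpi);
    out.close();
}
// Put the original pages back so the PDF can be reused afterwards
pdf.getPages().clear();
pdf.getPages().addAll(copy);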
Example 2. Creating single-page TIFF images

The ColorModel determines the type of image created - the PDFParser class has models for RGB, CMYK and two ways of creating a 1bit image: BLACKANDWHITE or getBlackAndWhiteColorModel(). Which of these last two is fastest is theoretically dependent on the JVM and operating system:

Environment             BLACKANDWHITE   getBlackAndWhiteColorModel
OS X/Apple Java 1.5     10s             25s
OS X/Apple Java 1.6     10s             24s
Linux/Sun Java 1.5      5s              25s
Linux/Sun Java 1.6      5s              20s
Windows/Sun Java 1.4    6s              27s
Windows/Sun Java 1.5    6s              24s
Windows/Sun Java 1.6    5s              21s
Table 1. Time taken for conversion of 24 pages to TIFF at 200dpi on a 2GHz machine

I say "theoretically" because the above table looks pretty clear cut! I'm convinced we got different results last time we benchmarked this. Do check which is fastest on your system yourself.
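
If you do want to measure this yourself, a very simple timing harness is enough to show a 2x to 5x difference like the one above. The sketch below is generic Java and makes no assumptions about the PDF Library: the ConversionTimer class is hypothetical, and you would replace the placeholder workload with your own conversion code.

```java
class ConversionTimer {
    /** Runs the task the given number of times and returns the best wall-clock time in milliseconds. */
    static long bestMillis(Runnable task, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be at least 1");
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            long elapsed = (System.nanoTime() - start) / 1_000_000;
            best = Math.min(best, elapsed);
        }
        return best;
    }

    public static void main(String[] args) {
        // Placeholder workload: substitute your own rasterization call here
        Runnable work = () -> {
            double sum = 0;
            for (int i = 1; i < 5_000_000; i++) {
                sum += Math.sqrt(i);
            }
            if (sum < 0) {
                System.out.println(sum);   // never true; keeps the loop from being optimized away
            }
        };
        System.out.println("Best of 5 runs: " + bestMillis(work, 5) + "ms");
    }
}
```

Taking the best of several runs reduces the effect of JIT warm-up and other activity on the machine, although for a serious benchmark a dedicated tool would be a better choice.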

How to create a PNG or JPEG image?

If you want to create a bitmap other than a TIFF, you need to use the javax.imageio package added in Java 1.4 (there are other approaches, but this is the standard one). Broadly it goes as follows:

PDFParser parser = new PDFParser(pdf);
ColorModel cm = PDFParser.RGB;
int dpi = 200;
for (int i=0;i<pdf.getNumberOfPages();i++) {
    PagePainter painter = parser.getPagePainter(i);
    BufferedImage image = painter.getImage(dpi, cm);
    ImageIO.write(image, "PNG", new File("page"+i+".png"));
}
Example 3. Creating PNG images for each page

The ColorModel you choose determines what sort of output you get. The above example will create a 24bit PNG image, but if you wanted to create the image as an 8bit or 4bit grayscale PNG you could use one of the following instead:

// 8-bit grayscale index
byte[] v = new byte[256];
for (int i=0;i<256;i++) v[i] = (byte)i;
ColorModel cm = new IndexColorModel(8, 256, v, v, v);

// 4-bit grayscale index (use this instead of the 8-bit version above)
byte[] v = new byte[16];
for (int i=0;i<16;i++) v[i] = (byte)(i * 17);   // 16 even steps: 0, 17, 34 ... 255
ColorModel cm = new IndexColorModel(4, 16, v, v, v);
Example 4. ColorModels to create index PNG images
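
The same idea can be wrapped in a small helper that builds a grayscale palette for any bit depth, which also makes it easy to check that the palette runs all the way from black to white. This is a sketch using only the standard java.awt.image.IndexColorModel class; the GrayPalettes class is our own name.

```java
import java.awt.image.IndexColorModel;

class GrayPalettes {
    /** Builds an indexed grayscale ColorModel with 2^bits evenly spaced levels from black to white. */
    static IndexColorModel gray(int bits) {
        if (bits < 1 || bits > 8) {
            throw new IllegalArgumentException("bits must be between 1 and 8");
        }
        int size = 1 << bits;
        byte[] v = new byte[size];
        for (int i = 0; i < size; i++) {
            // Scale the index so the last entry is always 255 (pure white)
            v[i] = (byte) (i * 255 / (size - 1));
        }
        return new IndexColorModel(bits, size, v, v, v);
    }

    public static void main(String[] args) {
        IndexColorModel cm = gray(4);
        for (int i = 0; i < cm.getMapSize(); i++) {
            System.out.println("index " + i + " -> gray level " + cm.getRed(i));
        }
    }
}
```

For 4 bits the levels step by 17, and for 8 bits each index maps to the gray level of the same value.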

You can use the same approach as above but specify "JPEG" as the format to ImageIO.write to create JPEG images, and other formats are possible too. If you want to use an indexed PNG image that's not grayscale you can either choose the colors yourself in advance, or you can "quantize" the image down from 24bit to 8bit or less. This requires an external package such as the one supplied by GIF4J or the public domain ImageJ projects (here's the source code for their quantizer).
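
As an illustration of the "choose the colors yourself" route, the sketch below redraws a 24bit image into an indexed image with a hand-picked four-color palette and saves it as a PNG, using only standard JDK classes. The palette and the FixedPalettePng class are made up for this example. Java 2D maps each pixel to a nearby palette entry, so this only gives good results when the palette suits the content; it's no substitute for a real quantizer.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.IndexColorModel;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

class FixedPalettePng {
    /** A hand-picked palette of black, red, blue and white, stored as a 2-bit index. */
    static IndexColorModel fourColors() {
        byte[] r = {0, (byte) 255, 0, (byte) 255};
        byte[] g = {0, 0, 0, (byte) 255};
        byte[] b = {0, 0, (byte) 255, (byte) 255};
        return new IndexColorModel(2, 4, r, g, b);
    }

    /** Redraws src into an indexed image. Expects a palette of 1, 2, 4 or 8 bits. */
    static BufferedImage toIndexed(BufferedImage src, IndexColorModel palette) {
        // Palettes under 8 bits use a packed raster, 8-bit palettes use one byte per pixel
        int type = palette.getPixelSize() < 8
                ? BufferedImage.TYPE_BYTE_BINARY
                : BufferedImage.TYPE_BYTE_INDEXED;
        BufferedImage dst = new BufferedImage(src.getWidth(), src.getHeight(), type, palette);
        Graphics2D g = dst.createGraphics();
        g.drawImage(src, 0, 0, null);
        g.dispose();
        return dst;
    }

    /** Encodes the image as PNG and returns the file contents. */
    static byte[] toPng(BufferedImage img) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            if (!ImageIO.write(img, "PNG", out)) {
                throw new IOException("No PNG writer available");
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

To use it with the loop in Example 3, you would pass the BufferedImage for each page through toIndexed before writing it out.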