Smaller, faster, better: The importance of choosing correctly
We're asked some questions more than others at BFO, and one of the most common concerns conversion of a PDF to a bitmap image format - typically TIFF, but sometimes JPEG or another format. This process is called "rasterization" and while it's very easy to do with the "extended plus viewer" version of our PDF Library, it's worth going over in more detail.
Summary for the impatient
How fast your rasterization runs depends on the resolution of the output bitmap, the compression used and the complexity of the source document, in decreasing order of significance. For text, typically a 200dpi 1-bit TIFF image is the way to go. JPEG is the wrong answer.
In order to get small files and fast, acceptable results you need to decide three things: bit depth, format and resolution. File size is dependent on these factors, and speed is dependent on these factors and the contents of the page you're rasterizing: a page with large images in unusual ColorSpaces will slow things down, even if you're rendering to a thumbnail size output.
1. Bit Depth
This is the number of bits required to represent each pixel. Values typically range from 1 up to 8, then 24 and 32. 1bit gives you black and white only, with values up to 8bit giving up to 256 unique colors. 24bit is RGB and 32bit is CMYK (and you can add an optional alpha (transparency) channel to both of these for 32bit or 40bit images, although we won't cover that here). In Java, the bit-depth is determined by the java.awt.image.ColorModel.
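To make the link between bit depth and the ColorModel concrete, here is a minimal sketch using only the standard java.awt.image classes (nothing from the BFO library). The class name BitDepthDemo and its helper method are made up for illustration. It asks the ColorModel of a few standard BufferedImage types how many bits each pixel takes.

```java
import java.awt.image.BufferedImage;
import java.awt.image.ColorModel;

final class BitDepthDemo {
    /** Returns the bits per pixel reported by the ColorModel of a standard BufferedImage type. */
    static int bitsPerPixel(int bufferedImageType) {
        BufferedImage image = new BufferedImage(1, 1, bufferedImageType);
        ColorModel cm = image.getColorModel();
        return cm.getPixelSize();
    }

    public static void main(String[] args) {
        // 1bit black and white
        System.out.println("TYPE_BYTE_BINARY: " + bitsPerPixel(BufferedImage.TYPE_BYTE_BINARY));
        // 8bit grayscale
        System.out.println("TYPE_BYTE_GRAY:   " + bitsPerPixel(BufferedImage.TYPE_BYTE_GRAY));
        // 24bit RGB
        System.out.println("TYPE_3BYTE_BGR:   " + bitsPerPixel(BufferedImage.TYPE_3BYTE_BGR));
        // 32bit RGB plus alpha
        System.out.println("TYPE_INT_ARGB:    " + bitsPerPixel(BufferedImage.TYPE_INT_ARGB));
    }
}
```

The four lines report 1, 8, 24 and 32 bits respectively.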
2. Format
The format affects the compression used. Broadly your choices are:
- JPEG: a lossy format designed for compressing photographs. It's typically 24bit RGB (8bit and 32bit are possible but poorly supported by Java). JPEG is a bad choice for anything other than photographs because it adds noise to the image; this is easy to see in a JPEG conversion of a PDF containing text, especially when the image is blown up and the contrast increased a bit.
- PNG: for text and line art destined for the web, PNG is an excellent choice. Images can be saved as 24bit RGB or as indexed images of 8bit or less. 24bit PNG is the best choice for lossless reproduction of RGB content, but the file sizes can be large. 8bit PNG can reproduce 256 colors and is very similar to GIF but with better compression. For grayscale images it's lossless. 4bit PNG can reproduce 16 colors, which is still plenty for a grayscale image. Other bit depths are possible too.
- TIFF: almost always the best choice unless you're viewing your images on the web, where it's poorly supported. TIFF is a container format and can contain multiple images with different compression types. The BFO PDF Library can create 1bit images compressed with CCITT Group 4 compression, 24bit RGB with LZW compression (similar to PNG) or 32bit CMYK with LZW compression. These are the best options available in baseline TIFF for those bit depths.
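The noise JPEG adds to hard-edged content can be checked without a PDF at all. The sketch below is an illustration using only javax.imageio and java.awt; the class JpegNoiseDemo, the test image and its dimensions are all made up. It draws a few black shapes on a white background as a stand-in for text, saves the image as PNG and as JPEG, reads each back and counts the distinct colors.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;
import javax.imageio.ImageIO;

final class JpegNoiseDemo {
    /** A white 64x64 RGB image with hard-edged black shapes, standing in for text. */
    static BufferedImage sample() {
        BufferedImage img = new BufferedImage(64, 64, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = img.createGraphics();
        g.setColor(Color.WHITE);
        g.fillRect(0, 0, 64, 64);
        g.setColor(Color.BLACK);
        g.fillRect(3, 5, 13, 9);
        g.fillRect(21, 30, 3, 27);
        g.drawLine(5, 60, 60, 20); // antialiasing is off by default, so still pure black
        g.dispose();
        return img;
    }

    /** Writes the image in the given ImageIO format, reads it back and counts distinct colors. */
    static int colorsAfterRoundTrip(BufferedImage img, String format) {
        try {
            ImageIO.setUseCache(false); // keep everything in memory
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            if (!ImageIO.write(img, format, out)) {
                throw new IOException("No ImageIO writer for " + format);
            }
            BufferedImage back = ImageIO.read(new ByteArrayInputStream(out.toByteArray()));
            Set<Integer> colors = new HashSet<>();
            for (int y = 0; y < back.getHeight(); y++) {
                for (int x = 0; x < back.getWidth(); x++) {
                    colors.add(back.getRGB(x, y) & 0xFFFFFF);
                }
            }
            return colors.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        BufferedImage img = sample();
        System.out.println("PNG colors:  " + colorsAfterRoundTrip(img, "png"));
        System.out.println("JPEG colors: " + colorsAfterRoundTrip(img, "jpeg"));
    }
}
```

The PNG copy comes back with exactly the two colors that went in. The JPEG copy comes back with more, because the compression smears intermediate shades around every edge.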
3. Resolution
The number of dots-per-inch of the output image is arguably the single most important factor that determines the size of the bitmap and the time required to create it. For 1bit images, 200dpi (fax resolution) is typical, and 300dpi gives good results when printed and is usually enough. Lower resolutions may be appropriate for grayscale or color documents, but while they may look OK on screen they'll look blocky when printed.
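A quick back-of-the-envelope calculation shows why resolution and bit depth matter so much. This sketch assumes a US Letter page (8.5 x 11 inches) and rows padded to whole bytes. Both are assumptions for illustration, and the class RasterSize is hypothetical. It computes the uncompressed size of the bitmap before any format-specific compression is applied.

```java
final class RasterSize {
    /** Uncompressed size in bytes of a page rasterized at the given dpi and bit depth. */
    static long uncompressedBytes(double widthInches, double heightInches,
                                  int dpi, int bitsPerPixel) {
        long widthPx = Math.round(widthInches * dpi);
        long heightPx = Math.round(heightInches * dpi);
        long bytesPerRow = (widthPx * bitsPerPixel + 7) / 8; // pad each row to a whole byte
        return bytesPerRow * heightPx;
    }

    public static void main(String[] args) {
        // US Letter at 200dpi is 1700 x 2200 pixels
        System.out.println(uncompressedBytes(8.5, 11, 200, 1));  // 468600   (1bit)
        System.out.println(uncompressedBytes(8.5, 11, 200, 24)); // 11220000 (24bit RGB)
        // US Letter at 300dpi is 2550 x 3300 pixels
        System.out.println(uncompressedBytes(8.5, 11, 300, 1));  // 1052700  (1bit)
        System.out.println(uncompressedBytes(8.5, 11, 300, 24)); // 25245000 (24bit RGB)
    }
}
```

Going from 200dpi to 300dpi multiplies the pixel count by 2.25, and going from 1bit to 24bit multiplies the raw data by roughly 24, so the two choices compound.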
How to convert a PDF to a TIFF image
If you chose TIFF for your image format, creating the TIFF file is easy:
    PDFParser parser = new PDFParser(pdf);
    ColorModel cm = PDFParser.BLACKANDWHITE;   // 1bit black and white
    int dpi = 200;
    OutputStream out = new FileOutputStream("out.tif");
    parser.writeAsTIFF(out, cm, dpi);
    out.close();
This will create a 200dpi 1bit CCITT TIFF image, which is pretty typical for black and white documents. A multi-page TIFF will be created if the PDF has multiple pages, but if you'd prefer to create a single TIFF image for each page:
    PDFParser parser = new PDFParser(pdf);
    ColorModel cm = PDFParser.BLACKANDWHITE;
    int dpi = 200;
    // Keep a copy of the page list, then reduce the PDF to one page at a time
    List copy = new ArrayList(pdf.getPages());
    for (int i = 0; i < copy.size(); i++) {
        pdf.getPages().clear();
        pdf.getPages().add(copy.get(i));
        OutputStream out = new FileOutputStream("page" + i + ".tif");
        parser.writeAsTIFF(out, cm, dpi);
        out.close();
    }
The ColorModel determines the type of image created - the PDFParser class has models for RGB, CMYK and two ways of creating a 1bit image: BLACKANDWHITE or getBlackAndWhiteColorModel(). Which of these last two is fastest is theoretically dependent on the JVM and operating system:
Environment | BLACKANDWHITE | getBlackAndWhiteColorModel |
---|---|---|
OS X/Apple Java 1.5 | 10s | 25s |
OS X/Apple Java 1.6 | 10s | 24s |
Linux/Sun Java 1.5 | 5s | 25s |
Linux/Sun Java 1.6 | 5s | 20s |
Windows/Sun Java 1.4 | 6s | 27s |
Windows/Sun Java 1.5 | 6s | 24s |
Windows/Sun Java 1.6 | 5s | 21s |
I say "theoretically" because the above table looks pretty clear cut! I'm convinced we got different results last time we benchmarked this. Do check which is fastest on your system yourself.
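If you want to run that comparison yourself, a small timing harness is enough. The following is a generic sketch and not part of the BFO library; the class RasterBenchmark and its method are hypothetical names. It runs a task once untimed as a warm-up, then reports the fastest of several timed runs.

```java
final class RasterBenchmark {
    /**
     * Runs the task once as an untimed warm-up, then `runs` more times,
     * and returns the fastest timed run in milliseconds.
     */
    static long bestMillis(Runnable task, int runs) {
        task.run(); // warm-up, so class loading and JIT compilation are not measured
        long best = Long.MAX_VALUE;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            long elapsed = (System.nanoTime() - start) / 1_000_000;
            best = Math.min(best, elapsed);
        }
        return best;
    }

    public static void main(String[] args) {
        // Placeholder workload; substitute the rasterization call you want to measure.
        long ms = bestMillis(() -> {
            double sum = 0;
            for (int i = 0; i < 5_000_000; i++) {
                sum += Math.sqrt(i);
            }
            if (sum < 0) {
                System.out.println(sum); // never true; keeps the loop from being optimized away
            }
        }, 5);
        System.out.println("Best of 5: " + ms + "ms");
    }
}
```

To compare the two 1bit models, pass one task that calls parser.writeAsTIFF with PDFParser.BLACKANDWHITE and another that uses getBlackAndWhiteColorModel(), each wrapping any checked exception in a RuntimeException. Writing to a ByteArrayOutputStream rather than a file keeps disk speed out of the measurement.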
How to create a PNG or JPEG image?
If you want to create a bitmap other than a TIFF, you need to use the javax.imageio package added in Java 1.4 (there are other approaches, but this is the standard one). Broadly it goes as follows:

    PDFParser parser = new PDFParser(pdf);
    ColorModel cm = PDFParser.RGB;   // 24bit RGB
    int dpi = 200;
    // Render each page to a BufferedImage and write it as its own PNG file
    for (int i = 0; i < pdf.getNumberOfPages(); i++) {
        PagePainter painter = parser.getPagePainter(i);
        BufferedImage image = painter.getImage(dpi, cm);
        ImageIO.write(image, "PNG", new File("page" + i + ".png"));
    }
The ColorModel you choose determines what sort of output you get. The above example will create a 24bit PNG image, but if you wanted to create the image as an 8bit or 4bit grayscale PNG, use one of the following instead:
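    // 8-bit grayscale index: 256 gray levels, palette entry i has value i
    byte[] v = new byte[256];
    for (int i = 0; i < 256; i++) {
        v[i] = (byte) i;
    }
    ColorModel cm = new IndexColorModel(8, 256, v, v, v);

    // 4-bit grayscale index (an alternative to the 8-bit version, not an addition to it):
    // 16 gray levels; i * 17 spreads 0..15 evenly over 0..255
    byte[] v = new byte[16];
    for (int i = 0; i < 16; i++) {
        v[i] = (byte) (i * 17);
    }
    ColorModel cm = new IndexColorModel(4, 16, v, v, v);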
You can use the same approach as above but specify "JPEG" as the format to ImageIO.write to create JPEG images, and other formats are possible too. If you want to use an indexed PNG image that's not grayscale you can either choose the colors yourself in advance, or you can "quantize" the image down from 24bit to 8bit or less. This requires an external package such as the one supplied by GIF4J or the public domain ImageJ project (here's the source code for their quantizer).
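For the first option, choosing the colors in advance, the standard library is enough. The sketch below is an illustration only: the class FixedPalette, the 16-level gray palette and the "3-3-2" color palette (3 bits red, 3 bits green, 2 bits blue) are assumptions and not something the BFO library supplies. It is also not the adaptive quantization that GIF4J or ImageJ perform; a fixed palette is simpler but usually gives worse results on photographs.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.IndexColorModel;

final class FixedPalette {
    /** 16 evenly spaced grays from black to white (i * 17 maps 0..15 onto 0..255). */
    static IndexColorModel gray4() {
        byte[] v = new byte[16];
        for (int i = 0; i < 16; i++) {
            v[i] = (byte) (i * 17);
        }
        return new IndexColorModel(4, 16, v, v, v);
    }

    /** 256 fixed colors: 3 bits of red, 3 of green, 2 of blue, each spread over 0..255. */
    static IndexColorModel rgb332() {
        byte[] r = new byte[256];
        byte[] g = new byte[256];
        byte[] b = new byte[256];
        for (int i = 0; i < 256; i++) {
            r[i] = (byte) (((i >> 5) & 7) * 255 / 7);
            g[i] = (byte) (((i >> 2) & 7) * 255 / 7);
            b[i] = (byte) ((i & 3) * 255 / 3);
        }
        return new IndexColorModel(8, 256, r, g, b);
    }

    /** Redraws an image into a new indexed image that uses the given palette. */
    static BufferedImage toIndexed(BufferedImage src, IndexColorModel palette) {
        // Palettes of up to 16 entries use packed pixels, larger ones one byte per pixel
        int type = palette.getPixelSize() <= 4
                ? BufferedImage.TYPE_BYTE_BINARY
                : BufferedImage.TYPE_BYTE_INDEXED;
        BufferedImage dst = new BufferedImage(src.getWidth(), src.getHeight(), type, palette);
        Graphics2D g = dst.createGraphics();
        g.drawImage(src, 0, 0, null); // Java2D maps each pixel to a palette entry
        g.dispose();
        return dst;
    }
}
```

The result of toIndexed can be passed to ImageIO.write with "PNG" in the same way as any other BufferedImage.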