Extracting Images from PDF with the BFO PDF Library

Have you ever needed to extract a graph, figure or image from a PDF after losing the original file, whether it was a .docx, a .tex or whichever format you used before converting it to PDF? Well, here's how you can programmatically extract images from a PDF with BFO's PDF Library. We will even show you how to create a small web service that allows anybody to use our Java PDF Library to extract pictures from any PDF file.

Fetching images from PDFs

The first step is to open and read the PDF file:

// here is how to open the file (the BFO classes live in the org.faceless.pdf2 package).
     InputStream inputStream; // get a URL or a path to a PDF file.
     // if URL (it must include the protocol, and we need to open a stream on it):
     inputStream = new URL("https://some.example.com/here-is-a-pdf.pdf").openStream();
     // if path:
     inputStream = new FileInputStream("./path-to/a-pdf.pdf");
     // here is how to read it as PDF
     PDFReader reader = new PDFReader(inputStream);
     PDF pdf = new PDF(reader);
     // we will need a parser too
     PDFParser parser = new PDFParser(pdf);
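
If you would rather not decide between the URL and the path branch at every call site, the two can be folded into one helper. This is only a sketch using plain JDK classes: the class name PdfSource and the method open are made up for this post and are not part of the BFO API, and the prefix check is a deliberately simple assumption about what counts as a URL.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of the BFO PDF Library.
final class PdfSource {
    private PdfSource() {}

    /**
     * Opens a stream on a PDF. Strings starting with http://, https:// or file:
     * are treated as URLs; anything else is treated as a local file path.
     */
    static InputStream open(String location) throws IOException {
        boolean looksLikeUrl = location.startsWith("http://")
                || location.startsWith("https://")
                || location.startsWith("file:");
        if (looksLikeUrl) {
            // URI.create(...).toURL() rejects malformed input before any I/O happens
            return URI.create(location).toURL().openStream();
        }
        return Files.newInputStream(Paths.get(location));
    }
}
```

The returned stream can then be handed to the PDFReader exactly as in the snippet above. Remember to close it once the PDF has been read.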

Once we have a PDF object, we can list the pages and search for images. We will need a list to store the images as they are located. Once again, BFO uses well-known data structures that make us feel at home:

// get a list of all the pages
     List<PDFPage> pages = pdf.getPages();
     // create a new list to hold the images we find
     List<PageExtractor.Image> allImgs = new ArrayList<>();
     // browse each page:
     for (PDFPage page : pages) {
         // use the parser to get a PageExtractor
         PageExtractor extract = parser.getPageExtractor(page);
         // get the images from the PageExtractor
         Collection<PageExtractor.Image> imgs = extract.getImages();
         // add all the images we found to the list
         allImgs.addAll(imgs);
     }
     // we now have a list of all the images that were found in the file!

Processing images

We now have a list of PageExtractor.Image objects, the type our API uses for the images it finds on a page. Now we'll show you how to turn one of these objects into an actual image file.

The basic principle is as follows:

PageExtractor.Image bfoImg = allImgs.get(0); // one of the pictures we got at the previous step
     // first step: get the RenderedImage object from the BFO Image.
     RenderedImage img = bfoImg.getImage();
     // Create a FileOutputStream so we can store the image somewhere (path is your output directory)
     FileOutputStream fileOutputStream = new FileOutputStream(path + "/filename.png");
     // Write it.
     ImageIO.write(img, "png", fileOutputStream);
     // Close the stream once the image is written
     fileOutputStream.close();
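
To write every image from the list rather than a single one, the same idea can be wrapped in a loop. The sketch below uses only JDK classes so it can be tried on its own: the class name PngSaver, the method saveAll and the image-N.png naming scheme are assumptions made for this example, not part of the BFO API. With BFO you would fill the input list by calling getImage() on each PageExtractor.Image.

```java
import java.awt.image.RenderedImage;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of the BFO PDF Library.
final class PngSaver {
    private PngSaver() {}

    /** Writes each image to dir as image-0.png, image-1.png, ... and returns the paths written. */
    static List<Path> saveAll(List<? extends RenderedImage> images, Path dir) throws IOException {
        Files.createDirectories(dir);
        List<Path> written = new ArrayList<>();
        int index = 0;
        for (RenderedImage img : images) {
            Path target = dir.resolve("image-" + index + ".png");
            index++;
            // try-with-resources closes the stream even if writing fails
            try (OutputStream out = Files.newOutputStream(target)) {
                // ImageIO.write returns false when no suitable writer is found
                if (!ImageIO.write(img, "png", out)) {
                    throw new IOException("No PNG writer available for " + target);
                }
            }
            written.add(target);
        }
        return written;
    }
}
```

Checking the boolean returned by ImageIO.write is worth the extra line: when no writer accepts the image, the call reports it through that return value instead of throwing.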

We take an image extracted by the PDF Library, create a FileOutputStream and then use the built-in ImageIO.write method to save the RenderedImage we got from the PageExtractor.Image in a regular image format such as PNG, JPEG or GIF. There is however a caveat with OpenJDK: depending on the color model of the picture, this simple mechanism may not be able to write JPEG files. You will need to change the color model of the picture, and resort to using BufferedImage and Graphics2D to do so. Here is how, adapted from this Stack Overflow post:

PageExtractor.Image bfoImg = allImgs.get(0); // one of the pictures we got at the previous step
     // take the RenderedImage as before
     RenderedImage img = bfoImg.getImage();
     // create a BufferedImage with a color model that works with OpenJDK
     BufferedImage bufferedImage = new BufferedImage(img.getWidth(), img.getHeight(), BufferedImage.TYPE_3BYTE_BGR);
     // get a Graphics2D object from the buffered image so we can "paint" it
     Graphics2D g2d = bufferedImage.createGraphics();
     // "paint" the graphic (the second parameter is a transformation)
     g2d.drawRenderedImage(img, null);
     // we don't need g2d anymore, let's release the resources
     g2d.dispose();
     // same as before create a place to write the image to
     FileOutputStream fileOutputStream = new FileOutputStream(path + "/filename.jpg");
     // we can now perform the JPEG writing!
     ImageIO.write(bufferedImage, "jpg", fileOutputStream);
     // close the stream once the image is written
     fileOutputStream.close();
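
The conversion step is easy to try without a PDF at hand. Here is a self-contained sketch using only JDK classes. The class name JpegWriter and its two methods are made up for this example and are not part of the BFO API. Filling the background with white first is our own choice, made so that transparent areas do not come out black.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.geom.AffineTransform;
import java.awt.image.BufferedImage;
import java.awt.image.RenderedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of the BFO PDF Library.
final class JpegWriter {
    private JpegWriter() {}

    /** Repaints any RenderedImage onto an opaque TYPE_3BYTE_BGR image. */
    static BufferedImage toOpaqueBgr(RenderedImage img) {
        BufferedImage copy = new BufferedImage(
                img.getWidth(), img.getHeight(), BufferedImage.TYPE_3BYTE_BGR);
        Graphics2D g2d = copy.createGraphics();
        try {
            // assumption: white background behind transparent pixels
            g2d.setColor(Color.WHITE);
            g2d.fillRect(0, 0, copy.getWidth(), copy.getHeight());
            // identity transform: draw the image as-is
            g2d.drawRenderedImage(img, new AffineTransform());
        } finally {
            g2d.dispose(); // always release the graphics resources
        }
        return copy;
    }

    /** Encodes the image as JPEG in memory, converting its color model first. */
    static byte[] toJpegBytes(RenderedImage img) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (!ImageIO.write(toOpaqueBgr(img), "jpg", out)) {
            throw new IOException("No JPEG writer available");
        }
        return out.toByteArray();
    }
}
```

Returning the bytes instead of writing straight to a file keeps the helper independent of where the picture ends up, whether on disk or in an HTTP response.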

In the next blog post, we will look at creating an image extraction webapp using the PDF Library API.

Leo Jeusset
Freelance developer and BFO guest blogger