Have you ever needed to extract a graph, figure or image from a PDF after losing the original file, whether it was a .docx, a .tex or whichever format you used before converting it to PDF? Well, here's how you can programmatically extract images from a PDF with BFO's PDF Library.
We will even show you how to create a small web service that allows anybody to use our Java PDF Library to extract pictures from any PDF file.
Fetching images from PDFs
The first step is to open and read the PDF file:
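// imports used in this snippet
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URL;
import org.faceless.pdf2.*;

// here is how to open the file: get a stream from a URL or from a path
InputStream inputStream;
// if URL (the protocol is required, and openStream() gives us the InputStream):
inputStream = new URL("https://some.example.com/here-is-a-pdf.pdf").openStream();
// if path:
inputStream = new FileInputStream("./path-to/a-pdf.pdf");
// here is how to read it as a PDF
PDFReader reader = new PDFReader(inputStream);
PDF pdf = new PDF(reader);
// we will need a parser too
PDFParser parser = new PDFParser(pdf);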
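If you do not know in advance whether you will be handed a URL or a local path, the two branches above can be folded into one small helper. This is only a sketch using the plain JDK: the class name PdfSource and its methods are made up for this post and are not part of the BFO API, and it assumes that anything not starting with http:// or https:// is a local path.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of the BFO PDF Library.
class PdfSource {

    /** Returns true when the location starts with http:// or https:// (case-insensitive). */
    static boolean isRemote(String location) {
        return location.regionMatches(true, 0, "http://", 0, 7)
                || location.regionMatches(true, 0, "https://", 0, 8);
    }

    /** Opens a stream on a remote URL or, failing that, on a local file path. */
    static InputStream open(String location) throws IOException {
        if (isRemote(location)) {
            return URI.create(location).toURL().openStream();
        }
        return Files.newInputStream(Paths.get(location));
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: PdfSource <url-or-path>");
            return;
        }
        // try-with-resources closes the stream once we are done with it
        try (InputStream in = open(args[0])) {
            System.out.println("Opened " + args[0] + ", first byte: " + in.read());
        }
    }
}
```

The stream returned by open(...) can then be handed to new PDFReader(...) exactly as in the snippet above.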
Once we have a PDF object, we can list the pages and search for images. We will need a list to store the images as they are located. Once again, BFO uses well-known data structures that make us feel at home:
// get a list of all the pages
List<PDFPage> pages = pdf.getPages();
// create a new list to hold the images we find
List<PageExtractor.Image> allImgs = new ArrayList<>();
// browse each page:
for (PDFPage page : pages) {
    // use the parser to get a PageExtractor
    PageExtractor extract = parser.getPageExtractor(page);
    // get the images from the PageExtractor
    Collection<PageExtractor.Image> imgs = extract.getImages();
    // add all the images we found to the list
    allImgs.addAll(imgs);
}
// we now have a list of all the images that were found in the file!
Processing images
We now have a list of images of type PageExtractor.Image, the class our API uses to represent an image found on a page.
Now we'll show you how to turn such an object into an actual image file.
The basic principle is as follows:
// one of the pictures we got at the previous step (here, simply the first one)
PageExtractor.Image bfoImg = allImgs.get(0);
// first step: get the RenderedImage object from the BFO image
RenderedImage img = bfoImg.getImage();
// create a FileOutputStream so we can store the image somewhere
// (path is the directory you want to write to); try-with-resources closes it for us
try (FileOutputStream fileOutputStream = new FileOutputStream(path + "/filename.png")) {
    // write it
    ImageIO.write(img, "png", fileOutputStream);
}
We take an image extracted by the PDF Library, create a FileOutputStream and then use the write method of the built-in ImageIO class to encode the RenderedImage we got from the PageExtractor.Image in a regular bitmap format such as PNG, JPEG or GIF.
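One detail worth knowing: ImageIO.write does not throw when it cannot find a writer for the image, it just returns false. The sketch below wraps the call so that this case fails loudly, and adds a tiny naming helper for when you save a whole list of images. The class ImageWriterSketch and its methods are hypothetical names chosen for this post, and a blank BufferedImage stands in for the RenderedImage you would get from bfoImg.getImage().

```java
import java.awt.image.BufferedImage;
import java.awt.image.RenderedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Locale;
import javax.imageio.ImageIO;

// Hypothetical helper for illustration; not part of the BFO PDF Library.
class ImageWriterSketch {

    /** Builds a zero-padded file name for the image at the given index. */
    static String fileName(int index, String format) {
        return String.format(Locale.ROOT, "image-%03d.%s", index, format);
    }

    /** Encodes img into out, failing loudly if ImageIO has no suitable writer. */
    static void write(RenderedImage img, String format, OutputStream out) {
        try {
            // ImageIO.write returns false (no exception) when no writer accepts the image
            if (!ImageIO.write(img, format, out)) {
                throw new IllegalStateException("No ImageIO writer for format " + format);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for bfoImg.getImage(): a small blank RGB image
        RenderedImage img = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        write(img, "png", out);
        System.out.println(fileName(0, "png") + ": " + out.size() + " bytes");
    }
}
```

In a real extraction loop you would pass a FileOutputStream pointing at path + "/" + fileName(i, "png") instead of the in-memory stream used here.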
There is, however, a caveat with OpenJDK: this simple mechanism does not work for JPEG files when the JPEG writer does not support the picture's color model (typically, pictures with an alpha channel).
You will need to change the color model of the picture and resort to using BufferedImage and Graphics2D to do so.
Here is how, adapted from this Stack Overflow post:
// one of the pictures we got at the previous step (here, simply the first one)
PageExtractor.Image bfoImg = allImgs.get(0);
// take the RenderedImage as before
RenderedImage img = bfoImg.getImage();
// create a BufferedImage with a color model that works with OpenJDK
BufferedImage bufferedImage = new BufferedImage(img.getWidth(), img.getHeight(), BufferedImage.TYPE_3BYTE_BGR);
// get a Graphics2D object from the buffered image so we can "paint" it
Graphics2D g2d = bufferedImage.createGraphics();
// "paint" the graphic (the second parameter is a transformation)
g2d.drawRenderedImage(img, null);
// we don't need g2d anymore, let's release the resources
g2d.dispose();
// same as before, create a place to write the image to
try (FileOutputStream fileOutputStream = new FileOutputStream(path + "/filename.jpg")) {
    // we can now perform the JPEG writing!
    ImageIO.write(bufferedImage, "jpg", fileOutputStream);
}
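The conversion above can be packaged as a reusable helper. The following is a sketch under a few assumptions of ours: the class JpegSketch and its method names are hypothetical, the background is filled with white first (otherwise transparent areas come out black, since a fresh TYPE_3BYTE_BGR image starts out black), and an explicit identity AffineTransform is passed rather than null. A transparent BufferedImage stands in for the image extracted from the PDF.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.geom.AffineTransform;
import java.awt.image.BufferedImage;
import java.awt.image.RenderedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical helper for illustration; not part of the BFO PDF Library.
class JpegSketch {

    /** Paints img onto a white, opaque TYPE_3BYTE_BGR image that the JPEG writer accepts. */
    static BufferedImage toOpaqueRgb(RenderedImage img) {
        BufferedImage rgb = new BufferedImage(
                img.getWidth(), img.getHeight(), BufferedImage.TYPE_3BYTE_BGR);
        Graphics2D g2d = rgb.createGraphics();
        try {
            // fill with white so transparent pixels do not end up black
            g2d.setColor(Color.WHITE);
            g2d.fillRect(0, 0, rgb.getWidth(), rgb.getHeight());
            // identity transform: draw the image as-is at the origin
            g2d.drawRenderedImage(img, new AffineTransform());
        } finally {
            g2d.dispose();
        }
        return rgb;
    }

    /** Converts img to opaque RGB and returns it encoded as JPEG bytes. */
    static byte[] toJpegBytes(RenderedImage img) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            if (!ImageIO.write(toOpaqueRgb(img), "jpg", out)) {
                throw new IllegalStateException("No JPEG writer available");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Stand-in for bfoImg.getImage(): a fully transparent ARGB image
        RenderedImage img = new BufferedImage(8, 8, BufferedImage.TYPE_INT_ARGB);
        System.out.println("JPEG size in bytes: " + toJpegBytes(img).length);
    }
}
```

Whether white is the right background depends on where the pictures will be shown, so treat that choice as a default to adjust rather than a rule.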
In the next blog post, we will look at creating an image extraction webapp using the PDF Library API.
Leo Jeusset, freelance developer and BFO guest blogger
https://twitter.com/leojpod