Text Extraction Using BFO's PDF Library

Text extraction is often a key part of PDF processing. Whether the goal is archiving, form parsing or something else, at some point you will want to extract meaningful information from a PDF file. Sadly, there is no magic formula for extracting particular pieces of information; the problem usually has to be handled case by case. For instance, let's see how we can extract the name, email address and other customer details from this fictional "Hostel registration" file.

First try

The first step is to see whether the information we want is available in the document. The easiest way to do that is to ask the PDF Library to extract every piece of text it can get from the page, using the method getTextInDisplayOrder.

Here's how it looks:

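import java.io.File;
import org.faceless.pdf2.*;

// Load the PDF and create a parser for it
PDF listing = new PDF(new PDFReader(new File("listing.pdf")));
PDFParser parser = new PDFParser(listing);
// Get the extractor for the page at index 0
PageExtractor pageExtractor = parser.getPageExtractor(0);
// Print each extracted piece of text on its own line
for (PageExtractor.Text t : pageExtractor.getTextInDisplayOrder()) {
    System.out.println(t.getText());
}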

If you ran this small example on the sample document, you will have seen both good news and bad news. The good news: every cell of the table is extracted on its own! There is no need to implement text splitting or to guess where the name stops and the email starts. The bad news: it is not going to be easy to know which PageExtractor.Text object is which. So now what?

Second try: seek and extract

When in doubt, check the documentation! In it you will find two methods of Text that will help us: getRowNext and getRowPrevious. We will also use a regular expression and the PageExtractor method getMatchingText. By isolating the email column (easy to match with a regular expression), we can now easily handle the rows one by one:

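import java.util.regex.Pattern;
import org.faceless.pdf2.PageExtractor.Text;

// The dot is written as \\. because this is a Java string literal
Pattern emailCatcher = Pattern.compile("^[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,6}$", Pattern.CASE_INSENSITIVE);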
PageExtractor pageExtractor = parser.getPageExtractor(0);
for (Text t : pageExtractor.getMatchingText(emailCatcher)) {
    // Each row has an email address, which makes rows easy to spot.
    // A row looks like this:
    // | name | em@ai.l | arrival | departure | (note) |

    // t points to the email block, so extract its text right away.
    String email = t.getText();

    // Then look for the name block, just before it in the row.
    Text nameHolder = t.getRowPrevious();
    String name = nameHolder.getText();

    // Now access the arrival and departure blocks:
    Text arrivalHolder = t.getRowNext();
    String arrivalString = arrivalHolder.getText();
    // Split arrivalString (e.g. 11/11/2011) into [11, 11, 2011]
    String[] arrival = arrivalString.split("/");

    Text departureHolder = arrivalHolder.getRowNext();
    String departureString = departureHolder.getText();
    String[] departure = departureString.split("/");

    String notes = null;
    Text notesHolder = departureHolder.getRowNext();
    if (notesHolder != null) {
        notes = notesHolder.getText();
    }
}
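
Before pointing the pattern at a real PDF, it can help to try it on a few plain strings. The sketch below is standalone Java with no PDF Library calls; the class name EmailPatternCheck, the helper looksLikeEmail and the sample cells are made up for illustration. It uses Matcher.matches on the whole cell, which may not be exactly how getMatchingText applies the pattern.

```java
import java.util.regex.Pattern;

class EmailPatternCheck {
    // Same expression as above; in a Java string literal the dot is written as \\.
    static final Pattern EMAIL = Pattern.compile(
            "^[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,6}$", Pattern.CASE_INSENSITIVE);

    // True when the whole (trimmed) cell text looks like an email address.
    static boolean looksLikeEmail(String cellText) {
        return cellText != null && EMAIL.matcher(cellText.trim()).matches();
    }

    public static void main(String[] args) {
        // Hypothetical cell contents, in the order a row might be extracted.
        String[] cells = {"John Smith", "john.smith@example.com", "11/11/2011", "late arrival"};
        for (String cell : cells) {
            System.out.println(cell + " -> " + looksLikeEmail(cell));
        }
    }
}
```

Note that the {2,6} limit on the last part rejects addresses with longer top-level domains such as .technology, so you may want to loosen it for real data.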

And job done!
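
One possible refinement: rather than keeping the dates as string arrays, they could be turned into java.time.LocalDate values. This is only a sketch. It assumes the dates are written day first (11/11/2011 does not tell us which), and StayDates and parse are hypothetical names, not part of the PDF Library.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

class StayDates {
    // Assumption: day/month/year. Use "MM/dd/yyyy" if the PDF puts the month first.
    static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("dd/MM/yyyy");

    // Returns the parsed date, or null when the cell is missing or is not a date.
    static LocalDate parse(String cellText) {
        if (cellText == null) {
            return null;
        }
        try {
            return LocalDate.parse(cellText.trim(), FORMAT);
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Hypothetical cell contents for the arrival and departure columns.
        LocalDate arrival = parse("11/11/2011");
        LocalDate departure = parse("14/11/2011");
        System.out.println(arrival + " to " + departure);
    }
}
```

In the loop above you would then call StayDates.parse(arrivalHolder.getText()) in place of the split, and check the result for null before using it.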

For the complete example look here.

Tags: text extraction

Posted by Leo Jeusset on 08 Jul 2014 at 15:55
