<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="blog.xsl"?>
<article>
 <title>Text Extraction Using BFO's PDF Library</title>
 <subtitle/>
 <excerpt>How to extract text from a PDF using BFO's PDF Library API, shown step by step with code examples.</excerpt>
 <time>2014-07-08T15:55:00</time>
 <author>leo</author>
 <category>pdf</category>
 <tags>text extraction</tags>
 <body>
  <p>
   Text extraction is often a key part of PDF processing. Whether it is for archiving, form parsing or something else, at some point you
   will want to extract meaningful information from a PDF file. Sadly, there is no magic formula when it comes to extracting particular pieces of
   information, and one often has to handle the problem on a case-by-case basis. For instance, let's see how we can extract the name, email
   address and other customer details from this fictional "Hostel registration" <a href="listing.pdf">file</a>.
  </p>
  <h2>First try</h2>
  <p>
   The first step is to see whether the information we want is available in the document. The easiest way to do that is to ask the PDF Library to
   extract every piece of text it can find on the page, using the method
   <a href="/products/pdf/docs/api/org/faceless/pdf2/PageExtractor.html#getTextInDisplayOrder()"><code>getTextInDisplayOrder</code></a>.
  </p>
  <p>
   Here's how it looks:
  </p>
  <pre class="brush:java">
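// Requires: import java.io.File; import org.faceless.pdf2.*;
// Load the document and get an extractor for its first page (index 0)
PDF listing = new PDF(new PDFReader(new File("listing.pdf")));
PDFParser parser = new PDFParser(listing);
PageExtractor pageExtractor = parser.getPageExtractor(0);
// Print each extracted piece of text on its own line
for (PageExtractor.Text t : pageExtractor.getTextInDisplayOrder()) {
    System.out.println(t.getText());
}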
  </pre>
  <p>
   If you ran this small example on the sample document, you should see both the good news and the bad news. The good news: every
   cell of the table is extracted on its own! There is no need for us to implement any text splitting or to guess where the name stops and where
   the email starts. The bad news: it is not going to be easy to know which
   <a href="/products/pdf/docs/api/org/faceless/pdf2/PageExtractor.Text.html">PageExtractor.Text</a> object holds which value. So now what?
  </p>
  <h2>Second try: seek and extract</h2>
  <p>
   When in doubt, check the documentation! It lists two methods of <code>Text</code> that will help us:
   <a href="/products/pdf/docs/api/org/faceless/pdf2/PageExtractor.Text.html#getRowNext()">getRowNext</a> and
   <a href="/products/pdf/docs/api/org/faceless/pdf2/PageExtractor.Text.html#getRowPrevious()">getRowPrevious</a>.
   We will also use a regular expression with the <a href="/products/pdf/docs/api/org/faceless/pdf2/PageExtractor.html">PageExtractor</a> method
   <a href="/products/pdf/docs/api/org/faceless/pdf2/PageExtractor.html#getMatchingText(java.util.regex.Pattern)">getMatchingText</a>.
   Because the email column is easy to match with a regular expression, we can use it to locate each row and then handle the rows one by one:
  </p>
<pre class="brush:java">
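// Requires: import java.util.regex.Pattern; import org.faceless.pdf2.PageExtractor.Text;
// Note the doubled backslash: "\\." is a literal dot once the Java string is compiled as a regex
Pattern emailCatcher = Pattern.compile("^[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,6}$", Pattern.CASE_INSENSITIVE);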
PageExtractor pageExtractor = parser.getPageExtractor(0);
for (Text t : pageExtractor.getMatchingText(emailCatcher)) {
    // Every row has an email address, so the email cell is easy to spot.
    // A row looks like this:
    // | name | em@ai.l | arrival | departure | (note) |

    // t points to the email block, so extract its text right away.
    String email = t.getText();

    // then let's look for the name block
    Text nameHolder = t.getRowPrevious();
    String name = nameHolder.getText();

    // now let's access the arrival and departure block:

    Text arrivalHolder = t.getRowNext();
    String arrivalString = arrivalHolder.getText();
    // split arrivalString (e.g. 11/11/2011) to [ 11, 11, 2011]
    String[] arrival = arrivalString.split("/");

    Text departureHolder = arrivalHolder.getRowNext();
    String departureString = departureHolder.getText();
    String[] departure = departureString.split("/");

    String notes = null;
    Text notesHolder = departureHolder.getRowNext();
    if (notesHolder != null) {
        notes = notesHolder.getText();
    }
}
</pre>
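  <p>
   The loop above leaves the dates as arrays of strings. As a minimal sketch of one possible next step, the plain-Java example below turns such a
   string into a <code>java.time.LocalDate</code> (Java 8 or later) and double-checks an email string against the same kind of pattern. The class
   name <code>RowValues</code> and the helpers <code>parseDate</code> and <code>looksLikeEmail</code> are illustrative names, not part of the PDF
   Library, and the day/month/year order is an assumption, since 11/11/2011 reads the same either way.
  </p>

```java
import java.time.LocalDate;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the PDF Library.
class RowValues {
    // Email pattern written as a Java string literal ("\\." is a literal dot).
    static final Pattern EMAIL = Pattern.compile(
        "^[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,6}$", Pattern.CASE_INSENSITIVE);

    // Assumption: dates are written day/month/year, e.g. 25/12/2011.
    static LocalDate parseDate(String text) {
        String[] parts = text.trim().split("/");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Not a d/m/y date: " + text);
        }
        int day = Integer.parseInt(parts[0]);
        int month = Integer.parseInt(parts[1]);
        int year = Integer.parseInt(parts[2]);
        // LocalDate.of rejects impossible dates such as 31/02/2011
        return LocalDate.of(year, month, day);
    }

    // True when the whole (trimmed) string matches the email pattern.
    static boolean looksLikeEmail(String text) {
        return EMAIL.matcher(text.trim()).matches();
    }

    public static void main(String[] args) {
        LocalDate arrival = parseDate("11/11/2011");
        LocalDate departure = parseDate("14/11/2011");
        System.out.println(arrival + " to " + departure);
        System.out.println(looksLikeEmail("jane.doe@example.com"));
    }
}
```

  <p>
   Inside the loop, these helpers would be called with the strings returned by <code>getText()</code>, for example
   <code>parseDate(arrivalString)</code>. If the sample document turns out to use month/day/year, swap the first two indices.
  </p>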
  <p>And the job is done!</p>
  <p>The complete example is available <a viewtext="true" href="ListParser.java">here</a>.</p>
 </body>
</article>