PDF Text Extraction in Java

Most PDF documents are not editable, which makes converting a PDF to text a tedious, if not impossible, task, especially when the solution involves mass processing of PDF documents.

We incorporated text extraction functionality in our Java PDF Library way back in 2005 with the release of version 2.6.2. Text extraction is done using the PageExtractor class. You can get one of these by calling the PDFParser.getPageExtractor() method for the appropriate page.

It’s important to remember that text extraction from PDF is not an exact science. The reasons include:

Fonts may have no Unicode information or be encoded incorrectly, making it impossible to convert back to text
Text may be rendered as an image, making extraction impossible
Layout may rely on visual features of the glyphs. It is not always obvious how much space lies between the end of one glyph and the start of the next
Features such as superscript, subscript, overlaid or rotated text cannot be accurately represented in a plain-text layout
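
To illustrate the glyph-spacing problem, here is a minimal sketch of one common heuristic: insert a space when the horizontal gap between two glyphs is large relative to the previous glyph's width. The Glyph class, the join method and the 0.3 threshold are all assumptions made for this example. They are not part of our library and are not necessarily how it decides on spacing.

```java
import java.util.ArrayList;
import java.util.List;

final class GapSpacing {
    // Hypothetical model of one positioned glyph on a single line of text.
    static final class Glyph {
        final char ch;
        final double x;      // left edge, in points
        final double width;  // advance width, in points

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    // Joins glyphs (already sorted left to right), adding a space wherever
    // the gap after the previous glyph exceeds threshold * its width.
    static String join(List<Glyph> glyphs, double threshold) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > threshold * prev.width) {
                    out.append(' ');
                }
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<>();
        line.add(new Glyph('H', 0, 6));
        line.add(new Glyph('i', 6.2, 3));   // small gap: same word
        line.add(new Glyph('y', 12, 5));    // large gap: new word
        line.add(new Glyph('o', 17.1, 5));  // small gap: same word
        System.out.println(join(line, 0.3));
    }
}
```

Any fixed threshold will be wrong for some fonts, since tightly kerned or widely tracked text shifts what counts as a "large" gap. That is one reason the results cannot be exact.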

Therefore, results cannot be guaranteed to be 100% accurate. Nevertheless, most modern PDFs can have their content extracted reliably, with a plain-text layout that approximates the PDF as closely as possible.

Other methods in the PageExtractor class return the list of text objects as a Collection. This not only provides information on exact page position, color and font, but can also be modified to improve the text extraction process. For example, if you know you don't want any rotated text to be extracted, you can delete all the rotated text objects before running getTextAsStringBuffer.
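
The idea of pruning rotated text before extraction can be sketched as follows. The TextObject class and its rotationDegrees field are hypothetical stand-ins invented for this example. The library's own text object class and accessors will differ, so treat this only as an outline of the filtering step.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

final class RotatedTextFilter {
    // Hypothetical stand-in for a text object returned by the extractor.
    static final class TextObject {
        final String text;
        final double rotationDegrees;

        TextObject(String text, double rotationDegrees) {
            this.text = text;
            this.rotationDegrees = rotationDegrees;
        }
    }

    // True if the angle, normalised to [0, 360), is not approximately zero.
    static boolean isRotated(TextObject t) {
        double r = ((t.rotationDegrees % 360) + 360) % 360;
        double eps = 0.01;
        return r > eps && r < 360 - eps;
    }

    // Removes all rotated objects from the collection, in place.
    static void removeRotated(Collection<TextObject> objects) {
        objects.removeIf(RotatedTextFilter::isRotated);
    }

    public static void main(String[] args) {
        List<TextObject> objects = new ArrayList<>();
        objects.add(new TextObject("Invoice", 0));
        objects.add(new TextObject("DRAFT", 45));   // e.g. a diagonal watermark
        objects.add(new TextObject("Total", 0));
        removeRotated(objects);
        for (TextObject t : objects) {
            System.out.println(t.text);
        }
    }
}
```

In a real program the collection would come from the PageExtractor, and the text would be extracted only after the unwanted objects had been removed.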

Who Needs Text Extraction?

Business environments involved with data mining, content management systems and form processing will find text extraction particularly useful. Text extraction can assist with:

Archiving: text and its components can be extracted so documents can be indexed and archived while remaining fully searchable
Extracting and processing data in forms
Extracting information such as invoice data, mailing addresses and phone numbers for administration purposes
Extracting photos and images
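
As an example of the kind of post-processing mentioned above, the sketch below pulls phone-number-like strings out of text that has already been extracted. It uses only the standard java.util.regex package. The PhoneNumberFinder class and its deliberately loose pattern are assumptions made for illustration, and a real deployment would need a pattern tuned to the number formats it expects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PhoneNumberFinder {
    // Illustrative pattern: an optional "+", then 7 to 15 digits, each
    // optionally preceded by a single space, dot or hyphen.
    private static final Pattern PHONE =
        Pattern.compile("\\+?\\d(?:[ .-]?\\d){6,14}");

    // Returns every match in the order it appears in the text.
    static List<String> find(CharSequence text) {
        List<String> found = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Call +44 20 7946 0958 or 555-0100 today. Invoice 42.";
        for (String number : find(text)) {
            System.out.println(number);
        }
    }
}
```

Because find accepts any CharSequence, a StringBuffer such as the one returned by getTextAsStringBuffer can be passed in directly. Note that a pattern this loose will also match other long digit runs, such as account or invoice numbers.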

You can see the results for yourself by downloading a free, fully functional trial version of the PDF Library.

Tags: extraction

Posted by Dan Wilson on 16 Nov 2011 at 04:06

Comments

Re: PDF Text Extraction in Java by Jessica on 29 Nov 2011 at 11:11
You're doing a great work :) Your PDF library has the BEST text extraction across any other java library, and I've tested ALL the libraries I've found with Google. Kisses, Jessica.
- Re: PDF Text Extraction in Java by Mike Bremford on 30 Nov 2011 at 10:11
  Thanks Jessica - very good of you to say so while we still have exceptions in some of your documents! We're working on those now.
