PDF Text Extraction in Java

Most PDF documents are not editable, which makes converting a PDF to text a tedious, if not impossible, task, especially when the solution involves mass processing of PDF documents.

We incorporated text extraction functionality in our Java PDF Library way back in 2005 with the release of version 2.6.2. Text extraction is done using the PageExtractor class. You can get one of these by calling the PDFParser.getPageExtractor() method for the appropriate page.

It’s important to remember that text extraction from PDF is not an exact science. The reasons include:

  • Fonts may have no Unicode information or be encoded incorrectly, making it impossible to convert back to text
  • Text may be rendered as an image, making extraction impossible
  • Layout may rely on visual features of the glyphs. It is not always obvious how much space lies between the end of one glyph and the start of the next
  • Layout features such as superscript, subscript, overlaid or rotated text cannot be accurately represented in plain text
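
The glyph-spacing problem can be illustrated with a small sketch. This is not the library's algorithm. It assumes a hypothetical Glyph type holding a character and its left and right x-coordinates, and it treats any gap wider than a chosen fraction of the font size as a word break. That fraction is a guess, and changing it changes the extracted text, which is exactly why the result is not an exact science.

```java
import java.util.List;

/** Illustrative only: rebuilds one line of text from positioned glyphs. */
final class GlyphSpacing {

    /** Hypothetical glyph: a character plus its horizontal extent in points. */
    static final class Glyph {
        final char ch;
        final double left;
        final double right;

        Glyph(char ch, double left, double right) {
            this.ch = ch;
            this.left = left;
            this.right = right;
        }
    }

    /**
     * Joins glyphs that are assumed to be sorted left to right on a single
     * baseline. A space is inserted wherever the gap between one glyph's right
     * edge and the next glyph's left edge exceeds gapRatio * fontSize.
     */
    static String joinLine(List<Glyph> glyphs, double fontSize, double gapRatio) {
        StringBuilder out = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : glyphs) {
            if (previous != null && g.left - previous.right > gapRatio * fontSize) {
                out.append(' ');
            }
            out.append(g.ch);
            previous = g;
        }
        return out.toString();
    }
}
```

With a 10pt font, glyphs "a" (0 to 5), "b" (5.5 to 10) and "c" (14 to 19) come out as "ab c" when gapRatio is 0.25, but as "abc" when it is 0.5.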

Therefore, results cannot be guaranteed to be 100% accurate. Nevertheless, most modern PDFs can have their content extracted reliably, with the layout approximating the PDF as closely as is possible in plain text.

Other methods in the PageExtractor class return the list of text objects as a Collection. This not only provides information on exact page position, color and font, but can also be modified to improve the text extraction process. For example, if you know you don't want any rotated text to be extracted, you can delete all the rotated text objects before running getTextAsStringBuffer.
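
The pre-filtering idea can be sketched as follows. So that the example runs without the library, it uses a hypothetical TextRun class as a stand-in for the library's text objects. The real class and its accessor names may differ, so check the API documentation before adapting it. The tolerance parameter is also an assumption, included because text that is nominally horizontal is not always at exactly 0 degrees.

```java
import java.util.Collection;

/** Illustrative only: drops rotated text from a mutable collection of text runs. */
final class RotatedTextFilter {

    /** Hypothetical stand-in for a text object: its content and rotation in degrees. */
    static final class TextRun {
        final String text;
        final double angle;

        TextRun(String text, double angle) {
            this.text = text;
            this.angle = angle;
        }
    }

    /** Maps any angle in degrees onto the range (-180, 180]. */
    static double normalize(double angle) {
        double a = angle % 360.0;
        if (a > 180.0) {
            a -= 360.0;
        } else if (a <= -180.0) {
            a += 360.0;
        }
        return a;
    }

    /**
     * Removes, in place, every run whose rotation differs from horizontal
     * by more than toleranceDegrees. The collection must be modifiable.
     */
    static void removeRotated(Collection<TextRun> runs, double toleranceDegrees) {
        runs.removeIf(run -> Math.abs(normalize(run.angle)) > toleranceDegrees);
    }
}
```

Filtering the collection before building the plain-text output keeps items such as diagonal watermarks or vertical margin labels from being interleaved with the body text.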

Who Needs Text Extraction?

Business environments involved in data mining, content management systems and form processing will find text extraction particularly useful. Text extraction can assist with:

  • Archiving: text and its components can be extracted so that documents can be indexed and archived while remaining fully searchable
  • Extracting and processing data in forms
  • Extracting information such as invoice data, mailing addresses and phone numbers for administration purposes
  • Extracting photos and images
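
Once a page's text is available as a string, pulling out items such as phone numbers is ordinary string processing. The sketch below uses only java.util.regex and does not touch the PDF library. The pattern is an assumption that covers simple North American style numbers such as 555-123-4567 or (555) 987-6543. Real documents will need a pattern suited to their own formats.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: finds phone-number-like strings in already extracted text. */
final class PhoneNumberFinder {

    // Assumed format: three digits (optionally in parentheses), a separator,
    // three digits, a separator, four digits. Separators are '-', '.' or ' '.
    private static final Pattern PHONE =
            Pattern.compile("\\(?\\b\\d{3}\\)?[-. ]\\d{3}[-. ]\\d{4}\\b");

    /** Returns every match of the assumed phone pattern, in order of appearance. */
    static List<String> find(String extractedText) {
        List<String> numbers = new ArrayList<>();
        Matcher m = PHONE.matcher(extractedText);
        while (m.find()) {
            numbers.add(m.group());
        }
        return numbers;
    }
}
```

Because extracted text only approximates the page layout, a number that is split across lines or columns in the PDF may not match, so results from this kind of post-processing are worth spot-checking.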

You can see the results for yourself by downloading a free, fully functional trial version of the PDF Library.