java.lang.Object
- org.faceless.pdf2.PageExtractor

```
public class PageExtractor
extends Object
```
This class enables the extraction of text and images from a PDFPage. You can get a PageExtractor by calling the PDFParser.getPageExtractor(int) method, provided the PDF's permissions allow you to extract text and/or images.
Once you have one, you can extract the text of the page as a StringBuffer by calling getTextAsStringBuffer(). Note that extracting text from PDFs is not an exact science - the internals of a PDF allow text to be displayed in any order, and features like superscript, subscript and rotated text, which are easy to display in PDF, can only be approximated in plain text.
Features like tables have to be determined using heuristics, and some PDFs are encoded in a way that makes extracting their text almost impossible (storing each letter as an image, for example).
Depending on how the font has been stored, the library may replace unknown characters with a Unicode character in the private range (U+EF00 - U+EFFF). These replacements will be consistent, so if you find that U+EF01 is in fact the letter 'A', you can easily run a String.replace() on the string to correct the letters.
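The private-use correction described above can be sketched in plain Java. This is only an illustration: the mapping (U+EF01 to 'A', U+EF02 to 'B') and the class name `PrivateUseFixer` are hypothetical, and the real mapping has to be worked out by inspecting each document.

```java
import java.util.Map;

class PrivateUseFixer {
    // Hypothetical mapping, discovered by inspecting one particular document.
    static final Map<Character, Character> FIXES = Map.of('\uEF01', 'A', '\uEF02', 'B');

    /** Replace each mapped private-use character; all other characters pass through unchanged. */
    static String fix(CharSequence extracted) {
        StringBuilder out = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            char fixed = FIXES.getOrDefault(c, c);
            out.append(fixed);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the text returned by getTextAsStringBuffer()
        System.out.println(fix("\uEF01\uEF02C")); // prints "ABC"
    }
}
```

Because `fix` accepts any CharSequence, the StringBuffer returned by getTextAsStringBuffer() could be passed in directly.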
Extracting bitmap images is a much simpler process. The PageExtractor.Image class represents an image on the current page. There is one instance for each time an image is drawn, so if an image is repeated, several instances may contain the same RenderedImage. You can retrieve the list of images by calling the getImages() method.
This class requires the Extended Edition plus Viewer license to operate. Although it may be freely used in the trial version of the library, the extracted text will have the letter 'e' replaced with the letter 'a'.

Since:

2.6.2

Nested Class Summary

Nested Classes
Modifier and Type	Class	Description
`class`	`PageExtractor.Image`	A class representing a bitmap image which is extracted from the `PageExtractor`.
`class`	`PageExtractor.Text`	A class representing a piece of text which is extracted from the `PageExtractor`.

Field Summary

Fields
Modifier and Type	Field	Description
`static Comparator<PageExtractor.Text>`	`DISPLAYORDER`	A Comparator which can be used to sort `PageExtractor.Text` objects into their "display" order - the order which they visibly appear on the page, and the order that is returned by `getTextInDisplayOrder()`
`static Comparator<PageExtractor.Text>`	`NATURALORDER`	A Comparator which can be used to sort `PageExtractor.Text` objects into their "natural" order - the order which they occur in the PDF page stream, and the order that is returned by `getTextUnordered()`

Method Summary

Modifier and Type	Method	Description
`static Collection<PageExtractor.Text>`	`cropText(Collection<PageExtractor.Text> all, Shape shape)`	Given a Collection of `PageExtractor.Text` items, as returned by `getMatchingText()`, `getTextUnordered()` or `getTextInDisplayOrder()`, return a new Collection which contains only Text that falls completely inside the specified `Shape`.
`Collection<PageExtractor.Image>`	`getImages()`	Return every `PageExtractor.Image` on the page, in the order they were added to the page.
`Collection<PageExtractor.Text>`	`getMatchingNormalizedText(String[] queries, boolean caseinsensitive)`	Returns a Collection of `PageExtractor.Text` objects on this page that match any of the specified substrings, based on normalized text.
`Collection<PageExtractor.Text>`	`getMatchingNormalizedText(Pattern pattern)`	Returns a Collection of `PageExtractor.Text` objects on this page that match the specified regular expression, based on normalized text.
`Collection<PageExtractor.Text>`	`getMatchingText(String query)`	Return a Collection of `PageExtractor.Text` items on this page that are equal to the specified substring.
`Collection<PageExtractor.Text>`	`getMatchingText(String[] queries)`	Return a Collection of `PageExtractor.Text` items on this page that are equal to one of the specified substrings.
`Collection<PageExtractor.Text>`	`getMatchingText(String[] queries, boolean caseinsensitive)`	Return a Collection of `PageExtractor.Text` items on this page that are equal to one of the specified substrings.
`Collection<PageExtractor.Text>`	`getMatchingText(Pattern pattern)`	Return a Collection of `PageExtractor.Text` items on this page that match the specified Regular Expression.
`PDFPage`	`getPage()`	Return the `PDFPage` this PageExtractor relates to
`AttributedString`	`getStyledText(PageExtractor.Text first, int firstchar, PageExtractor.Text last, int lastchar, boolean displayorder)`	Deprecated. see `getStyledText(Text, int, Text, int, Comparator)`
`AttributedString`	`getStyledText(PageExtractor.Text first, int firstchar, PageExtractor.Text last, int lastchar, Comparator<PageExtractor.Text> order)`	Return an AttributedString containing a contiguous range of text from this PageExtractor.
`Collection<PageExtractor.Text>`	`getText(Comparator<PageExtractor.Text> comp)`	Return every `PageExtractor.Text` item on the page, in the specified order.
`StringBuffer`	`getText(PageExtractor.Text first, int firstchar, PageExtractor.Text last, int lastchar, boolean displayorder)`	Deprecated. see `getStyledText(Text, int, Text, int, Comparator)`
`StringBuffer`	`getText(PageExtractor.Text first, int firstchar, PageExtractor.Text last, int lastchar, Comparator<PageExtractor.Text> order)`	Return a StringBuffer containing a contiguous range of text from this PageExtractor.
`StringBuffer`	`getTextAsStringBuffer()`	Parse and return all the text on the page as a StringBuffer.
`StringBuffer`	`getTextAsStringBuffer(float x1, float y1, float x2, float y2)`	Parse and return the text in the specified area on the page as a StringBuffer.
`Collection<PageExtractor.Text>`	`getTextInDisplayOrder()`	Return every `PageExtractor.Text` item on the page, in the order they are displayed on the screen - so the first item in the returned collection will be nearest to the top left of the page.
`Collection<PageExtractor.Text>`	`getTextUnordered()`	Return every `PageExtractor.Text` item on the page, in the order they were added to the page.
`boolean`	`isExtracted()`	Return true if the extraction has been run, false otherwise.
`void`	`setOption(String key, Object value)`	Set an option to control text extraction.
`void`	`setSpaceTolerance(double zero, double one, double many)`	Set the "space tolerance" - tunable parameters for the extractor to determine when two adjacent phrases of text are to be separated by zero, one or more than one space.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Detail
- NATURALORDER
```
public static final Comparator<PageExtractor.Text> NATURALORDER
```
  A Comparator which can be used to sort PageExtractor.Text objects into their "natural" order - the order which they occur in the PDF page stream, and the order that is returned by getTextUnordered()
  
  Since:
  
  2.11.26
- DISPLAYORDER
```
public static final Comparator<PageExtractor.Text> DISPLAYORDER
```
  A Comparator which can be used to sort PageExtractor.Text objects into their "display" order - the order which they visibly appear on the page, and the order that is returned by getTextInDisplayOrder()
  
  Since:
  
  2.11.26

Method Detail

isExtracted
```
public boolean isExtracted()
```
Return true if the extraction has been run, false otherwise.

Since:

2.11.7

setOption

```
public void setOption(String key,
                      Object value)
```

Set an option to control text extraction. These options are useful for specific cases, such as when working with the Adobe Highlight File Format but probably won't be required for general use. The following options are recognised.

Option	Value	Description
IgnoreArtifacts	all | regex	Set to all to ignore any text flagged as an "artifact" in the PDF stream, or set to a regular expression to match against artifacts to be ignored.
RawText	true | false	Set to true to prevent any post-processing of the extracted text. If this option is set, `getTextInDisplayOrder`, `getTextAsStringBuffer` and similar methods cannot be used - only `getTextUnordered()` will work.

Since:

2.11.12

setSpaceTolerance
```
public void setSpaceTolerance(double zero,
                              double one,
                              double many)
```
Set the "space tolerance" - tunable parameters for the extractor to determine when two adjacent phrases of text are to be separated by zero, one or more than one space. Typically this won't need to be tuned by the end user, but if you find the spacing between the extracted text is less than ideal, you can tune it to some degree with this method. The perfect value will depend on the font, the language, layout, line justification and kerning.
The values are multipliers of the width of the "space" character - so 1 means "the width of one space". Typically the parameter to tune is "one", within a rough range of 0.25 to 0.7 - reduce it if you find words are being joined together, and increase it if words are being split into two.

Parameters:

zero - how far apart two characters should be in order for them to be joined by zero spaces. The default value is -0.5.

one - at least how far apart two characters must be in order to be joined by one space. The default value is 0.666.

many - at least how far apart two characters must be in order to be considered separate pieces of text. The default value is 1.5.

Since:

2.11.2
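To make the three thresholds concrete, here is a rough sketch of how a gap between two phrases might be classified. It is one plausible reading of the parameter descriptions above, not the library's actual algorithm: the class `SpaceToleranceSketch`, the `Join` enum and the treatment of gaps below "zero" are all assumptions made for illustration.

```java
class SpaceToleranceSketch {
    enum Join { NO_SPACE, ONE_SPACE, SEPARATE }

    // Defaults quoted in the parameter descriptions above.
    static final double ZERO = -0.5, ONE = 0.666, MANY = 1.5;

    /**
     * Classify the gap between two adjacent phrases. The gap is divided by the
     * width of the space character, so the thresholds are multiples of one space.
     * ASSUMPTION: a gap below "zero" (heavy overlap or backwards movement) is
     * treated as separate text; the library's real rule is not documented here.
     */
    static Join classify(double gap, double spaceWidth, double zero, double one, double many) {
        double ratio = gap / spaceWidth;
        if (ratio >= many) return Join.SEPARATE;
        if (ratio >= one) return Join.ONE_SPACE;
        if (ratio >= zero) return Join.NO_SPACE;
        return Join.SEPARATE;
    }

    public static void main(String[] args) {
        double spaceWidth = 2.5; // hypothetical width of the space glyph, in points
        // A gap of half a space: joined with the default "one", split once "one" is lowered.
        System.out.println(classify(1.25, spaceWidth, ZERO, ONE, MANY)); // NO_SPACE
        System.out.println(classify(1.25, spaceWidth, ZERO, 0.4, MANY)); // ONE_SPACE
    }
}
```

Under this reading, lowering "one" from 0.666 to 0.4 turns a half-space gap into a word break, which matches the advice to reduce the value when words are being joined together.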

getImages
```
public Collection<PageExtractor.Image> getImages()
```
Return every PageExtractor.Image on the page, in the order they were added to the page. Some images may be displayed more than once, in which case the value returned by PageExtractor.Image.getImage() will be identical.

Returns:

an unmodifiable collection of PageExtractor.Image elements.
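If a repeated image should be processed only once, the drawn instances can be reduced to their distinct underlying images. The sketch below assumes that "identical" means the same object reference; it uses plain BufferedImage objects as stand-ins for the values returned by PageExtractor.Image.getImage(), and the class name `UniqueImagesSketch` is hypothetical.

```java
import java.awt.image.BufferedImage;
import java.awt.image.RenderedImage;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

class UniqueImagesSketch {
    /** Collect each distinct image object once, comparing by reference rather than equals(). */
    static Set<RenderedImage> unique(Iterable<? extends RenderedImage> drawn) {
        Set<RenderedImage> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        for (RenderedImage img : drawn) {
            seen.add(img);
        }
        return seen;
    }

    public static void main(String[] args) {
        RenderedImage logo = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        RenderedImage photo = new BufferedImage(8, 8, BufferedImage.TYPE_INT_RGB);
        // The logo is drawn twice, so it appears twice in the list but is one object.
        System.out.println(unique(List.of(logo, photo, logo)).size()); // prints 2
    }
}
```

Note that an identity-based set does not preserve the order in which the images were drawn.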

getTextUnordered
```
public Collection<PageExtractor.Text> getTextUnordered()
```
Return every PageExtractor.Text item on the page, in the order they were added to the page. The ordering may not be consistent with the order in which items are positioned on screen.

Returns:

an unmodifiable collection of PageExtractor.Text elements.

getTextInDisplayOrder
```
public Collection<PageExtractor.Text> getTextInDisplayOrder()
```
Return every PageExtractor.Text item on the page, in the order they are displayed on the screen - so the first item in the returned collection will be nearest to the top left of the page.

Returns:

an unmodifiable collection of PageExtractor.Text elements.

getText
```
public Collection<PageExtractor.Text> getText(Comparator<PageExtractor.Text> comp)
```
Return every PageExtractor.Text item on the page, in the specified order.

Parameters:

comp - one of NATURALORDER or DISPLAYORDER

Returns:

an unmodifiable collection of PageExtractor.Text elements.

Since:

2.11.26
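Because the returned collections are unmodifiable, re-sorting them with a Comparator means sorting a copy. The sketch below shows the pattern with stand-in types only: `Item` is a hypothetical substitute for PageExtractor.Text, and `TOP_LEFT_FIRST` is a simplified stand-in for DISPLAYORDER, not the library's actual comparison.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class OrderingSketch {
    /** Hypothetical stand-in for PageExtractor.Text: a string plus its start position. */
    static final class Item {
        final String text;
        final double x, y;
        Item(String text, double x, double y) { this.text = text; this.x = x; this.y = y; }
    }

    // Simplified display order: top of page first, then left to right.
    // ASSUMPTION: y grows upwards, so a larger y means higher on the page.
    static final Comparator<Item> TOP_LEFT_FIRST =
        Comparator.comparingDouble((Item i) -> i.y).reversed()
                  .thenComparingDouble(i -> i.x);

    /** Copy before sorting: the source collection may be unmodifiable. */
    static List<Item> sorted(Collection<Item> items, Comparator<Item> order) {
        List<Item> copy = new ArrayList<>(items);
        copy.sort(order);
        return copy;
    }

    public static void main(String[] args) {
        Collection<Item> streamOrder = Collections.unmodifiableList(List.of(
            new Item("footer", 72, 40), new Item("world", 120, 700), new Item("Hello", 72, 700)));
        for (Item i : sorted(streamOrder, TOP_LEFT_FIRST)) {
            System.out.println(i.text); // Hello, world, footer
        }
    }
}
```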

getMatchingText
```
public Collection<PageExtractor.Text> getMatchingText(String query)
```
Return a Collection of PageExtractor.Text items on this page that are equal to the specified substring. The Text items returned from getTextInDisplayOrder() are searched, and substrings may be extracted from them to create this collection. In that case the co-ordinates of each returned Text item reflect the substring, not the original Text object.
As an example, the following method could be used to search a PDF for a specified word and add a "highlight" annotation over it. The PDF can then be rendered or saved as normal.
```
 // Assumes imports of org.faceless.pdf2.* (PDF, PDFParser, PageExtractor, AnnotationMarkup)
 void highlightWords(PDF pdf, String word) {
   PDFParser parser = new PDFParser(pdf);
   for (int i = 0; i < pdf.getNumberOfPages(); i++) {
     PageExtractor extractor = parser.getPageExtractor(i);
     // Each match is a Text item whose co-ordinates cover just the matched word
     for (PageExtractor.Text text : extractor.getMatchingText(word)) {
       AnnotationMarkup annot = text.createAnnotationMarkup("Highlight");
       text.getPage().getAnnotations().add(annot);
     }
   }
 }
```
Parameters:

query - the String to search for

Returns:

a Collection of PageExtractor.Text objects.

Since:

2.6.12

getMatchingText
```
public Collection<PageExtractor.Text> getMatchingText(String[] queries)
```
Return a Collection of PageExtractor.Text items on this page that are equal to one of the specified substrings. This method runs exactly like getMatchingText(String) but allows more than one substring to be matched.

Parameters:

queries - a list of zero or more Strings to search for

Returns:

a Collection of PageExtractor.Text objects.

Since:

2.8.1

getMatchingText
```
public Collection<PageExtractor.Text> getMatchingText(String[] queries,
                                                      boolean caseinsensitive)
```
Return a Collection of PageExtractor.Text items on this page that are equal to one of the specified substrings. This method runs exactly like getMatchingText(String) but allows more than one substring to be matched, optionally ignoring case.

Parameters:

queries - a list of zero or more Strings to search for

caseinsensitive - whether the search should ignore case

Returns:

a Collection of PageExtractor.Text objects.

Since:

2.11.1

getMatchingNormalizedText
```
public Collection<PageExtractor.Text> getMatchingNormalizedText(String[] queries,
                                                                boolean caseinsensitive)
```
Returns a Collection of PageExtractor.Text objects on this page that match any of the specified substrings, based on normalized text.

Since:

2.28.5

See Also:

getMatchingText(String[], boolean), PageExtractor.Text.getNormalizedText()
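This page does not say which normalization getNormalizedText() applies. As a hedged illustration of why matching on normalized text can find more than matching on raw text, the sketch below uses Unicode NFKC from java.text.Normalizer as a stand-in; the class name `NormalizedMatchSketch` and the choice of NFKC are assumptions, not the library's documented behaviour.

```java
import java.text.Normalizer;

class NormalizedMatchSketch {
    /** ASSUMPTION: NFKC stands in for whatever getNormalizedText() really applies. */
    static String normalize(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "file" typeset with the single-character "fi" ligature, U+FB01
        String extracted = "\uFB01le";
        System.out.println(extracted.contains("file"));            // false: raw text holds the ligature
        System.out.println(normalize(extracted).contains("file")); // true: the ligature is expanded
    }
}
```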

getMatchingText
```
public Collection<PageExtractor.Text> getMatchingText(Pattern pattern)
```
Return a Collection of PageExtractor.Text items on this page that match the specified Regular Expression. This is likely to be more efficient than the version of this method that takes multiple strings.

Parameters:

pattern - the Pattern to search for

Returns:

a Collection of PageExtractor.Text objects.

Since:

2.11
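If you already hold a list of literal search strings, one way to use the Pattern variant is to fold them into a single alternation. The helper below is a sketch using only java.util.regex; the class and method names are hypothetical, and whether the library applies the pattern with find or full-match semantics is not stated on this page.

```java
import java.util.StringJoiner;
import java.util.regex.Pattern;

class QueryPatternSketch {
    /** Build one pattern that matches any of the given literal strings. */
    static Pattern anyOf(String[] queries, boolean caseInsensitive) {
        if (queries.length == 0) {
            return Pattern.compile("(?!)"); // empty negative lookahead: never matches
        }
        StringJoiner alternation = new StringJoiner("|");
        for (String q : queries) {
            alternation.add(Pattern.quote(q)); // quote so that '.', '(' etc. match literally
        }
        int flags = caseInsensitive ? Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE : 0;
        return Pattern.compile(alternation.toString(), flags);
    }

    public static void main(String[] args) {
        Pattern p = anyOf(new String[] {"PDF", "e.g."}, true);
        System.out.println(p.matcher("see the pdf, e.g. page 2").find()); // true
        System.out.println(p.matcher("eggs").find());                     // false
    }
}
```

The resulting Pattern could then be passed to getMatchingText(Pattern) in place of the String[] overloads.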

getMatchingNormalizedText
```
public Collection<PageExtractor.Text> getMatchingNormalizedText(Pattern pattern)
```
Returns a Collection of PageExtractor.Text objects on this page that match the specified regular expression, based on normalized text.

Since:

2.28.5

See Also:

getMatchingText(java.util.regex.Pattern), PageExtractor.Text.getNormalizedText()

getTextAsStringBuffer
```
public StringBuffer getTextAsStringBuffer()
```
Parse and return all the text on the page as a StringBuffer. Text will be converted back to its normalized form, and newlines and spaces will be inserted in an approximation of the original layout.

getTextAsStringBuffer
```
public StringBuffer getTextAsStringBuffer(float x1,
                                          float y1,
                                          float x2,
                                          float y2)
```
Parse and return the text in the specified area on the page as a StringBuffer. Text will be converted back to its normalized form, and newlines and spaces will be inserted in an approximation of the original layout. The co-ordinates define the start position of any phrases that are to be returned.

Parameters:

x1 - the left-most X co-ordinate of the text

y1 - the top-most Y co-ordinate of the text

x2 - the right-most X co-ordinate of the text

y2 - the bottom-most Y co-ordinate of the text

Returns:

a StringBuffer containing all the text within the specified rectangle

getText
```
public StringBuffer getText(PageExtractor.Text first,
                            int firstchar,
                            PageExtractor.Text last,
                            int lastchar,
                            Comparator<PageExtractor.Text> order)
```
Return a StringBuffer containing a contiguous range of text from this PageExtractor. The range is specified by giving a starting and ending PageExtractor.Text object, and the offsets into those strings: this method will then iterate over the Text items in the chosen order (that of getTextUnordered() or getTextInDisplayOrder()) and include the appropriate range in the output. Note that the choice of order matters when selecting the entire range: first and last must be the first and last items of the Collection in that order.

Parameters:

first - the first Text from this PageExtractor to be extracted

firstchar - the index of the first character from "first" to be extracted

last - the last Text from this PageExtractor to be extracted

lastchar - the index after the index of the last character from "last" to be extracted

order - one of NATURALORDER or DISPLAYORDER

Since:

2.10.3
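The offset convention (firstchar inclusive, lastchar exclusive) can be illustrated with plain strings. This sketch uses a List of String and list positions as stand-ins for the ordered Text items; the class `RangeSketch` is hypothetical, and the real method works on PageExtractor.Text objects and may also insert spacing between items.

```java
import java.util.List;

class RangeSketch {
    /**
     * Join the characters from items[firstIdx] at offset firstchar (inclusive)
     * up to items[lastIdx] at offset lastchar (exclusive).
     */
    static String range(List<String> items, int firstIdx, int firstchar, int lastIdx, int lastchar) {
        StringBuilder out = new StringBuilder();
        for (int i = firstIdx; i <= lastIdx; i++) {
            String s = items.get(i);
            int from = (i == firstIdx) ? firstchar : 0;     // skip the start of the first item only
            int to = (i == lastIdx) ? lastchar : s.length(); // cut the end of the last item only
            out.append(s, from, to);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> items = List.of("Hello ", "big ", "world");
        System.out.println(range(items, 0, 1, 2, 3)); // prints "ello big wor"
    }
}
```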

getText

```
@Deprecated
public StringBuffer getText(PageExtractor.Text first,
                            int firstchar,
                            PageExtractor.Text last,
                            int lastchar,
                            boolean displayorder)
```
Deprecated. See getStyledText(Text, int, Text, int, Comparator).

Return a StringBuffer containing a contiguous range of text from this PageExtractor.

Since:

2.10.3

getStyledText
```
public AttributedString getStyledText(PageExtractor.Text first,
                                      int firstchar,
                                      PageExtractor.Text last,
                                      int lastchar,
                                      Comparator<PageExtractor.Text> order)
```
Return an AttributedString containing a contiguous range of text from this PageExtractor. The range is specified by giving a starting and ending PageExtractor.Text object, and the offsets into those strings: this method will then iterate over the Text items in the chosen order (that of getTextUnordered() or getTextInDisplayOrder()) and include the appropriate range in the output. Note that the choice of order matters when selecting the entire range: first and last must be the first and last items of the Collection in that order.

Parameters:

first - the first Text from this PageExtractor to be extracted

firstchar - the index of the first character from "first" to be extracted

last - the last Text from this PageExtractor to be extracted

lastchar - the index after the index of the last character from "last" to be extracted

order - one of NATURALORDER or DISPLAYORDER

Since:

2.11.26
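An AttributedString is usually consumed by walking its attribute runs. The sketch below shows that traversal with standard java.text classes only. Which attributes the library actually sets is not stated on this page, so the bold weight used here is purely illustrative, and `StyledRunsSketch` is a hypothetical name.

```java
import java.awt.font.TextAttribute;
import java.text.AttributedCharacterIterator;
import java.text.AttributedString;
import java.util.ArrayList;
import java.util.List;

class StyledRunsSketch {
    /** Split an AttributedString into runs over which every attribute is constant. */
    static List<String> runs(AttributedString styled) {
        List<String> out = new ArrayList<>();
        AttributedCharacterIterator it = styled.getIterator();
        int pos = it.getBeginIndex();
        while (pos < it.getEndIndex()) {
            it.setIndex(pos);
            int limit = it.getRunLimit(); // index just past the run containing pos
            StringBuilder run = new StringBuilder();
            for (char c = it.current(); it.getIndex() < limit; c = it.next()) {
                run.append(c);
            }
            out.add(run.toString());
            pos = limit;
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for the value returned by getStyledText(...)
        AttributedString s = new AttributedString("bold plain");
        s.addAttribute(TextAttribute.WEIGHT, TextAttribute.WEIGHT_BOLD, 0, 4);
        System.out.println(runs(s)); // prints [bold,  plain]
    }
}
```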

getStyledText

```
@Deprecated
public AttributedString getStyledText(PageExtractor.Text first,
                                      int firstchar,
                                      PageExtractor.Text last,
                                      int lastchar,
                                      boolean displayorder)
```
Deprecated. See getStyledText(Text, int, Text, int, Comparator).

Return an AttributedString containing a contiguous range of text from this PageExtractor.

Since:

2.11.19

getPage
```
public PDFPage getPage()
```
Return the PDFPage this PageExtractor relates to

Since:

2.10.3

cropText
```
public static Collection<PageExtractor.Text> cropText(Collection<PageExtractor.Text> all,
                                                      Shape shape)
```
Given a Collection of PageExtractor.Text items, as returned by getMatchingText(), getTextUnordered() or getTextInDisplayOrder(), return a new Collection which contains only Text that falls completely inside the specified Shape. For example, to get all the text in a specific rectangle:
```
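 Shape rect = new Rectangle2D.Float(0, 0, 100, 100);
 // cropText is static, so it is called on the class rather than on the extractor
 Collection<PageExtractor.Text> all = PageExtractor.cropText(extractor.getTextUnordered(), rect);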
```
Parameters:

all - a Collection of Text objects

shape - the Shape to trim the text to

Returns:

a new Collection of Text items where every item is completely inside the specified shape

Since:

2.11.8
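The phrase "completely inside" is the key distinction: text that merely overlaps the shape is dropped. The sketch below illustrates that rule with java.awt.geom alone, using a Rectangle2D as a stand-in for the bounds of a Text item. How the library derives those bounds is not described on this page, and `ContainmentSketch` is a hypothetical name.

```java
import java.awt.Shape;
import java.awt.geom.Rectangle2D;

class ContainmentSketch {
    /** True only if the whole bounding box lies inside the shape, not merely overlapping it. */
    static boolean completelyInside(Shape region, Rectangle2D textBounds) {
        return region.contains(textBounds);
    }

    public static void main(String[] args) {
        Shape region = new Rectangle2D.Float(0, 0, 100, 100);
        Rectangle2D inside = new Rectangle2D.Float(10, 10, 50, 12);
        Rectangle2D straddling = new Rectangle2D.Float(80, 10, 50, 12); // extends to x = 130

        System.out.println(completelyInside(region, inside));     // true
        System.out.println(completelyInside(region, straddling)); // false
        System.out.println(region.intersects(straddling));        // true: it overlaps, but is not contained
    }
}
```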
