Package org.faceless.pdf2
Class PageExtractor.Text
- java.lang.Object
-
- org.faceless.pdf2.PageExtractor.Text
-
- All Implemented Interfaces:
Comparable<PageExtractor.Text>
- Enclosing class:
- PageExtractor
public abstract class PageExtractor.Text extends Object implements Comparable<PageExtractor.Text>
A class representing a piece of text which is extracted from the PageExtractor. Each text object has a location on the page, font size, font name, color and text.
- Since:
- 2.6.2
-
-
Constructor Summary
- Text()
-
Method Summary
- AnnotationMarkup createAnnotationMarkup(String type): Create a new AnnotationMarkup of the specified type to cover this text.
- float getAngle(): Return the angle of rotation of this text on the page, in degrees clockwise from 12 o'clock.
- abstract float getBaseline(): Return the baseline of the text item, as a fraction between 0 and 1. 0 would indicate the baseline is at the top of the text, 1 at the absolute bottom.
- abstract int getByteLength(): Get the length of the original text in bytes.
- abstract int getByteToCharOffset(int byteoffset): Given a byte offset into the original String, return the character offset it refers to.
- abstract Paint getColor(): Return the color of this text, or null if none was set.
- float[] getCorners(): Return the four corners (x1,y1) (x2,y2) (x3,y3) (x4,y4) of the quadrilateral that encompasses the text.
- abstract float getEndOffset(int pos): As for getOffset() but return the end position of that letter.
- abstract Reader getFontMetaData(): Return any XMP MetaData that has been set on the Font, or null if none exists.
- abstract String getFontName(): Return the font name of this text.
- abstract float getFontSize(): Return the font size of this text in points.
- abstract float getHorizontalScale(): Return an indication of the horizontal scale of the text.
- float getLength(): Return the length of this Text in points.
- abstract Paint getLineColor(): Return the outline color of this text, or null if none was set.
- abstract String getNormalizedText(): Return a normalized form of the text, for text comparison purposes while searching.
- abstract float getOffset(int pos): Given an offset into the text, return the start position of that letter.
- PDFPage getPage(): Return the PDFPage this text was found on - simply the page the parent PageExtractor was created from.
- PageExtractor getPageExtractor(): Return the PageExtractor this text was created from.
- abstract PageExtractor.Text getPrimaryText(): If this text is a subtext or collection of Text objects, return the primary text it starts with.
- abstract int getPrimaryTextOffset(): If this text is a subtext or collection of Text objects, return the offset into the primary text where it starts.
- abstract PageExtractor.Text getRowNext(): Return the next Text item in this row, or null if there are none.
- abstract PageExtractor.Text getRowPrevious(): Return the previous Text item in this row, or null if there are none.
- abstract PageExtractor.Text getSubText(int off, int len): Return a substring of this Text object as another Text object.
- abstract String getText(): Return the text content of this text.
- abstract int getTextLength(): Return the length of the String returned by getText().
- abstract Shape getVisualBounds(): Return the visual bounds of the specified character in the string.
- abstract boolean isHorizontal(): Indicates whether this text is horizontal or vertical.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface java.lang.Comparable
compareTo
-
-
-
-
Method Detail
-
getLength
public float getLength()
Return the length of this Text in points. This method measures the baseline of the text, so for rotated text the value will always be positive regardless of the angle.
- Returns:
- the length of the text in points at its baseline
-
getCorners
public final float[] getCorners()
Return the four corners (x1,y1) (x2,y2) (x3,y3) (x4,y4) of the quadrilateral that encompasses the text. The order of these corners is as follows. For horizontal text: bottom-left, top-left, top-right, bottom-right. For vertical text: top-left, top-right, bottom-right, bottom-left. For horizontal text, the text baseline runs from (x1,y1) to (x4,y4).
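As a rough illustration of the corner ordering for horizontal text, the sketch below measures the baseline from (x1,y1) to (x4,y4). It assumes the array is laid out as [x1, y1, x2, y2, x3, y3, x4, y4], which this page does not state explicitly; the class name, method name and coordinates are hypothetical.

```java
class CornerSketch {
    // Assumption: corners are packed as [x1, y1, x2, y2, x3, y3, x4, y4].
    // For horizontal text the baseline runs from (x1,y1) to (x4,y4).
    static float baselineLength(float[] corners) {
        float dx = corners[6] - corners[0];
        float dy = corners[7] - corners[1];
        return (float) Math.hypot(dx, dy);
    }

    public static void main(String[] args) {
        // Hypothetical corners: bottom-left, top-left, top-right, bottom-right.
        float[] corners = {100f, 700f, 100f, 712f, 160f, 712f, 160f, 700f};
        System.out.println(baselineLength(corners));
    }
}
```

Because the distance is taken along the baseline, the result stays positive for rotated text too, which matches the description of getLength().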
-
createAnnotationMarkup
public AnnotationMarkup createAnnotationMarkup(String type)
Create a new AnnotationMarkup of the specified type to cover this text. The annotation is not added to the page.
- Parameters:
type - the type of markup - "Highlight", "Underline" etc.
- Since:
- 2.8
-
getAngle
public final float getAngle()
Return the angle of rotation of this text on the page, in degrees clockwise from 12 o'clock. Most text is not rotated and so will return 0.
- Returns:
- the angle of the text
-
getFontSize
public abstract float getFontSize()
Return the font size of this text in points
-
isHorizontal
public abstract boolean isHorizontal()
Indicates whether this text is horizontal or vertical. Note that vertical text will never be successfully positioned in the methods on this class that attempt to convert PDF text content into plain text.
- Since:
- 2.18.3
-
getHorizontalScale
public abstract float getHorizontalScale()
Return an indication of the horizontal scale of the text. Typically this will be a value of 1; a value of 2 would mean the text had been stretched to double its natural width.
- Since:
- 2.18.1
-
getBaseline
public abstract float getBaseline()
Return the baseline of the text item, as a fraction between 0 and 1. 0 would indicate the baseline is at the top of the text, 1 at the absolute bottom. The value will normally be 0.8.
- Since:
- 2.11.7
-
getOffset
public abstract float getOffset(int pos)
Given an offset into the text, return the start position of that letter. Because text may not be on a horizontal line, this value is returned as a float in the range 0 to 1 (0 being at the start of the text, 1 being the end). For the common case where text is horizontal, you can calculate its start position like so:
float left = text.getCorners()[0] + (text.getOffset(pos) * text.getLength());
- Parameters:
pos - the position of the letter in the Text to retrieve the position for, in the range 0 to getText().length() - 1
- Since:
- 2.6.12
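The formula for horizontal text can be tried out without a PDF by substituting plain numbers for the three method results. This is only a sketch: the class name, method name and values are hypothetical stand-ins for getCorners()[0], getOffset(pos) and getLength().

```java
class LetterStartSketch {
    // Mirrors: left = getCorners()[0] + (getOffset(pos) * getLength())
    // x1     stands in for getCorners()[0]
    // offset stands in for getOffset(pos), a fraction from 0 to 1
    // length stands in for getLength(), in points
    static float letterLeft(float x1, float offset, float length) {
        return x1 + (offset * length);
    }

    public static void main(String[] args) {
        // Hypothetical text starting at x=100, 60 points long,
        // with the letter of interest a quarter of the way along.
        System.out.println(letterLeft(100f, 0.25f, 60f)); // prints 115.0
    }
}
```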
-
getEndOffset
public abstract float getEndOffset(int pos)
As for getOffset() but return the end position of that letter.
- Since:
- 2.16.1
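One possible use of the start and end fractions together is to find the horizontal extent of a run of letters in horizontal text. The sketch below assumes getEndOffset() returns a fraction on the same 0 to 1 scale as getOffset(); the names and numbers are hypothetical stand-ins for values read from a Text object.

```java
class LetterRangeSketch {
    // Returns {left, right} in points for a run of letters in horizontal text.
    // startFraction stands in for getOffset(firstPos),
    // endFraction   stands in for getEndOffset(lastPos),
    // x1 for getCorners()[0] and length for getLength().
    static float[] extent(float x1, float length, float startFraction, float endFraction) {
        float left = x1 + (startFraction * length);
        float right = x1 + (endFraction * length);
        return new float[] {left, right};
    }

    public static void main(String[] args) {
        float[] range = extent(100f, 60f, 0.25f, 0.5f);
        System.out.println(range[0] + " to " + range[1]);
    }
}
```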
-
getPage
public PDFPage getPage()
Return the PDFPage this text was found on - simply the page the parent PageExtractor was created from.
- Since:
- 2.6.12
-
getPageExtractor
public PageExtractor getPageExtractor()
Return the PageExtractor this text was created from.
- Since:
- 2.10.3
-
getColor
public abstract Paint getColor()
Return the color of this text, or null if none was set.
- Returns:
- the color
-
getLineColor
public abstract Paint getLineColor()
Return the outline color of this text, or null if none was set.
- Returns:
- the outline color
- Since:
- 2.17.1
-
getFontName
public abstract String getFontName()
Return the font name of this text.
- Returns:
- the name of the font
-
getText
public abstract String getText()
Return the text content of this text.
- Returns:
- the text
-
getNormalizedText
public abstract String getNormalizedText()
Return a normalized form of the text, for text comparison purposes while searching. Normalization is done by converting to NFKD form and removing all diacritics.
- Returns:
- the normalized text
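To give a feel for what NFKD conversion followed by diacritic removal does to a search term, here is a minimal sketch using the standard java.text.Normalizer. It is not the library's implementation, which may differ in detail; the class and method names are hypothetical, and treating "diacritics" as all Unicode combining marks is an assumption.

```java
import java.text.Normalizer;

class NormalizeSketch {
    // Decompose to NFKD, then drop combining marks (Unicode category M).
    static String normalize(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "Café" with a precomposed e-acute, and "fiancée" starting with the fi ligature.
        System.out.println(normalize("Caf\u00e9 \ufb01anc\u00e9e"));
    }
}
```

Normalizing a search term the same way before comparing it against the normalized text lets "Cafe" match "Café", and lets a ligature match its separate letters.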
-
getTextLength
public abstract int getTextLength()
Return the length of the String returned by getText().
- Since:
- 2.11.7
-
getRowNext
public abstract PageExtractor.Text getRowNext()
Return the next Text item in this row, or null if there are none.
- Since:
- 2.10.3
-
getRowPrevious
public abstract PageExtractor.Text getRowPrevious()
Return the previous Text item in this row, or null if there are none.
- Since:
- 2.10.3
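getRowNext() and getRowPrevious() together let a row be walked like a doubly linked list. The sketch below shows that traversal with a hypothetical stand-in class, Item, that has only the three methods involved; it assumes getRowPrevious() returns the preceding item, as its name suggests, and joining items with a single space is an illustrative choice rather than anything the library prescribes.

```java
class RowWalkSketch {
    // Hypothetical stand-in for PageExtractor.Text, with only the methods used here.
    static final class Item {
        final String text;
        Item next;
        Item prev;
        Item(String text) { this.text = text; }
        String getText() { return text; }
        Item getRowNext() { return next; }
        Item getRowPrevious() { return prev; }
    }

    // Builds a linked row from the given strings and returns the items in order.
    static Item[] link(String... texts) {
        Item[] items = new Item[texts.length];
        for (int i = 0; i < texts.length; i++) {
            items[i] = new Item(texts[i]);
            if (i > 0) {
                items[i].prev = items[i - 1];
                items[i - 1].next = items[i];
            }
        }
        return items;
    }

    // Rewinds to the first item in the row, then joins every item's text.
    static String rowText(Item any) {
        Item first = any;
        while (first.getRowPrevious() != null) {
            first = first.getRowPrevious();
        }
        StringBuilder sb = new StringBuilder();
        for (Item t = first; t != null; t = t.getRowNext()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(t.getText());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Item[] row = link("Total", "due:", "42.00");
        System.out.println(rowText(row[1])); // starts from the middle item
    }
}
```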
-
getFontMetaData
public abstract Reader getFontMetaData() throws IOException
Return any XMP MetaData that has been set on the Font, or null if none exists. Since 2.24.3, the returned type is guaranteed to have a toString() method that will return the Metadata as a String.
- Throws:
IOException
- Since:
- 2.11.6
- See Also:
PDF.getMetaData()
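For versions where toString() cannot be relied upon, a Reader can be drained into a String by hand. The helper below is a generic sketch using only java.io; the class and method names are hypothetical, and a StringReader stands in for the value returned by getFontMetaData(). It passes a null Reader through as null, matching the "none exists" case.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ReaderDrainSketch {
    // Reads everything from the Reader, closes it, and returns the content.
    // A null Reader (no metadata) yields null.
    static String readAll(Reader reader) throws IOException {
        if (reader == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try (Reader r = reader) {
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a Reader returned by getFontMetaData().
        Reader metadata = new StringReader("<x:xmpmeta xmlns:x=\"adobe:ns:meta/\"/>");
        System.out.println(readAll(metadata));
    }
}
```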
-
getSubText
public abstract PageExtractor.Text getSubText(int off, int len)
Return a substring of this Text object as another Text object.
- Parameters:
off - the offset into the text
len - the number of characters to return
- Since:
- 2.11.7
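Note that the second parameter is a character count, not an end index as in String.substring(begin, end). The plain-String analogue below is only meant to show that difference; the class and method names are hypothetical and no Text object is involved.

```java
class SubTextSketch {
    // Plain-String analogue of an (offset, length) style substring:
    // "len" counts characters, so the end index is off + len.
    static String subText(String text, int off, int len) {
        return text.substring(off, off + len);
    }

    public static void main(String[] args) {
        System.out.println(subText("Hello world", 6, 5)); // prints world
    }
}
```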
-
getPrimaryText
public abstract PageExtractor.Text getPrimaryText()
If this text is a subtext or collection of Text objects, return the primary text it starts with. If not, returns null.
- Since:
- 2.11.7
-
getPrimaryTextOffset
public abstract int getPrimaryTextOffset()
If this text is a subtext or collection of Text objects, return the offset into the primary text where it starts. If not, returns 0.
- Since:
- 2.11.7
-
getByteLength
public abstract int getByteLength()
Get the length of the original text in bytes. This method is required because the Highlight File Format contains references to the byte offset into the string, not the character offset (as it states).
- Since:
- 2.11.12
-
getByteToCharOffset
public abstract int getByteToCharOffset(int byteoffset)
Given a byte offset into the original String, return the character offset it refers to.
- Since:
- 2.11.12
- See Also:
getByteLength()
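The difference between byte offsets and character offsets only shows up once the text contains multi-byte characters. The sketch below illustrates the idea under a loud assumption: that the bytes are UTF-8. This page does not say which encoding the original text uses, and PDF text is often stored in font-specific encodings, so treat this as a model of the concept rather than of the library's behavior. All names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class ByteOffsetSketch {
    // ASSUMPTION: UTF-8 encoding, chosen purely for illustration.
    static int byteLength(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of UTF-8 bytes needed for one code point.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    // Walks the string, counting bytes, until the byte offset is reached.
    // An offset that falls inside a multi-byte character maps to the
    // character offset just after that character.
    static int byteToCharOffset(String text, int byteOffset) {
        int bytes = 0;
        int i = 0;
        while (i < text.length() && bytes < byteOffset) {
            int cp = text.codePointAt(i);
            bytes += utf8Length(cp);
            i += Character.charCount(cp);
        }
        return i;
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo"; // "héllo": 5 characters, 6 bytes in UTF-8
        System.out.println(byteLength(text));
        System.out.println(byteToCharOffset(text, 3));
    }
}
```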
-
getVisualBounds
public abstract Shape getVisualBounds()
Return the visual bounds of the specified character in the string. This should be a rectangular shape which just clips the visual edges of the glyph. If the text is rotated, it will be a generic shape, but if the text is horizontal the shape will be a Rectangle2D object.
- Since:
- 2.16.1
-
-