Package org.faceless.pdf2
Class PageExtractor.Text
- java.lang.Object
- org.faceless.pdf2.PageExtractor.Text
- All Implemented Interfaces:
- Comparable<PageExtractor.Text>
- Enclosing class:
- PageExtractor
public abstract class PageExtractor.Text extends Object implements Comparable<PageExtractor.Text>
A class representing a piece of text extracted from the PageExtractor. Each text object has a location on the page, font size, font name, color and text.
- Since:
- 2.6.2
-
Constructor Summary
Text()
-
Method Summary
AnnotationMarkup createAnnotationMarkup(String type)
Create a new AnnotationMarkup of the specified type to cover this text.
float getAngle()
Return the angle of rotation of this text on the page, in degrees clockwise from 12 o'clock.
abstract float getBaseline()
Return the baseline of the text item, as a fraction between 0 and 1. 0 would indicate the baseline is at the top of the text, 1 at the absolute bottom.
abstract int getByteLength()
Get the length of the original text in bytes.
abstract int getByteToCharOffset(int byteoffset)
Given a byte offset into the original String, return the character offset it refers to.
abstract Paint getColor()
Return the color of this text, or null if none was set.
float[] getCorners()
Return the four corners (x1,y1) (x2,y2) (x3,y3) (x4,y4) of the quadrilateral that encompasses the text.
abstract float getEndOffset(int pos)
As for getOffset(), but return the end position of that letter.
abstract Reader getFontMetaData()
Return any XMP MetaData that has been set on the Font, or null if none exists.
abstract String getFontName()
Return the font name of this text.
abstract float getFontSize()
Return the font size of this text in points.
abstract float getHorizontalScale()
Return an indication of the horizontal scale of the text.
float getLength()
Return the length of this Text in points.
abstract Paint getLineColor()
Return the outline color of this text, or null if none was set.
abstract String getNormalizedText()
Return a normalized form of the text, for text comparison purposes while searching.
abstract float getOffset(int pos)
Given an offset into the text, return the start position of that letter.
PDFPage getPage()
Return the PDFPage this text was found on, which is simply the page the parent PageExtractor was created from.
PageExtractor getPageExtractor()
Return the PageExtractor this text was created from.
abstract PageExtractor.Text getPrimaryText()
If this text is a subtext or a collection of Text objects, return the primary text it starts with.
abstract int getPrimaryTextOffset()
If this text is a subtext or a collection of Text objects, return the offset into the primary text where it starts.
abstract PageExtractor.Text getRowNext()
Return the next Text item in this row, or null if there are none.
abstract PageExtractor.Text getRowPrevious()
Return the previous Text item in this row, or null if there are none.
abstract PageExtractor.Text getSubText(int off, int len)
Return a substring of this Text object as another Text object.
abstract String getText()
Return the text content of this text.
abstract int getTextLength()
Return the length of the String returned by getText().
abstract Shape getVisualBounds()
Return the visual bounds of the specified character in the string.
abstract boolean isHorizontal()
Indicates whether this text is horizontal or vertical.
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface java.lang.Comparable
compareTo
-
Method Detail
-
getLength
public float getLength()
Return the length of this Text in points. This method measures the baseline of the text, so for rotated text the value will always be positive regardless of the angle.
- Returns:
- the length of the text in points at its baseline
-
getCorners
public final float[] getCorners()
Return the four corners (x1,y1) (x2,y2) (x3,y3) (x4,y4) of the quadrilateral that encompasses the text. The order of these corners is as follows. For horizontal text: bottom-left, top-left, top-right, bottom-right. For vertical text: top-left, top-right, bottom-right, bottom-left. For horizontal text, the text baseline runs from (x1,y1) to (x4,y4).
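The corner ordering above can be used to derive simple geometry from the returned array. The following is a minimal sketch, assuming the array is laid out as {x1, y1, x2, y2, x3, y3, x4, y4} (the layout is not stated here) and that the baseline runs from (x1,y1) to (x4,y4) as described for horizontal text. The class CornerMath and its methods are illustrative helpers, not part of the library.

```java
/** Illustrative helper, not part of org.faceless.pdf2. */
final class CornerMath {
    private CornerMath() {}

    /**
     * Length of the segment from (x1,y1) to (x4,y4), which the text above
     * describes as the baseline for horizontal text.
     * Assumes corners = {x1, y1, x2, y2, x3, y3, x4, y4}.
     */
    static double baselineLength(float[] corners) {
        requireEight(corners);
        double dx = corners[6] - corners[0];
        double dy = corners[7] - corners[1];
        return Math.hypot(dx, dy);
    }

    /** Axis-aligned bounding box {minX, minY, maxX, maxY} of the four corners. */
    static float[] boundingBox(float[] corners) {
        requireEight(corners);
        float minX = corners[0], maxX = corners[0];
        float minY = corners[1], maxY = corners[1];
        for (int i = 2; i < 8; i += 2) {
            minX = Math.min(minX, corners[i]);
            maxX = Math.max(maxX, corners[i]);
            minY = Math.min(minY, corners[i + 1]);
            maxY = Math.max(maxY, corners[i + 1]);
        }
        return new float[] { minX, minY, maxX, maxY };
    }

    private static void requireEight(float[] corners) {
        if (corners == null || corners.length != 8) {
            throw new IllegalArgumentException("expected 8 values: x1,y1 .. x4,y4");
        }
    }
}
```

With a real Text object, the input would be the result of text.getCorners(). For unrotated text, the bounding box coincides with the quadrilateral itself.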
-
createAnnotationMarkup
public AnnotationMarkup createAnnotationMarkup(String type)
Create a new AnnotationMarkup of the specified type to cover this text. The annotation is not added to the page.
- Parameters:
type
- the type of markup: "Highlight", "Underline" etc.
- Since:
- 2.8
-
getAngle
public final float getAngle()
Return the angle of rotation of this text on the page, in degrees clockwise from 12 o'clock. Most text is not rotated and so will return 0.
- Returns:
- the angle of the text
-
getFontSize
public abstract float getFontSize()
Return the font size of this text in points
-
isHorizontal
public abstract boolean isHorizontal()
Indicates whether this text is horizontal or vertical. Note that vertical text will never be successfully positioned by the methods on this class that attempt to convert PDF text content into plain text.
- Since:
- 2.18.3
-
getHorizontalScale
public abstract float getHorizontalScale()
Return an indication of the horizontal scale of the text. Typically this will be a value of 1; a value of 2 would mean the text had been stretched to double its natural width.
- Since:
- 2.18.1
-
getBaseline
public abstract float getBaseline()
Return the baseline of the text item, as a fraction between 0 and 1. 0 would indicate the baseline is at the top of the text, 1 at the absolute bottom. The value will normally be 0.8.
- Since:
- 2.11.7
-
getOffset
public abstract float getOffset(int pos)
Given an offset into the text, return the start position of that letter. Because text may not be on a horizontal line, this value is returned as a float in the range 0 to 1 (0 being the start of the text, 1 being the end). For the common case where text is horizontal, you can calculate its start position like so:
float left = text.getCorners()[0] + (text.getOffset(pos) * text.getLength());
- Parameters:
pos
- the position of the letter in the Text to retrieve the position for, in the range 0 to getText().length() - 1
- Since:
- 2.6.12
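For text that is not on a horizontal line, the fractional offset can be turned into a page coordinate by interpolating between the two ends of the baseline. This is a minimal sketch under two assumptions that are not confirmed here: that the corner array is laid out as {x1, y1, x2, y2, x3, y3, x4, y4} with the baseline running from (x1,y1) to (x4,y4), and that the fraction is measured linearly along that baseline. BaselinePoint is a hypothetical helper, not a library class.

```java
/** Hypothetical helper, not part of org.faceless.pdf2. */
final class BaselinePoint {
    private BaselinePoint() {}

    /**
     * Interpolate a point on the baseline.
     *
     * @param corners  assumed layout {x1, y1, x2, y2, x3, y3, x4, y4}
     * @param fraction a value in the range 0 to 1, such as one returned
     *                 by getOffset(pos) or getEndOffset(pos)
     * @return {x, y} of the point at that fraction along (x1,y1)-(x4,y4)
     */
    static float[] pointAt(float[] corners, float fraction) {
        if (corners == null || corners.length != 8) {
            throw new IllegalArgumentException("expected 8 values: x1,y1 .. x4,y4");
        }
        if (fraction < 0f || fraction > 1f) {
            throw new IllegalArgumentException("fraction must be in the range 0 to 1");
        }
        float x = corners[0] + fraction * (corners[6] - corners[0]);
        float y = corners[1] + fraction * (corners[7] - corners[1]);
        return new float[] { x, y };
    }
}
```

For unrotated text the x value reduces to corners[0] plus the fraction times the baseline length, which matches the horizontal-case formula given above.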
-
getEndOffset
public abstract float getEndOffset(int pos)
As for getOffset(), but return the end position of that letter.
- Since:
- 2.16.1
-
getPage
public PDFPage getPage()
Return the PDFPage this text was found on, which is simply the page the parent PageExtractor was created from.
- Since:
- 2.6.12
-
getPageExtractor
public PageExtractor getPageExtractor()
Return the PageExtractor this text was created from.
- Since:
- 2.10.3
-
getColor
public abstract Paint getColor()
Return the color of this text, or null if none was set.
- Returns:
- the color
-
getLineColor
public abstract Paint getLineColor()
Return the outline color of this text, or null if none was set.
- Returns:
- the outline color
- Since:
- 2.17.1
-
getFontName
public abstract String getFontName()
Return the font name of this text.
- Returns:
- the name of the font
-
getText
public abstract String getText()
Return the text content of this text.
- Returns:
- the text
-
getNormalizedText
public abstract String getNormalizedText()
Return a normalized form of the text, for text comparison purposes while searching. Normalization is done by converting to NFKD form and removing all diacritics.
- Returns:
- the normalized text
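The normalization described above can be approximated with the JDK alone, which is useful for preparing a search term in the same form. This is a minimal sketch, not the library's implementation: it assumes that "removing all diacritics" means dropping every Unicode combining mark after NFKD decomposition. SearchNormalizer is an illustrative name.

```java
import java.text.Normalizer;

/** Illustrative helper, not part of org.faceless.pdf2. */
final class SearchNormalizer {
    private SearchNormalizer() {}

    /**
     * Decompose to NFKD, then strip combining marks (Unicode category M).
     * Assumption: this approximates the normalization described for
     * getNormalizedText(); the library's exact rules may differ.
     */
    static String normalize(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

A search term normalized this way could then be compared against getNormalizedText() rather than getText(), so that accented and unaccented spellings match.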
-
getTextLength
public abstract int getTextLength()
Return the length of the String returned by getText().
- Since:
- 2.11.7
-
getRowNext
public abstract PageExtractor.Text getRowNext()
Return the next Text item in this row, or null if there are none.
- Since:
- 2.10.3
-
getRowPrevious
public abstract PageExtractor.Text getRowPrevious()
Return the previous Text item in this row, or null if there are none.
- Since:
- 2.10.3
-
getFontMetaData
public abstract Reader getFontMetaData() throws IOException
Return any XMP MetaData that has been set on the Font, or null if none exists. Since 2.24.3, the returned type is guaranteed to have a toString() method that will return the Metadata as a String.
- Throws:
IOException
- Since:
- 2.11.6
- See Also:
PDF.getMetaData()
-
getSubText
public abstract PageExtractor.Text getSubText(int off, int len)
Return a substring of this Text object as another Text object.
- Parameters:
off
- the offset into the text
len
- the number of characters to return
- Since:
- 2.11.7
-
getPrimaryText
public abstract PageExtractor.Text getPrimaryText()
If this text is a subtext or a collection of Text objects, return the primary text it starts with. If not, returns null.
- Since:
- 2.11.7
-
getPrimaryTextOffset
public abstract int getPrimaryTextOffset()
If this text is a subtext or a collection of Text objects, return the offset into the primary text where it starts. If not, returns 0.
- Since:
- 2.11.7
-
getByteLength
public abstract int getByteLength()
Get the length of the original text in bytes. This method is required because the Highlight File Format contains references to the byte offset into the string, not the character offset (as it states).
- Since:
- 2.11.12
-
getByteToCharOffset
public abstract int getByteToCharOffset(int byteoffset)
Given a byte offset into the original String, return the character offset it refers to.
- Since:
- 2.11.12
- See Also:
getByteLength()
-
getVisualBounds
public abstract Shape getVisualBounds()
Return the visual bounds of the specified character in the string. This should be a rectangular shape which just clips the visual edges of the glyph. If the text is rotated, it will be a generic shape, but if the text is horizontal the shape will be a Rectangle2D object.
- Since:
- 2.16.1
-
-