Class PageExtractor.Text

  • All Implemented Interfaces:
    java.lang.Comparable<PageExtractor.Text>
    Enclosing class:
    PageExtractor

    public abstract class PageExtractor.Text
    extends java.lang.Object
    implements java.lang.Comparable<PageExtractor.Text>
    A class representing a piece of text extracted by the PageExtractor. Each Text object has a location on the page, a font size, a font name, a color and its text content.
    Since:
    2.6.2
    • Constructor Summary

      Constructors 
      Constructor Description
      Text()  
    • Method Summary

      Modifier and Type Method Description
      AnnotationMarkup createAnnotationMarkup​(java.lang.String type)
      Create a new AnnotationMarkup of the specified type to cover this text.
      float getAngle()
      Return the angle of rotation of this text on the page, in degrees clockwise from 12 o'clock.
      abstract float getBaseline()
      Return the baseline of the text item, as a fraction between 0 and 1. 0 would indicate the baseline is at the top of the text, 1 at the absolute bottom.
      abstract int getByteLength()
      Get the length of the original text in bytes.
      abstract int getByteToCharOffset​(int byteoffset)
      Given a byte offset into the original String, return the Character offset it refers to.
      abstract java.awt.Paint getColor()
      Return the color of this text, or null if none was set
      float[] getCorners()
      Return the four corners (x1,y1) (x2,y2) (x3,y3) (x4,y4) of the quadrilateral that encompasses the text.
      abstract float getEndOffset​(int pos)
      As for getOffset() but return the end position of that letter
      abstract java.io.Reader getFontMetaData()
      Return any XMP MetaData that has been set on the Font, or null if none exists.
      abstract java.lang.String getFontName()
      Return the font name of this text
      abstract float getFontSize()
      Return the font size of this text in points
      abstract float getHorizontalScale()
      Return an indication of the horizontal scale of the text.
      float getLength()
      Return the length of this Text in points.
      abstract java.awt.Paint getLineColor()
      Return the outline color of this text, or null if none was set
      abstract java.lang.String getNormalizedText()
      Return a normalized form of the text, for text comparison purposes while searching.
      abstract float getOffset​(int pos)
      Given an offset into the text, return the start position of that letter.
      PDFPage getPage()
      Return the PDFPage this text was found on - simply the page the parent PageExtractor was created from.
      PageExtractor getPageExtractor()
      Return the PageExtractor this text was created from
      abstract PageExtractor.Text getPrimaryText()
      If this text is a subtext or collection of Text objects, return the primary text it starts with.
      abstract int getPrimaryTextOffset()
      If this text is a subtext or collection of Text objects, return the offset into the primary text where it starts.
      abstract PageExtractor.Text getRowNext()
      Return the next Text item in this row, or null if there are none
      abstract PageExtractor.Text getRowPrevious()
      Return the previous Text item in this row, or null if there are none
      abstract PageExtractor.Text getSubText​(int off, int len)
      Return a substring of this Text object as another Text object
      abstract java.lang.String getText()
      Return the text content of this text
      abstract int getTextLength()
      Return the length of the String returned by getText()
      abstract java.awt.Shape getVisualBounds()
      Return the visual bounds of the specified character in the string.
      abstract boolean isHorizontal()
      Indicates whether this text is horizontal or vertical.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
      • Methods inherited from interface java.lang.Comparable

        compareTo
    • Constructor Detail

      • Text

        public Text()
    • Method Detail

      • getLength

        public float getLength()
        Return the length of this Text in points. This method measures the baseline of the text, so for rotated text the value will always be positive regardless of the angle.
        Returns:
        the length of the text in points at its baseline
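        To illustrate why the value is always positive, here is a minimal sketch that measures the straight-line distance between the two baseline endpoints. It assumes horizontal (possibly rotated) text, for which the getCorners() description places the baseline between (x1,y1) and (x4,y4). LengthSketch and baselineLength are hypothetical names used only for illustration, and the library may compute the length differently.

```java
// Hypothetical helper, not part of the library.
final class LengthSketch {
    // corners is assumed to hold {x1,y1,x2,y2,x3,y3,x4,y4} in the order
    // documented for getCorners(); the baseline is taken as (x1,y1)-(x4,y4).
    static float baselineLength(float[] corners) {
        if (corners == null || corners.length != 8) {
            throw new IllegalArgumentException("expected 8 values: x1,y1 ... x4,y4");
        }
        float dx = corners[6] - corners[0];
        float dy = corners[7] - corners[1];
        // A Euclidean distance is never negative, whatever the direction of the text.
        return (float) Math.hypot(dx, dy);
    }

    public static void main(String[] args) {
        float[] rotated = {0f, 0f, -4f, 3f, -1f, 7f, 3f, 4f};
        System.out.println(baselineLength(rotated));
    }
}
```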
      • getCorners

        public final float[] getCorners()
        Return the four corners (x1,y1) (x2,y2) (x3,y3) (x4,y4) of the quadrilateral that encompasses the text. The order of these corners is as follows. For horizontal text: bottom-left, top-left, top-right, bottom-right. For vertical text: top-left, top-right, bottom-right, bottom-left. For horizontal text, the text baseline runs from (x1,y1) to (x4,y4).
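        As a sketch of how the returned array might be consumed, the following code reduces the eight values to an axis-aligned bounding box, which holds regardless of which corner order applies. CornerBounds and boundsOf are hypothetical names, not part of the library, and the {x1,y1,...,x4,y4} array layout is assumed from the description above.

```java
// Hypothetical helper, not part of the library.
final class CornerBounds {
    // Returns {minX, minY, maxX, maxY} for an array laid out as
    // {x1,y1,x2,y2,x3,y3,x4,y4}.
    static float[] boundsOf(float[] corners) {
        if (corners == null || corners.length != 8) {
            throw new IllegalArgumentException("expected 8 values: x1,y1 ... x4,y4");
        }
        float minX = corners[0], maxX = corners[0];
        float minY = corners[1], maxY = corners[1];
        for (int i = 2; i < corners.length; i += 2) {
            minX = Math.min(minX, corners[i]);
            maxX = Math.max(maxX, corners[i]);
            minY = Math.min(minY, corners[i + 1]);
            maxY = Math.max(maxY, corners[i + 1]);
        }
        return new float[] {minX, minY, maxX, maxY};
    }

    public static void main(String[] args) {
        // Horizontal text: bottom-left, top-left, top-right, bottom-right.
        float[] corners = {100f, 700f, 100f, 712f, 160f, 712f, 160f, 700f};
        System.out.println(java.util.Arrays.toString(boundsOf(corners)));
    }
}
```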
      • createAnnotationMarkup

        public AnnotationMarkup createAnnotationMarkup​(java.lang.String type)
        Create a new AnnotationMarkup of the specified type to cover this text. The annotation is not added to the page
        Parameters:
        type - the type of markup - "Highlight", "Underline" etc.
        Since:
        2.8
      • getAngle

        public final float getAngle()
        Return the angle of rotation of this text on the page, in degrees clockwise from 12 o'clock. Most text is not rotated and so will return 0.
        Returns:
        the angle of the text
      • getFontSize

        public abstract float getFontSize()
        Return the font size of this text in points
      • isHorizontal

        public abstract boolean isHorizontal()
        Indicates whether this text is horizontal or vertical. Note that vertical text will never be successfully positioned in the methods on this class that attempt to convert PDF text content into plain text.
        Since:
        2.18.3
      • getHorizontalScale

        public abstract float getHorizontalScale()
        Return an indication of the horizontal scale of the text. Typically this will be a value of 1; a value of 2 would mean the text had been stretched to double its natural width
        Since:
        2.18.1
      • getBaseline

        public abstract float getBaseline()
        Return the baseline of the text item, as a fraction between 0 and 1. 0 would indicate the baseline is at the top of the text, 1 at the absolute bottom. The value will normally be 0.8
        Since:
        2.11.7
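        The fraction can be turned into a page coordinate by linear interpolation between the top and bottom edges of the text. The sketch below assumes unrotated text and PDF-style coordinates in which y increases up the page; BaselineSketch and baselineY are hypothetical names, and how the caller obtains the top and bottom edges is left open.

```java
// Hypothetical helper, not part of the library.
final class BaselineSketch {
    // fraction 0 gives the top edge, fraction 1 gives the bottom edge.
    static float baselineY(float topY, float bottomY, float fraction) {
        return topY + fraction * (bottomY - topY);
    }

    public static void main(String[] args) {
        // With the typical fraction of 0.8 the baseline sits near the bottom edge.
        System.out.println(baselineY(712f, 700f, 0.8f));
    }
}
```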
      • getOffset

        public abstract float getOffset​(int pos)
        Given an offset into the text, return the start position of that letter. Because text may not be on a horizontal line, this value is returned as a float in the range 0 to 1 (0 being the start of the text, 1 being the end). For the common case where text is horizontal, you can calculate its start position like so:
         float left = text.getCorners()[0] + (text.getOffset(pos) * text.getLength());
         
        Parameters:
        pos - the position of the letter in the Text to retrieve the position for, in the range 0 to getText().length() - 1
        Since:
        2.6.12
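        For text that is not on a horizontal line, one possible approach is to interpolate along the baseline instead of adding to the x coordinate alone. The sketch below assumes the fraction maps linearly onto the line from (x1,y1) to (x4,y4), which is where the getCorners() description places the baseline of horizontal text. OffsetSketch and pointOnBaseline are hypothetical names, not library API.

```java
// Hypothetical helper, not part of the library.
final class OffsetSketch {
    // corners: {x1,y1,x2,y2,x3,y3,x4,y4}; fraction: a value in the range 0 to 1.
    // Returns {x, y} of the point at that fraction along (x1,y1)-(x4,y4).
    static float[] pointOnBaseline(float[] corners, float fraction) {
        if (corners == null || corners.length != 8) {
            throw new IllegalArgumentException("expected 8 values: x1,y1 ... x4,y4");
        }
        float x1 = corners[0], y1 = corners[1];
        float x4 = corners[6], y4 = corners[7];
        return new float[] {x1 + fraction * (x4 - x1), y1 + fraction * (y4 - y1)};
    }

    public static void main(String[] args) {
        float[] corners = {100f, 700f, 100f, 712f, 160f, 712f, 160f, 700f};
        float[] p = pointOnBaseline(corners, 0.5f);
        System.out.println(p[0] + ", " + p[1]);
    }
}
```

        With a Text instance in hand, the call would look like OffsetSketch.pointOnBaseline(text.getCorners(), text.getOffset(pos)).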
      • getEndOffset

        public abstract float getEndOffset​(int pos)
        As for getOffset() but return the end position of that letter
        Since:
        2.16.1
      • getPage

        public PDFPage getPage()
        Return the PDFPage this text was found on - simply the page the parent PageExtractor was created from.
        Since:
        2.6.12
      • getColor

        public abstract java.awt.Paint getColor()
        Return the color of this text, or null if none was set
        Returns:
        the color
      • getLineColor

        public abstract java.awt.Paint getLineColor()
        Return the outline color of this text, or null if none was set
        Returns:
        the outline color
        Since:
        2.17.1
      • getFontName

        public abstract java.lang.String getFontName()
        Return the font name of this text
        Returns:
        the name of the font
      • getText

        public abstract java.lang.String getText()
        Return the text content of this text
        Returns:
        the text
      • getNormalizedText

        public abstract java.lang.String getNormalizedText()
        Return a normalized form of the text, for text comparison purposes while searching. Normalization is done by converting to NFKD form and removing all diacritics.
        Returns:
        the normalized text
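        The two steps described above can be approximated with the standard java.text.Normalizer class. This is only a sketch of the idea: NormalizeSketch is a hypothetical name, "diacritics" is interpreted here as any Unicode mark character, and the library's own normalization may differ in detail.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of the library.
final class NormalizeSketch {
    static String normalize(String s) {
        // Step 1: compatibility decomposition, e.g. U+00E9 becomes "e" + U+0301.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        // Step 2: drop the mark characters left behind by the decomposition.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(normalize("Caf\u00e9"));
    }
}
```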
      • getTextLength

        public abstract int getTextLength()
        Return the length of the String returned by getText()
        Since:
        2.11.7
      • getRowNext

        public abstract PageExtractor.Text getRowNext()
        Return the next Text item in this row, or null if there are none
        Since:
        2.10.3
      • getRowPrevious

        public abstract PageExtractor.Text getRowPrevious()
        Return the previous Text item in this row, or null if there are none
        Since:
        2.10.3
      • getFontMetaData

        public abstract java.io.Reader getFontMetaData()
                                                throws java.io.IOException

        Return any XMP MetaData that has been set on the Font, or null if none exists.

        Since 2.24.3, the returned type is guaranteed to have a toString() method that will return the Metadata as a String.

        Throws:
        java.io.IOException
        Since:
        2.11.6
        See Also:
        PDF.getMetaData()
      • getSubText

        public abstract PageExtractor.Text getSubText​(int off,
                                                      int len)
        Return a substring of this Text object as another Text object
        Parameters:
        off - the offset into the text
        len - the number of characters to return
        Since:
        2.11.7
      • getPrimaryText

        public abstract PageExtractor.Text getPrimaryText()
        If this text is a subtext or collection of Text objects, return the primary text it starts with. Otherwise, returns null
        Since:
        2.11.7
      • getPrimaryTextOffset

        public abstract int getPrimaryTextOffset()
        If this text is a subtext or collection of Text objects, return the offset into the primary text where it starts. Otherwise, returns 0
        Since:
        2.11.7
      • getByteLength

        public abstract int getByteLength()
        Get the length of the original text in bytes. This method is required because the Highlight File Format contains references to the byte offset into the string, not to the character offset that the format itself states.
        Since:
        2.11.12
      • getByteToCharOffset

        public abstract int getByteToCharOffset​(int byteoffset)
        Given a byte offset into the original String, return the Character offset it refers to.
        Since:
        2.11.12
        See Also:
        getByteLength()
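        To show why byte and character offsets diverge, here is a sketch of such a mapping. It assumes the original text is UTF-8 encoded, which this documentation does not state; the encoding the library actually uses may be different. ByteOffsetSketch and its methods are hypothetical names, not library API.

```java
// Hypothetical helper, not part of the library. Assumes UTF-8 encoding.
final class ByteOffsetSketch {
    // Returns the char index whose encoded form starts at byteOffset.
    // An offset that falls inside a multi-byte character is rounded up
    // to the start of the next character.
    static int byteToCharOffset(String text, int byteOffset) {
        int bytes = 0;
        int i = 0;
        while (i < text.length()) {
            if (bytes >= byteOffset) {
                return i;
            }
            int cp = text.codePointAt(i);
            bytes += utf8Length(cp);
            i += Character.charCount(cp);
        }
        return text.length();
    }

    // Number of bytes UTF-8 uses to encode one code point.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    public static void main(String[] args) {
        // U+00E9 occupies two bytes, so byte 3 is the third character (index 2).
        System.out.println(byteToCharOffset("h\u00e9llo", 3));
    }
}
```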
      • getVisualBounds

        public abstract java.awt.Shape getVisualBounds()
        Return the visual bounds of the specified character in the string. This should be a rectangular shape which just clips the visual edges of the glyph. If the text is rotated, it will be a generic shape, but if the text is horizontal the shape will be a Rectangle2D object.
        Since:
        2.16.1