Class PDFParser
- java.lang.Object
  - org.faceless.pdf2.PDFParser
- All Implemented Interfaces: Pageable
public class PDFParser extends Object implements Pageable
The PDFParser class can be used to parse the contents of a PDF document, for example converting it to an Image, writing it to TIFF, printing it and so on. Typically you will either use PDFParser directly when working on the whole document (for instance, to save the PDF as a multi-page TIFF), or use it to get a PagePainter object for parsing individual pages or a PageExtractor object to extract text and images from a specific page.
Note that this class is part of the "Viewer Extension" of the library. Although it is supplied with the package, a "Viewer Extension" license must be purchased to activate this class. While the library is unlicensed this class may be used freely, although a "DEMO" stamp will be applied to each document.
This class implements Pageable, which means it can be printed directly using the PrinterJob.setPageable() method.
- Since:
- 2.5
-
-
Field Summary
Fields (Modifier and Type, Field: Description)
- static ColorModel BLACKANDWHITE: A ColorModel representing a 1-bit black and white color model, which can be passed to writeAsTIFF() or the various PagePainter methods.
- static ColorModel CMYK: A ColorModel representing an opaque CMYK color model, which can be passed to writeAsTIFF() or the various PagePainter methods.
- static ColorModel GRAYSCALE: A ColorModel representing an opaque grayscale color model, which can be passed to writeAsTIFF() or the various PagePainter methods.
- static ColorModel RGB: A ColorModel representing an opaque RGB color model, which can be passed to writeAsTIFF() or the various PagePainter methods.
- static ColorModel RGBA: A ColorModel representing a translucent RGB color model with an alpha component, which can be passed to writeAsTIFF() or the various PagePainter methods.
- static ColorModel SEPARATIONS: A ColorModel that can be passed to PagePainter.getImage(float).
Fields inherited from interface java.awt.print.Pageable
UNKNOWN_NUMBER_OF_PAGES
-
-
Method Summary
Methods (Modifier and Type, Method: Description)
- static ColorModel getBlackAndWhiteColorModel(int threshold): Return a black and white ColorModel that ensures that any colours below the specified threshold are converted to black.
- static ColorModel getBlackAndWhiteDitheredColorModel(): Return a black and white ColorModel that performs dithering on pixels.
- HtmlDerivation getHtmlDerivation(): Return a new HtmlDerivation based on this PDFParser.
- org.apache.lucene.document.Document getLuceneDocument(boolean createall, boolean createbody, boolean createpages): Create a Document object for indexing the PDF with the Apache Lucene full-text indexing library.
- int getNumberOfPages(): Return the number of pages in the document being parsed.
- PageExtractor getPageExtractor(int pagenumber): Returns a PageExtractor for the specified page number.
- PageExtractor getPageExtractor(PDFPage page): Returns a PageExtractor for the specified page.
- List<PageExtractor> getPageExtractors(): Get a list containing all the PageExtractors for this PDF, in order.
- PageFormat getPageFormat(int pagenumber): Returns the PageFormat for the specified page.
- PagePainter getPagePainter(int pagenumber): Returns a PagePainter for the specified page number.
- PagePainter getPagePainter(PDFPage page): Returns a PagePainter for the specified page.
- PDF getPDF(): Return the PDF this PDFParser is built from.
- Printable getPrintable(int pagenumber): Returns the Printable interface for a page.
- Document getStructureTree(): Returns the Structure Tree for the entire document as a W3C Document.
- float getWriteAsTIFFProgress(): Get the progress of the writeAsTIFF() method running in a different thread.
- boolean isExtractable(): Return true if this PDF allows its text and/or images to be extracted by calling the getPageExtractor(int) method.
- boolean isPrintable(): Return true if this PDF is allowed to be printed.
- void resetPageExtractor(PDFPage page): Reset the previously created PageExtractor.
- void setFont(String fontname, Object font): Specify a font substitution to use.
- void setOutputProfile(OutputProfile profile): Set the OutputProfile which should be updated for any extraction or rendering performed with this PDFParser.
- void setPrintAsImageResolution(int dpi): When printing a PDF via this class's Pageable interface, it can sometimes be useful to force the PDF to print as an image at a specific resolution.
- void writeAsTIFF(OutputStream out, int dpi, ColorModel model): Convert the PDF to a TIFF image using the specified ColorModel and dots per inch.
- void writeAsTIFF(OutputStream out, int dpi, ColorModel model, RenderingHints hints): As for writeAsTIFF(OutputStream,int,ColorModel) but allows the user to set RenderingHints to control the rendering process.
-
-
-
Field Detail
-
BLACKANDWHITE
public static final ColorModel BLACKANDWHITE
A ColorModel representing a 1-bit black and white color model, which can be passed to writeAsTIFF() or the various PagePainter methods. When writing TIFF images, however, we recommend using a model returned by getBlackAndWhiteColorModel(int) instead of this model, as it is much faster.
- See Also:
getBlackAndWhiteColorModel(int)
-
GRAYSCALE
public static final ColorModel GRAYSCALE
A ColorModel representing an opaque grayscale color model, which can be passed to writeAsTIFF() or the various PagePainter methods.
-
RGB
public static final ColorModel RGB
A ColorModel representing an opaque RGB color model, which can be passed to writeAsTIFF() or the various PagePainter methods.
-
RGBA
public static final ColorModel RGBA
A ColorModel representing a translucent RGB color model with an alpha component, which can be passed to writeAsTIFF() or the various PagePainter methods. TIFFs created this way will have a transparent background.
- Since:
- 2.5.2
-
CMYK
public static final ColorModel CMYK
A ColorModel representing an opaque CMYK color model, which can be passed to writeAsTIFF() or the various PagePainter methods.
- Since:
- 2.5.2
-
SEPARATIONS
public static final ColorModel SEPARATIONS
A ColorModel that can be passed to PagePainter.getImage(float). The returned image will be in a DeviceNColorSpace based on CMYK, but also containing a channel for any spot colors that are used in the PDF. This image can be converted to RGB for a "print preview" using the DeviceNColorSpace.getColorConvertOp method, or the image can be converted to a PDFImage for embedding into a PDF.
This ColorModel can also be passed to writeAsTIFF(). However, if the returned image contains more than four color components (i.e. it has spot colors), the resulting file will be technically valid (the TIFF format allows multi-channel images) but unviewable, as the TIFF will not include a color space specifying how to convert the N channels to XYZ or RGB.
- Since:
- 2.28.3
- See Also:
DeviceNColorSpace.getColorConvertOp(java.awt.color.ColorSpace, java.lang.Object...)
-
-
Method Detail
-
getPDF
public final PDF getPDF()
Return the PDF this PDFParser is built from.
- Since:
- 2.11.3
-
getPagePainter
public PagePainter getPagePainter(int pagenumber)
Returns a PagePainter for the specified page number. Just calls getPagePainter(pdf.getPage(pagenumber)).
- Parameters:
pagenumber
- the page to select, from 0 to PDF.getNumberOfPages()-1
- Returns:
- a PagePainter for the specified page
-
getPagePainter
public PagePainter getPagePainter(PDFPage page)
Returns a PagePainter for the specified page.
- Parameters:
page
- the PDFPage to select
- Returns:
- a PagePainter for the specified page
- Since:
- 2.7.1
-
getPageExtractor
public PageExtractor getPageExtractor(int pagenumber)
Returns a PageExtractor for the specified page number. Just calls getPageExtractor(pdf.getPage(pagenumber)).
- Parameters:
pagenumber
- the page to select, from 0 to PDF.getNumberOfPages()-1
- Returns:
- a PageExtractor for the specified page
- Since:
- 2.6.1
- See Also:
isExtractable()
-
getPageExtractor
public PageExtractor getPageExtractor(PDFPage page)
Returns a PageExtractor for the specified page. If the PDF does not allow extraction, throws a SecurityException.
- Parameters:
page
- the page to select
- Returns:
- a PageExtractor for the specified page
- Since:
- 2.7.1
- See Also:
isExtractable()
-
resetPageExtractor
public void resetPageExtractor(PDFPage page)
Reset the previously created PageExtractor. This only needs to be done if the page has had its content altered, e.g. by appending to it or by changing its orientation.
- Since:
- 2.11.7
-
getPageExtractors
public List<PageExtractor> getPageExtractors()
Get a list containing all the PageExtractors for this PDF, in order. This is not a particularly expensive operation, as the extraction is not run when the extractor is created.
- Since:
- 2.11.7
-
getStructureTree
public Document getStructureTree()
Returns the Structure Tree for the entire document as a W3C Document. As of 2.24, this is simply an alias for the following:
    Document doc = pdf.getStructureTree();
    doc.getDomConfig().setParameter("extract-text", true);
    return doc;
- Returns:
- the document-wide Structure Tree.
- Since:
- 2.19
- See Also:
PDF.getStructureTree()
-
setFont
public void setFont(String fontname, Object font)
Specify a font substitution to use. For unembedded fonts, the library must choose a substitute font to render the glyphs. Typically the heuristics used are quite effective, but occasionally (particularly with East Asian fonts) they may need to be overridden. This method allows you to specify the mapping from a PDF font name to an AWT font, overriding the heuristics.
- Parameters:
fontname
- the name of the font used in the PDF
font
- the Font to use, either a Font or an OpenTypeFont
- Since:
- 2.7.7 (since 2.11.17 the second parameter can also be an OpenTypeFont)
-
setPrintAsImageResolution
public void setPrintAsImageResolution(int dpi)
When printing a PDF via this class's Pageable interface, it can sometimes be useful to force the PDF to print as an image at a specific resolution. This method can be called to set that resolution. The default value is 0, which means the file will not be printed as an image. Any other value will cause the page being printed to be converted to a bitmap at that resolution before printing. Suggested values are between 150 and 600.
- Since:
- 2.16.4
-
isPrintable
public boolean isPrintable()
Return true if this PDF is allowed to be printed. Since 2.8.2 this method simply returns the value of EncryptionHandler.hasRight("Print").
- Returns:
- true if the document is allowed to be printed
-
isExtractable
public boolean isExtractable()
Return true if this PDF allows its text and/or images to be extracted by calling the getPageExtractor(int) method. PDFs may optionally be encrypted to prevent this; see the StandardEncryptionHandler class for more information. Since 2.8.2 this method simply returns the value of EncryptionHandler.hasRight("Extract").
- Returns:
- true if the document can have its text and/or images extracted.
-
writeAsTIFF
public void writeAsTIFF(OutputStream out, int dpi, ColorModel model) throws IOException
Convert the PDF to a TIFF image using the specified ColorModel and dots per inch. For example, to convert the PDF to a black and white TIFF, try:
    PDFParser parser = new PDFParser(pdf);
    FileOutputStream out = new FileOutputStream("out.tif");
    parser.writeAsTIFF(out, 72, PDFParser.BLACKANDWHITE);
    out.close();
The ColorModel determines what type of TIFF is created and what sort of compression is used. For instance, passing in a 1-bit black and white model will result in a black and white TIFF compressed with CCITT Group 4 compression. If the specified model returns Transparency.TRANSLUCENT from ColorModel.getTransparency() then the TIFF will be written with alpha values and created with a transparent background; otherwise the TIFF will have a white background and will be written without alpha values. Note that specifying a model that doesn't match the model of the PDF causes color conversions to be applied, which can be quite a slow process.
You can create TIFF images that contain fewer than all the pages of the PDF by manipulating the PDF's page list before saving. Say, for example, you want to create 10 single-page TIFF images from your 10-page PDF document. Here's how:
    // "out" is an array of OutputStreams, one per page
    List<PDFPage> copy = new ArrayList<PDFPage>(pdf.getPages());
    for (int i = 0; i < copy.size(); i++) {
        pdf.getPages().clear();
        pdf.getPages().add(copy.get(i));
        // writeAsTIFF is a PDFParser method, so build a parser on the trimmed PDF
        new PDFParser(pdf).writeAsTIFF(out[i], dpi, model);
    }
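As a hedged illustration of the alpha rule described in this section, the check below uses only standard JDK classes to predict whether a given ColorModel would lead to a TIFF written with alpha values. The class and method names (AlphaCheck, writesAlpha) are hypothetical and not part of the library; the sketch only mirrors the documented test on ColorModel.getTransparency().

```java
import java.awt.Transparency;
import java.awt.image.ColorModel;

// Hypothetical helper, not part of org.faceless.pdf2.
class AlphaCheck {
    /**
     * Mirrors the documented rule: a model reporting TRANSLUCENT yields a
     * TIFF with alpha values and a transparent background; anything else
     * yields a white background without alpha.
     */
    static boolean writesAlpha(ColorModel model) {
        return model.getTransparency() == Transparency.TRANSLUCENT;
    }
}
```

For example, the JDK's default ARGB model (ColorModel.getRGBdefault()) reports TRANSLUCENT, while the model of a BufferedImage of TYPE_INT_RGB reports OPAQUE.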
Parallel Operation Note: Since 2.10, this method can optionally run multiple threads in parallel to speed up writing. To enable this, set the Threads.TIFF property (typically by setting the org.faceless.pdf2.Threads.TIFF System property) to the number of threads you want to use. Note that each thread may require a significant amount of memory. How much depends on the content of each page, so it is very difficult to determine in advance. Carefully tune this value yourself based on the amount of memory in your system and the type of documents you're working with, in order to avoid an OutOfMemoryError.
- Parameters:
out
- the OutputStream to write the TIFF to. The stream will be left open on completion
dpi
- the resolution, in dots per inch, at which to render the page. A value of 72 results in 1 point per pixel. As a special hack for those creating Class F TIFF images, a DPI of -1 gives a 204x196 DPI image and -2 gives 204x96 DPI (these were added in 2.6.9).
model
- the ColorModel to use to render the images.
- Throws:
IOException
- if an exception is encountered when writing the TIFF
- See Also:
BLACKANDWHITE, RGB, CMYK, getBlackAndWhiteColorModel(int)
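The Parallel Operation Note above can be sketched as follows. This is a minimal example under stated assumptions: the property name is the one given in the text, but the helper class (TiffThreadConfig), the idea of capping threads by a per-thread memory budget, and the budget value itself are illustrative guesses that must be tuned for real documents. It is also assumed that the property has to be set before writeAsTIFF() is called.

```java
// Hypothetical helper, not part of org.faceless.pdf2.
class TiffThreadConfig {
    /** System property name taken from the documentation above. */
    static final String PROPERTY = "org.faceless.pdf2.Threads.TIFF";

    /**
     * Choose a thread count limited by both the CPU cores and a rough
     * per-thread memory budget. Always returns at least 1.
     */
    static int chooseThreads(int cores, long maxHeapBytes, long assumedBytesPerThread) {
        long byMemory = Math.max(1, maxHeapBytes / assumedBytesPerThread);
        return (int) Math.max(1, Math.min(cores, byMemory));
    }

    /** Set the System property from the current JVM's cores and max heap. */
    static void configure(long assumedBytesPerThread) {
        Runtime rt = Runtime.getRuntime();
        int threads = chooseThreads(rt.availableProcessors(), rt.maxMemory(), assumedBytesPerThread);
        System.setProperty(PROPERTY, Integer.toString(threads));
    }
}
```

Calling TiffThreadConfig.configure(256L << 20) would budget roughly 256 MB per thread; that figure is a placeholder, since the text notes that real memory use depends on page content.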
-
writeAsTIFF
public void writeAsTIFF(OutputStream out, int dpi, ColorModel model, RenderingHints hints) throws IOException
As for writeAsTIFF(OutputStream,int,ColorModel) but allows the user to set RenderingHints to control the rendering process.
- Parameters:
out
- the OutputStream to write the TIFF to. The stream will be left open on completion
dpi
- the resolution, in dots per inch, at which to render the page. A value of 72 results in 1 point per pixel. As a special hack for those creating Class F TIFF images, a DPI of -1 gives a 204x196 DPI image and -2 gives 204x96 DPI (these were added in 2.6.9).
model
- the ColorModel to use to render the images.
hints
- the RenderingHints to be used when rendering the image, or null to use the defaults.
- Throws:
IOException
- if an exception is encountered when writing the TIFF
- Since:
- 2.6.3
- See Also:
BLACKANDWHITE, RGB, CMYK, getBlackAndWhiteColorModel(int)
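To make the dpi parameter concrete, the sketch below works through the arithmetic implied by "72 results in 1 point per pixel" and the two Class F codes listed above. The class and method names are hypothetical, and rounding to the nearest pixel is an assumption; the library's own rounding is not documented here.

```java
// Hypothetical helper, not part of org.faceless.pdf2.
class TiffDimensions {
    /**
     * Map the dpi argument to {horizontal, vertical} DPI, including the
     * two special Class F values described in the parameter text.
     */
    static int[] resolveDpi(int dpi) {
        switch (dpi) {
            case -1: return new int[] {204, 196};
            case -2: return new int[] {204, 96};
            default: return new int[] {dpi, dpi};
        }
    }

    /** A point is 1/72 inch, so pixels = points * dpi / 72 (rounding assumed). */
    static int toPixels(float points, int dpi) {
        return Math.round(points * dpi / 72f);
    }
}
```

Under these assumptions a US Letter page (612 x 792 points) rendered at 300 dpi would be 2550 x 3300 pixels, and at 72 dpi it would be 612 x 792 pixels.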
-
getWriteAsTIFFProgress
public float getWriteAsTIFFProgress()
Get the progress of the writeAsTIFF() method running in a different thread. The returned value will start at 0 and move towards 1 as the write progresses.
- Since:
- 2.8
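One way to use a progress value like this is to run the write on a worker thread and poll from the caller. The sketch below is library-independent: it takes the write as a Runnable and the progress source as a Supplier, so the names ProgressPoller and runWithProgress are hypothetical. With the library on the classpath, the supplier would presumably be parser::getWriteAsTIFFProgress and the Runnable would wrap the writeAsTIFF() call, handling its checked IOException.

```java
import java.util.function.Supplier;

// Hypothetical helper, not part of org.faceless.pdf2.
class ProgressPoller {
    /**
     * Run writeTask on a worker thread and print the progress (0..1) every
     * pollMillis milliseconds until it finishes. pollMillis must be > 0.
     * Returns the last progress value read after the worker has stopped.
     */
    static float runWithProgress(Runnable writeTask, Supplier<Float> progress, long pollMillis) {
        Thread worker = new Thread(writeTask, "tiff-writer");
        worker.start();
        while (worker.isAlive()) {
            System.out.printf("TIFF progress: %.0f%%%n", progress.get() * 100);
            try {
                worker.join(pollMillis); // wait up to pollMillis for completion
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                break;
            }
        }
        return progress.get();
    }
}
```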
-
setOutputProfile
public void setOutputProfile(OutputProfile profile)
Set the OutputProfile which should be updated for any extraction or rendering performed with this PDFParser. This will not give the full PDF OutputProfile (for that you should call PDF.getFullOutputProfile()), but it can be used to determine which features apply to particular pages.
- Since:
- 2.11.25
-
getNumberOfPages
public int getNumberOfPages()
Return the number of pages in the document being parsed. Needed for the Pageable interface, this method just calls PDF.getNumberOfPages().
- Specified by:
getNumberOfPages in interface Pageable
- Returns:
- the number of pages in the document being parsed
-
getPageFormat
public PageFormat getPageFormat(int pagenumber)
Returns the PageFormat for the specified page.
- Specified by:
getPageFormat in interface Pageable
- Parameters:
pagenumber
- the page to select, from 0 to PDF.getNumberOfPages()-1
- Returns:
- the PageFormat for the page at index pagenumber
-
getPrintable
public Printable getPrintable(int pagenumber)
Returns the Printable interface for a page. Needed for the Pageable interface, this method just calls getPagePainter(int).
- Specified by:
getPrintable in interface Pageable
- Parameters:
pagenumber
- the page to select, from 0 to PDF.getNumberOfPages()-1
- Returns:
- the Printable object for the specified page
-
getLuceneDocument
public org.apache.lucene.document.Document getLuceneDocument(boolean createall, boolean createbody, boolean createpages)
Create a Document object for indexing the PDF with the Apache Lucene full-text indexing library. The Document is created with Field objects representing the content of the PDF, the Info dictionary, the form and any annotations that may be there. The fields are called:
- body: The contents of all the pages in the PDF
- page.n: The contents of page n of the PDF
- info.field: The contents of the field "field" of the Info dictionary, e.g. info.Title
- info: The contents of the whole Info dictionary as one item
- form.field: The contents of the field "field" of the Form
- form: The contents of the whole Form as one item
- annotations: The contents of all the annotations in the document as one item
- all: All the fields above concatenated into one big field, useful for searching the entire textual content of the PDF in one go
Because creating indices for all, body and page.n is usually redundant (typically you will want only one of them), they can be turned on or off individually by setting the appropriate parameter to true or false.
- Parameters:
createall
- whether to create an all entry in the index
createbody
- whether to create a body entry in the index
createpages
- whether to create the page.n entries in the index
- Returns:
- a Document suitable for indexing with Lucene.
- Since:
- 2.6.2
-
getBlackAndWhiteColorModel
public static ColorModel getBlackAndWhiteColorModel(int threshold)
Return a black and white ColorModel that ensures that any colours below the specified threshold are converted to black. This method can be used to convert images that have shades of gray to black and white TIFF images. Because it renders the PDF to RGB before manually converting it to black and white, it avoids some of the platform-dependent behaviour that arises from using BLACKANDWHITE, and will probably run faster on many operating systems.
Note this ColorModel should only be used in the writeAsTIFF(java.io.OutputStream, int, java.awt.image.ColorModel) method. Passing it into one of the PagePainter.getImage methods will not work.
- Parameters:
threshold
- a number between 0 and 255, typically around 128 or so. Higher values result in more black. Since 2.11.17 the value "0" can be used to automatically determine the threshold value using Otsu's algorithm. This may be appropriate for poor quality images.
- Since:
- 2.6.8
- See Also:
BLACKANDWHITE, writeAsTIFF(java.io.OutputStream, int, java.awt.image.ColorModel)
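The thresholding idea behind this method can be illustrated with plain JDK imaging classes. The sketch below is not the library's implementation: the class name, the Rec. 601 luminance weights, and the exact comparison (strictly below the threshold becomes black) are assumptions. It also includes a textbook version of Otsu's algorithm, which picks the threshold maximising the between-class variance of the luminance histogram.

```java
import java.awt.image.BufferedImage;

// Hypothetical illustration, not the library's implementation.
class ThresholdSketch {
    /** Luminance (0-255) of a packed RGB pixel, using integer Rec. 601 weights. */
    static int luminance(int rgb) {
        int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
        return (299 * r + 587 * g + 114 * b) / 1000;
    }

    /** Pixels darker than the threshold become black, all others white. */
    static BufferedImage toBlackAndWhite(BufferedImage src, int threshold) {
        BufferedImage out = new BufferedImage(src.getWidth(), src.getHeight(),
                BufferedImage.TYPE_BYTE_BINARY); // 1 bit per pixel
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                boolean black = luminance(src.getRGB(x, y)) < threshold;
                out.setRGB(x, y, black ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return out;
    }

    /** Otsu's method: the threshold that maximises between-class variance. */
    static int otsuThreshold(BufferedImage src) {
        int[] hist = new int[256];
        for (int y = 0; y < src.getHeight(); y++) {
            for (int x = 0; x < src.getWidth(); x++) {
                hist[luminance(src.getRGB(x, y))]++;
            }
        }
        long total = (long) src.getWidth() * src.getHeight();
        long sumAll = 0;
        for (int i = 0; i < 256; i++) sumAll += (long) i * hist[i];

        long countBelow = 0, sumBelow = 0;
        double bestVariance = -1;
        int best = 0;
        for (int t = 0; t < 256; t++) {
            countBelow += hist[t];
            sumBelow += (long) t * hist[t];
            if (countBelow == 0) continue;        // dark class still empty
            long countAbove = total - countBelow;
            if (countAbove == 0) break;           // light class empty from here on
            double meanBelow = (double) sumBelow / countBelow;
            double meanAbove = (double) (sumAll - sumBelow) / countAbove;
            double diff = meanBelow - meanAbove;
            double variance = (double) countBelow * countAbove * diff * diff;
            if (variance > bestVariance) {
                bestVariance = variance;
                best = t + 1; // values <= t are dark, so "< t + 1" selects them
            }
        }
        return best;
    }
}
```

With this convention, raising the threshold turns more pixels black, which matches the note that higher values result in more black.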
-
getBlackAndWhiteDitheredColorModel
public static ColorModel getBlackAndWhiteDitheredColorModel()
Return a black and white ColorModel that performs dithering on pixels. Note this ColorModel should only be used in the writeAsTIFF(java.io.OutputStream, int, java.awt.image.ColorModel) method. Passing it into one of the PagePainter.getImage methods will not work.
- Since:
- 2.18
- See Also:
BLACKANDWHITE, writeAsTIFF(java.io.OutputStream, int, java.awt.image.ColorModel)
-
getHtmlDerivation
public HtmlDerivation getHtmlDerivation()
Return a new HtmlDerivation based on this PDFParser.
- Returns:
- the HtmlDerivation
- Since:
- 2.28.4
-
-