Class HtmlDerivation
- java.lang.Object
-
- org.faceless.pdf2.HtmlDerivation
-
public class HtmlDerivation extends Object
The HtmlDerivation class can be used to derive an HTML document from a PDF. It is an implementation of the PDF Association's experimental Deriving HTML from PDF algorithm, and as such has the following limitations:
- The input must be a tagged PDF - ideally compliant with PDF/UA-1 or PDF/UA-2
- The output will match the requirements of that specification by default, although this may be changed with options
With a compliant PDF, usage is trivial. However, the phrase "garbage in, garbage out" applies very strongly to this conversion. The input PDF is expected to be tagged correctly, using tags defined in the PDF specification, and even PDFs that meet this requirement will very seldom include the full set of style attributes. The more data that is included, the better the HTML derivation will be.
    PDF pdf = ...
    PDFParser parser = new PDFParser(pdf);
    HtmlDerivation html = parser.getHtmlDerivation();
    html.derive();
    html.writeAsHTML(System.out);
The setOption(java.lang.String, java.lang.Object) method can be used to control the derivation process. As the specification is expected to evolve over time, the API is flexible and options are specified as strings. The current list of options can be retrieved by calling getOptions; on a new HtmlDerivation object, these will be the default values. The currently defined options are:

| Option | Value | Description |
|---|---|---|
| pdf-* | never \| always \| nostylesheet | How to extract CSS properties based on the physical font used on the page, such as the font-family, color and so on. "always" means always extract the style, "never" means never extract the style and "nostylesheet" means only extract them if a stylesheet is not embedded in the PDF. Valid values for the key are pdf-font-family, pdf-font-size, pdf-font-weight, pdf-font-style, pdf-font-color, pdf-font-outline-color, or pdf-font-* to set all of them at once |
| layout-* | never \| always \| nostylesheet | How to extract CSS properties based on any Layout attributes in the Structure Tree. "always" means always extract the style, "never" means never extract the style and "nostylesheet" means only extract them if a stylesheet is not embedded in the PDF. Valid values for the key are as defined in the PDF specification section 14.8.5.4, e.g. layout-placement, or layout-* to set all of them at once |
| css-* | never \| always \| nostylesheet | How to extract CSS properties based on any CSS attributes in the Structure Tree. "always" means always extract the style, "never" means never extract the style and "nostylesheet" means only extract them if a stylesheet is not embedded in the PDF. Valid values are any CSS attribute, e.g. css-float, or css-* to set all of them at once |
| cascade-layers | true \| false | Whether to use CSS cascade layers to group CSS attributes. We strongly recommend this is set, as it allows you to easily prioritise the various sources of "truth" for each CSS attribute |
| use-list-style | true \| false | Whether to define custom counter-style rules for custom lists. We generally recommend this is set, as it means any PDF lists are created as well-formed HTML lists with markers |
| image-dpi | integer between 1..1200 | When rasterizing an image in the PDF to an image in the HTML, what resolution to extract it at. The default is 200dpi |
| image-mask-thickness | integer between 0..5 | When rasterizing an image in the PDF to an image in the HTML and the image contains white pixels on the boundary, this value sets the thickness of the gray outline drawn around the image to make it visible on a white background. The default is 1 |

- Since:
- 2.28.4
-
-
Nested Class Summary
| Modifier and Type | Class | Description |
|---|---|---|
| static interface | HtmlDerivation.ResourceManager | An interface used to manage external resources (e.g. images) that are referenced from the HTML |
| static interface | HtmlDerivation.Test | |
-
Method Summary
| Modifier and Type | Method | Description |
|---|---|---|
| void | derive() | Derive an XHTML document DOM from the given PDF document. |
| Document | getDocument() | Return the Document generated by derive(), or null if it hasn't been generated yet |
| Collection<String> | getEmbeddableMediaTypes() | Return a list of Media Types that will be embedded directly into the output if found. |
| Object | getOption(String option) | Return a previously set option |
| void | setOption(String key, Object value) | Set an option that can be used to configure the output of the Html Derivation algorithm. |
| void | setResourceManager(HtmlDerivation.ResourceManager manager) | Set the HtmlDerivation.ResourceManager used to manage the URLs of external resources like images and stylesheets. |
| void | writeAsHTML(Appendable out) | Write the specified document as HTML. |
| void | writeAsXHTML(Appendable out) | Write the specified document as XHTML. |
-
-
-
Method Detail
-
setOption
public void setOption(String key, Object value) throws IOException
Set an option that can be used to configure the output of the Html Derivation algorithm. The API is intentionally generic to allow options to be added (or removed) over time without breaking API compatibility. For a list of valid options see the class API docs. NOTE: as the list of valid options is expected to change over time, we recommend catching exceptions thrown by this method and continuing.
- Parameters:
key - the option
value - the value to set for the option
- Throws:
IllegalArgumentException - if the option or value is not recognised
IOException
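The recommendation to catch exceptions and continue can be sketched as a small helper. This is only an illustration: the TolerantOptions class and its OptionSetter interface are hypothetical names that are not part of the library. OptionSetter simply has the same shape as setOption(String, Object), so a method reference such as html::setOption should fit it. Rethrowing IOException as UncheckedIOException is a choice made for this sketch, not documented behaviour.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of org.faceless.pdf2.
final class TolerantOptions {

    /** Stand-in with the same shape as HtmlDerivation.setOption(String, Object). */
    @FunctionalInterface
    interface OptionSetter {
        void setOption(String key, Object value) throws IOException;
    }

    /**
     * Applies each option in iteration order. Options rejected with
     * IllegalArgumentException are skipped, and their keys are returned
     * so the caller can log them. An IOException aborts the run.
     */
    static List<String> applyAll(OptionSetter target, Map<String, Object> options) {
        List<String> rejected = new ArrayList<>();
        for (Map.Entry<String, Object> e : options.entrySet()) {
            try {
                target.setOption(e.getKey(), e.getValue());
            } catch (IllegalArgumentException ex) {
                // Option or value not recognised by this version: skip it
                rejected.add(e.getKey());
            } catch (IOException ex) {
                throw new UncheckedIOException(ex);
            }
        }
        return rejected;
    }
}
```

With an HtmlDerivation named html, a call might look like TolerantOptions.applyAll(html::setOption, options), where options maps keys such as "image-dpi" to their values. The exact value types each option accepts are not stated on this page, so treat them as something to verify.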
-
getOption
public Object getOption(String option)
Return a previously set option.
- Parameters:
option - the option
- Returns:
- the value, or null if the option is unrecognised.
-
getEmbeddableMediaTypes
public Collection<String> getEmbeddableMediaTypes()
Return a list of Media Types that will be embedded directly into the output if found. The returned Collection is live and can be modified.
- Returns:
- a modifiable list of strings representing media types.
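Because the returned Collection is live, it can presumably be edited in place to narrow what gets embedded. The sketch below shows one way to do that with a plain java.util.Collection; the MediaTypeFilter class is a hypothetical helper, and the media type strings used with it are examples rather than the library's documented defaults.

```java
import java.util.Arrays;
import java.util.Collection;

// Hypothetical helper, not part of org.faceless.pdf2.
final class MediaTypeFilter {

    /**
     * Removes every entry from the live collection except the ones listed.
     * Returns true if the collection was changed.
     */
    static boolean retainOnly(Collection<String> live, String... keep) {
        return live.retainAll(Arrays.asList(keep));
    }
}
```

With an HtmlDerivation named html, this could be called as MediaTypeFilter.retainOnly(html.getEmbeddableMediaTypes(), "image/png") before calling derive().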
-
setResourceManager
public void setResourceManager(HtmlDerivation.ResourceManager manager)
Set the HtmlDerivation.ResourceManager used to manage the URLs of external resources like images and stylesheets. By default images are generated as data URLs and written directly to the HTML.
- Parameters:
manager - the manager, which must not be null. Default is HtmlDerivation.ResourceManager.INLINE
- Since:
- 2.29.5
-
derive
public void derive()
Derive an XHTML document DOM from the given PDF document. The derived CSS will be directly contained in a style element as opposed to a link. This document can then be serialized to HTML with a <!DOCTYPE html> doctype declaration as part of a serialization step if required, using UTF-8 charset as specified.
-
getDocument
public Document getDocument()
Return the Document generated by derive(), or null if it hasn't been generated yet.
- Returns:
- the document
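Once derive() has run, the DOM can be inspected before it is serialized. The sketch below assumes the returned Document is an org.w3c.dom.Document (this page only shows the simple name) and counts elements by tag name, which is a quick way to check how much structure came through. DomSummary is a hypothetical helper name.

```java
import java.util.Map;
import java.util.TreeMap;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of org.faceless.pdf2.
final class DomSummary {

    /**
     * Counts the elements in the document by node name.
     * A null document (derive() not called yet) gives an empty map.
     */
    static Map<String, Integer> countElements(Document doc) {
        Map<String, Integer> counts = new TreeMap<>();
        if (doc == null) {
            return counts;
        }
        // "*" matches every element in the document, including the root
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            counts.merge(all.item(i).getNodeName(), 1, Integer::sum);
        }
        return counts;
    }
}
```

With an HtmlDerivation named html, DomSummary.countElements(html.getDocument()) would give a per-tag tally of the derived document.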
-
writeAsXHTML
public void writeAsXHTML(Appendable out) throws IOException
Write the specified document as XHTML.
- Parameters:
out - the Appendable to write to
- Throws:
IOException
-
writeAsHTML
public void writeAsHTML(Appendable out) throws IOException
Write the specified document as HTML.
- Parameters:
out - the Appendable to write to
- Throws:
IOException
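Both write methods take an Appendable, so the output can go to a StringBuilder, a Writer or System.out. The sketch below shows two common targets: capturing the markup as a String, and writing it to a file in UTF-8. HtmlCapture and its AppendableWriter interface are hypothetical names; AppendableWriter has the same shape as writeAsHTML(Appendable) and writeAsXHTML(Appendable), so a method reference to either should fit.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of org.faceless.pdf2.
final class HtmlCapture {

    /** Stand-in with the same shape as writeAsHTML(Appendable). */
    @FunctionalInterface
    interface AppendableWriter {
        void writeTo(Appendable out) throws IOException;
    }

    /** Collects everything the writer emits into a String. */
    static String capture(AppendableWriter writer) {
        StringBuilder sb = new StringBuilder();
        try {
            writer.writeTo(sb);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    /** Writes the output to a file, encoded as UTF-8. */
    static void writeUtf8(AppendableWriter writer, Path file) {
        try (BufferedWriter out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            writer.writeTo(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With an HtmlDerivation named html, HtmlCapture.capture(html::writeAsHTML) would return the markup as a String, and HtmlCapture.writeUtf8(html::writeAsXHTML, path) would save the XHTML form to path.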
-
-