java.lang.Object
- org.faceless.pdf2.HtmlDerivation

public class HtmlDerivation
extends java.lang.Object

The HtmlDerivation class derives an HTML document from a PDF. It is an implementation of the PDF Association's experimental "Deriving HTML from PDF" algorithm, and as such has the following limitations:

- The input must be a tagged PDF, ideally compliant with PDF/UA-1 or PDF/UA-2.
- The output will match the requirements of that specification by default, although this may be changed with options.

With a compliant PDF, usage is trivial. However, the phrase "garbage in, garbage out" applies strongly to this conversion. The input PDF is expected to be tagged correctly, using tags defined in the PDF specification, and even PDFs that meet this requirement seldom include the full set of style attributes. The more of this data the PDF includes, the better the derived HTML will be.

```
PDF pdf = ...
PDFParser parser = new PDFParser(pdf);
HtmlDerivation html = parser.getHtmlDerivation();
html.derive();
html.writeAsHTML(System.out);
```

The setOption(java.lang.String, java.lang.Object...) method can be used to control the derivation process. As the specification is expected to evolve over time, the API is flexible and options are specified as strings. The current list of options can be retrieved by calling getOptions(); on a new HtmlDerivation object, these will be the default values. The full list of available options is:

layout-property-never-NNN	Never generate CSS properties from the specific PDF `Layout:` attribute NNN, e.g. `layout-property-never-TextIndent` will prevent `Layout:TextIndent` from being converted into a CSS property.
layout-properties-never	Never generate CSS properties from PDF `Layout:` attributes
layout-properties-always	Always generate CSS properties from PDF `Layout:` attributes
layout-properties-if-no-stylesheet	Generate CSS properties from PDF `Layout:` attributes only if there are no stylesheets attached to the PDF
intrinsic-properties-never	Never generate CSS properties based on the style in use when the text is drawn on the page.
intrinsic-properties-always	Always generate CSS properties based on the style in use when the text is drawn on the page.
intrinsic-properties-if-no-stylesheet	Generate CSS properties from the style in use when the text is drawn on the page, but only if there are no stylesheets attached to the PDF
cascade-layers	Use an alternative method of styling elements with CSS cascade-layers.
override-list-style-type	Rather than the inline bullets currently recommended in the specification, use the normal CSS `list-style-type` attribute, possibly with a custom `@counter-style`, to style list bullets
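
As a minimal sketch of how these option strings might be assembled and applied: `OptionSetter` and `DerivationOptions` below are hypothetical names, not part of the library. The interface only mirrors the documented `setOption(String, Object...)` signature so that a method reference such as `html::setOption` could be passed in. How conflicting options interact is not documented here and is not assumed.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in mirroring the documented setOption(String, Object...) signature.
@FunctionalInterface
interface OptionSetter {
    boolean setOption(String option, Object... params) throws IOException;
}

// Hypothetical helper; not part of org.faceless.pdf2.
final class DerivationOptions {
    private DerivationOptions() {}

    // Builds the per-attribute option name, e.g. "layout-property-never-TextIndent".
    static String neverForLayoutAttribute(String attributeName) {
        return "layout-property-never-" + attributeName;
    }

    // Applies each option in order, with no extra parameters, and returns
    // the options for which the setter returned false (reported as invalid).
    static List<String> applyAll(OptionSetter setter, List<String> options) throws IOException {
        List<String> rejected = new ArrayList<>();
        for (String option : options) {
            if (!setter.setOption(option)) {
                rejected.add(option);
            }
        }
        return rejected;
    }
}
```

With an `HtmlDerivation` named `html`, a call along the lines of `DerivationOptions.applyAll(html::setOption, options)` would then report any option name the library did not accept, which may help catch typos in option strings.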

Since: 2.28.4

Method Summary

Modifier and Type	Method	Description
`void`	`derive()`	Derive an XHTML document DOM from the given PDF document.
`org.w3c.dom.Document`	`getDocument()`	Return the Document generated by `derive()`, or null if it hasn't been generated yet
`java.util.Collection<java.lang.String>`	`getEmbeddableMediaTypes()`	Return a list of Media Types that will be embedded directly into the output if found.
`java.util.List<java.lang.String>`	`getOptions()`	Return the list of currently specified options
`boolean`	`setOption(java.lang.String option, java.lang.Object... params)`	Set an option that can be used to configure the output of the Html Derivation algorithm.
`void`	`setOutputDirectory(java.io.File outputDirectory)`	Set the directory where the HtmlDerivation class should create files that are referenced from the HTML (currently, just images).
`void`	`writeAsHTML(java.lang.Appendable out)`	Write the specified document as HTML.
`void`	`writeAsXHTML(java.lang.Appendable out)`	Write the specified document as XHTML.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Method Detail
  - setOption
```
public boolean setOption(java.lang.String option,
                         java.lang.Object... params)
                  throws java.io.IOException
```
    Set an option that can be used to configure the output of the Html Derivation algorithm. The API is intentionally generic to allow options to be added (or removed) over time without breaking API compatibility. For a list of valid options see the class API docs.
    
    Parameters:
    
    option - the option
    
    params - an optional list of parameters, if required by the option
    
    Returns:
    
    true if the option was valid
    
    Throws:
    
    java.io.IOException
  - getOptions
```
public java.util.List<java.lang.String> getOptions()
```
    Return the list of currently specified options
  - getEmbeddableMediaTypes
```
public java.util.Collection<java.lang.String> getEmbeddableMediaTypes()
```
    Return a list of Media Types that will be embedded directly into the output if found. The returned Collection is live and can be modified.
    
    Returns:
    
    a modifiable list of strings representing media types.
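
    Because the returned Collection is live, it can be edited in place. The sketch below shows one way to restrict such a collection to an allow-list. `MediaTypeFilter` is a hypothetical helper, the media type strings used with it are illustrative only (the default contents are not listed here), and it is an assumption that any change should be made before calling derive().

```java
import java.util.Collection;
import java.util.Set;

// Hypothetical helper; not part of org.faceless.pdf2.
final class MediaTypeFilter {
    private MediaTypeFilter() {}

    // Removes every entry that is not in the allow-list, mutating the
    // collection in place. Returns true if the collection was changed.
    static boolean restrictTo(Collection<String> liveTypes, Set<String> allowed) {
        return liveTypes.retainAll(allowed);
    }
}
```

    With an HtmlDerivation named `html`, this might be called as `MediaTypeFilter.restrictTo(html.getEmbeddableMediaTypes(), allowed)`.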
  - setOutputDirectory
```
public void setOutputDirectory(java.io.File outputDirectory)
                        throws java.io.IOException
```
    Set the directory where the HtmlDerivation class should create files that are referenced from the HTML (currently, just images). If no directory is set, the images will be embedded as data URLs (the default)
    
    Parameters:
    
    outputDirectory - the directory to write files to, or null to use "data" URLs
    
    Throws:
    
    java.io.IOException
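
    Whether the output directory must already exist is not stated above, so a cautious caller might create it first. The following is a minimal sketch using only the standard library; `OutputDirs` is a hypothetical helper name.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not part of org.faceless.pdf2.
final class OutputDirs {
    private OutputDirs() {}

    // Creates the directory and any missing parents if needed, then returns it
    // as a File. Throws an IOException if the path exists but is not a directory.
    static File ensureDirectory(Path dir) throws IOException {
        Files.createDirectories(dir);
        return dir.toFile();
    }
}
```

    The returned File could then be passed to setOutputDirectory, while passing null keeps the default of embedding images as data URLs.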
  - derive
```
public void derive()
```
    Derive an XHTML document DOM from the given PDF document. The derived CSS will be directly contained in a style element as opposed to a link. This document can then be serialized to HTML with a <!DOCTYPE html> doctype declaration as part of a serialization step if required, using UTF-8 charset as specified.
  - getDocument
```
public org.w3c.dom.Document getDocument()
```
    Return the Document generated by derive(), or null if it hasn't been generated yet
    
    Returns:
    
    the document
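
    Since the result is a standard org.w3c.dom.Document, it can be inspected with the usual DOM API before it is written out. As a sketch, the hypothetical `DerivedDom` helper below lists the `src` attribute of each `img` element, which may be useful for checking whether images were embedded as data URLs or written to files. It assumes the derived elements use unprefixed names such as `img`.

```java
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper; not part of org.faceless.pdf2.
final class DerivedDom {
    private DerivedDom() {}

    // Returns the src attribute of every <img> element, in document order.
    // Assumes unprefixed element names; returns an empty list for a null document.
    static List<String> imageSources(Document doc) {
        List<String> sources = new ArrayList<>();
        if (doc == null) {
            return sources; // getDocument() returns null before derive() has run
        }
        NodeList images = doc.getElementsByTagName("img");
        for (int i = 0; i < images.getLength(); i++) {
            sources.add(((Element) images.item(i)).getAttribute("src"));
        }
        return sources;
    }
}
```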
  - writeAsXHTML
```
public void writeAsXHTML(java.lang.Appendable out)
                  throws java.io.IOException
```
    Write the specified document as XHTML.
    
    Parameters:
    
    out - the Appendable to write to
    
    Throws:
    
    java.io.IOException
  - writeAsHTML
```
public void writeAsHTML(java.lang.Appendable out)
                 throws java.io.IOException
```
    Write the specified document as HTML.
    
    Parameters:
    
    out - the Appendable to write to
    
    Throws:
    
    java.io.IOException
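
    Both write methods accept any java.lang.Appendable, so the output can go to a StringBuilder or a Writer as well as System.out. The sketch below captures the markup as a String or writes it to a UTF-8 file. `MarkupWriter` and `MarkupOutput` are hypothetical names; the interface only mirrors the documented shape of writeAsHTML and writeAsXHTML so that a method reference such as `html::writeAsHTML` could be passed in.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in for writeAsHTML / writeAsXHTML:
// both take an Appendable and may throw IOException.
@FunctionalInterface
interface MarkupWriter {
    void writeTo(Appendable out) throws IOException;
}

// Hypothetical helper; not part of org.faceless.pdf2.
final class MarkupOutput {
    private MarkupOutput() {}

    // Collects the serialized markup in memory and returns it as a String.
    static String asString(MarkupWriter writer) throws IOException {
        StringBuilder sb = new StringBuilder();
        writer.writeTo(sb);
        return sb.toString();
    }

    // Writes the serialized markup to a file, encoded as UTF-8.
    // The writer is closed (and flushed) even if writing fails.
    static void toFile(MarkupWriter writer, Path file) throws IOException {
        try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            writer.writeTo(out);
        }
    }
}
```

    After derive() has been called on an HtmlDerivation named `html`, `MarkupOutput.toFile(html::writeAsHTML, path)` would be one way to save the result. UTF-8 is chosen here to match the charset mentioned in the description of derive().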
