
Text Extraction Using BFO's PDF Library

How to extract text from a PDF using BFO's PDF Library API, with code examples showing how it can be done.

Read more...

New features in the PDF Library 2.15

It's been 4 months since our last PDF API release, so what does this one have in store? Besides changes to the page list, there are two major new areas:

  • PDF/A-2 and PDF/A-3 support has been added
  • The Swing classes now support linearized loading

New PDF/A revisions

We're seeing more and more companies adopt ISO 19005, aka PDF/A, and we're pleased to have added support for revisions 2 and 3 of the specification. Of course there's no need to change if you're already targeting PDF/A-1, and if you're not familiar with the new revisions then this is a good summary. But for those who need the features allowed in the later revisions of the specification, this new release is for you. We think the most significant are:

  • Embedded files are now allowed: those files must also be PDF/A for PDF/A-2, but this is relaxed in PDF/A-3
  • JPEG2000 compression is now allowed
  • Transparency is now allowed

Currently we only support creation of PDF/A-2b and PDF/A-3b documents, but support for the "U" variation (for Unicode) will be in an upcoming release.

Linearization support in the viewer

For some customers, this is the big one. Linearized documents are designed to be displayable before the entire document has been downloaded, but although we added support for this to the core API in the previous release 2.14, it took until now to get this added to the viewer. It's a complex change, because it invalidates some previous assumptions (namely that pdf.getPage() will return immediately).

The good news is it's in and working. For a demonstration, point your web browser to our example applet and select the 12MB "Linearized Example" from the drop-down list above it. The first page should show within a few seconds, but if you check the title bar you'll see a percentage showing how much of the document has actually been downloaded.

How to take advantage of Linearization

To make use of this new feature there's actually very little you need to do. The PDF viewer will do this automatically if the following conditions are met:
  1. The PDF you're loading has to be linearized - probably goes without saying, but we'll say it anyway. Our PDF Library has been able to create linearized PDFs for a long time, and of course most other tools can create them too - they're variously called "Web Ready" or "Optimized" PDF in Acrobat.
  2. The PDF must be loaded from an HTTP or HTTPS URL. Our viewer has an API method to do this: PDFViewer.loadPDF(URL) - and if you're using the viewer as an applet you can do this by specifying the URL (relative or absolute) with the pdf parameter to the applet. See the PDFViewerApplet applet for details.
  3. The web server serving the PDF must support the Range HTTP header in requests, and it must advertise this by adding Accept-Ranges: bytes in the initial response. Most do when the file is a static file served from the filesystem by the server's default handler.

    If you've got your own servlet which is serving the files, as you might if they were loaded from a database for instance, then you need to make sure you've implemented this. Your servlet will see an initial request for the PDF, and if the PDF is linearized that will be cancelled and many other requests made for smaller byte ranges. So if retrieving the PDF is a slow operation, perhaps because it's being retrieved from a remote location or a slow database, or perhaps because it might be modified by another process, then it makes sense to hold a copy of the PDF locally which can be discarded if there are no requests for a set period of time (we'd suggest 30 seconds to be safe).
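
If you are writing your own servlet, the core of Range support is parsing the header and clamping it to the length of the file. The following is a minimal sketch of that step only, assuming single-range requests of the form bytes=start-end. The ByteRange class and its method names are hypothetical and are not part of our API. A servlet using it would typically reply with status 206, a Content-Range header and the requested slice of the file, and would add Accept-Ranges: bytes to its responses.

```java
/** Hypothetical helper: parses a single-range "Range" request header. */
final class ByteRange {
    final long start;  // first byte, inclusive
    final long end;    // last byte, inclusive

    private ByteRange(long start, long end) {
        this.start = start;
        this.end = end;
    }

    /**
     * Returns the requested range clamped to the file length, or null if the
     * header is missing, malformed, multi-range or cannot be satisfied.
     * The caller decides what null means: typically the whole file when no
     * header was sent, or a 416 response when the range is unsatisfiable.
     */
    static ByteRange parse(String header, long length) {
        if (header == null || !header.startsWith("bytes=") || header.indexOf(',') >= 0) {
            return null;
        }
        String spec = header.substring(6).trim();
        int dash = spec.indexOf('-');
        if (dash < 0) {
            return null;
        }
        String first = spec.substring(0, dash).trim();
        String last = spec.substring(dash + 1).trim();
        try {
            long start, end;
            if (first.isEmpty()) {
                // Suffix form "bytes=-N": the final N bytes of the file
                if (last.isEmpty()) return null;
                long n = Long.parseLong(last);
                if (n <= 0) return null;
                start = Math.max(0, length - n);
                end = length - 1;
            } else {
                start = Long.parseLong(first);
                end = last.isEmpty() ? length - 1 : Math.min(Long.parseLong(last), length - 1);
            }
            if (start < 0 || start > end) return null;  // also rejects start >= length
            return new ByteRange(start, end);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /** Value for the Content-Range response header. */
    String contentRange(long length) {
        return "bytes " + start + "-" + end + "/" + length;
    }

    /** Number of bytes in the range, for Content-Length. */
    long size() {
        return end - start + 1;
    }
}
```

For example, ByteRange.parse("bytes=0-499", 1000) gives a range of 500 bytes whose contentRange(1000) is "bytes 0-499/1000", while a start beyond the end of the file returns null.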

Linearization and custom viewer features

If you've modified the viewer to add your own custom features, then there are more things to consider. First, if you're not loading linearized documents then you shouldn't need to worry too much: your features will still work, almost certainly without any changes required.

If you want to load linearized PDFs and use your custom features, then some work might be required. The main thing to remember is that a call to pdf.getPage(), or indeed any other code that returns a data structure of some sort from the PDF (form fields, bookmarks, file attachments etc.) might not return immediately - it might trigger a load. If you're doing this on the Swing thread then this will lock the thread, which of course is a bad thing.

To avoid this we've added the LinearizedSupport class to the viewer package. This is an easy way of adding callbacks, so your task will be run when the page is loaded. Let's say, for example, that your feature is going to jump to a specific page in the file when activated. Previously your code might have looked like this:

public void action(ViewerEvent event) {
    List<PDFPage> pages = pdf.getPages();
    PDFPage page = pages.get(pagenumber);
    getViewer().getActiveDocumentPanel().setPage(page);
}

This will jump the viewer to the page when run, but if that page hasn't been loaded yet the Swing thread will lock until it has (on the pages.get() line), which will make the application unresponsive. A linearization-aware approach would be to replace this with the following:

public void action(ViewerEvent event) {
    final DocumentPanel dp = getViewer().getActiveDocumentPanel();
    LinearizedSupport support = dp.getLinearizedSupport();
    support.invokeOnPageLoadWithDialog(pagenumber, new Runnable() {
        public void run() {
            dp.setPage(pdf.getPage(pagenumber));
        }
    });
}

This will bring up a loading dialog while the requested page is loading and switch pages on completion - or, if the page is already loaded, will switch immediately. The LinearizedSupport class has several other methods which allow you to schedule tasks when the PDF has loaded the required section of the file.
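
To illustrate the idea behind these callbacks, here is a minimal, self-contained sketch of the pattern: run the task at once if the page is already available, otherwise queue it until the page arrives. This is not how LinearizedSupport is implemented. The PageLoadTracker class and its method names are hypothetical, and a real viewer would also hand the task over to the Swing thread rather than run it on whichever thread reports the load.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of "run this task once page N has loaded". */
final class PageLoadTracker {
    private final Set<Integer> loaded = new HashSet<Integer>();
    private final Map<Integer, List<Runnable>> waiting = new HashMap<Integer, List<Runnable>>();

    /** Runs the task now if the page is loaded, otherwise queues it. */
    void invokeOnPageLoad(int pagenumber, Runnable task) {
        synchronized (this) {
            if (!loaded.contains(pagenumber)) {
                List<Runnable> queue = waiting.get(pagenumber);
                if (queue == null) {
                    queue = new ArrayList<Runnable>();
                    waiting.put(pagenumber, queue);
                }
                queue.add(task);
                return;
            }
        }
        task.run();  // page already loaded: run outside the lock
    }

    /** Called by the loader when the data for a page has arrived. */
    void pageLoaded(int pagenumber) {
        List<Runnable> queue;
        synchronized (this) {
            loaded.add(pagenumber);
            queue = waiting.remove(pagenumber);
        }
        if (queue != null) {
            for (Runnable task : queue) {
                task.run();
            }
        }
    }
}
```

Each queued task runs exactly once: the queue for a page is removed when the page is reported as loaded, so a repeated pageLoaded call does not run the tasks again.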

BFO PDF Library 2.15 - but what happened to 2.14.1?

You released 2.14.1 of your PDF API yesterday, and today there's a 2.15. What are you people playing at? Read on, we'll explain.

Read more...


Valuation Företagsvärderingar - creating professional PDF reports & graphs with BFO Software

Swedish company valuations leader creates valuation reports for clients with the BFO Report Generator and BFO Graph Library.

Read more...

Archiving PDF Documents with BFO for the Austrian Notaries Chamber

Long term PDF/A archiving for the Austrian Notaries Chamber, thanks to cyberDOC and BFO.

Read more...

Converting PDFs to bitmap PDFs

When only a raster will do, how to do it efficiently

There are many situations where a PDF has to be "rasterized" - the contents of each page turned into a bitmap image - such as when a PDF is being converted to PDF/A and the page contents cannot be repaired. This article shows how to do it efficiently.

Read more...

ObjectiveIT Integrates BFO's Report Generator into Insurance Tariff Comparison Software

ObjectiveIT develops an insurance tariff comparison solution for their insurance broker clients with the Report Generator.

Read more...

BFO releases Java PDF Library 2.13

A bundle of small changes, and the permissions framework.

We've put out our first PDF Library release in 5 months, and although there are a lot of small changes there are very few headline grabbers. Perhaps the most interesting is the ability to restrict operations in the viewer with permissions - here we go into that framework in a little more detail.

Read more...

The Firefox pdf.js Viewer

We've been getting a few emails asking about the new "pdf.js" viewer in Firefox, and why some of our documents don't render correctly in that viewer. Read on to find out why.

Read more...


Odds and Ends - PDF Valentines Cards

Because it's Friday

This challenge was too good to resist. We've neglected to make our cards PDF/A compliant, which you are welcome to interpret as a commentary on the impermanence of romantic love, or perhaps it would have just taken longer to do.

Either way we hope you had a happy Hallmark day. The code is below, and if you want to generate your own cards for someone you love (or even someone you don't) you can do so with this form.

import org.faceless.pdf2.*;
import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;
import java.awt.geom.*;
import java.awt.*;

public class ValentineServlet extends HttpServlet {

  public static PDF makeCard(String to, String from) {
    PDF pdf = new PDF();
    PDFPage page = pdf.newPage("A5");

    // Make a heart. 
    GeneralPath p = new GeneralPath();
    p.moveTo(0, -10);
    p.curveTo(20, 30, 60, 30, 70, -10);
    p.curveTo(70, -30, 60, -40, 50, -50);
    p.lineTo(0, -90);
    p.lineTo(-50, -50);
    p.curveTo(-60, -40, -70, -30, -70, -10);
    p.curveTo(-60, 30, -20, 30, 0, -10);
    Rectangle2D r = p.getBounds();

    float linewidth = 8;
    PDFCanvas heart = new PDFCanvas((float)r.getWidth()+linewidth, (float)r.getHeight()+linewidth);
    PDFStyle style = new PDFStyle();
    style.setFillColor(new Color(255, 0, 128));
    style.setLineColor(new Color(128, 0, 0));
    style.setLineWeighting(linewidth);
    heart.setStyle(style);
    heart.transform(AffineTransform.getTranslateInstance(-r.getMinX()+linewidth/2, -r.getMinY()+linewidth/2));
    heart.drawShape(p);

    // Draw loads of hearts randomly rotated and
    // positioned onto a canvas
    PDFCanvas canvas = new PDFCanvas(page.getWidth(), page.getHeight());
    for (int i=0;i<200;i++) {
      AffineTransform t = new AffineTransform();
      // Rotate left/right by <= 45°, scale by a factor between 2/3 and 2
      t.rotate((Math.random() - 0.5) * Math.PI / 2);
      t.translate(Math.random() * page.getWidth() - heart.getWidth()/2, Math.random() * page.getHeight() - heart.getHeight()/2);
      double scale = 1 / (Math.random() + 0.5);
      t.scale(scale, scale);
      canvas.save();
      canvas.transform(t);
      canvas.drawCanvas(heart, 0, 0, heart.getWidth(), heart.getHeight());
      canvas.restore();
    }
    page.drawCanvas(canvas, 0, 0, canvas.getWidth(), canvas.getHeight());

    // Add the text
    PDFStyle textstyle = new PDFStyle();
    textstyle.setFillColor(Color.white);
    textstyle.setLineColor(Color.black);
    textstyle.setFontStyle(PDFStyle.FONTSTYLE_FILLEDOUTLINE);
    textstyle.setFont(new StandardFont(StandardFont.HELVETICA), 40);

    PDFStyle smallstyle = new PDFStyle(textstyle);
    smallstyle.setFont(new StandardFont(StandardFont.HELVETICABOLDOBLIQUE), 24);

    LayoutBox box = new LayoutBox(page.getWidth());
    if (to != null) {
      box.addText("Dear "+to+"\n\n", smallstyle, null);
    }
    box.addText("Roses are red\nViolets are blue\nHere's a PDF\nJust for You\n\n", textstyle, null);
    textstyle.setFont(new StandardFont(StandardFont.HELVETICABOLDOBLIQUE), 24);
    box.addText("Nothing says \"I Love You\"\nlike ISO PDF 32000-1:2008.\n\n", smallstyle, null);
    box.addText("Happy Valentines Day\nfrom ", smallstyle, null);
    if (from != null) {
      box.addText(from+" and ", smallstyle, null);
    }
    box.addText("BFO", smallstyle, null);
    page.drawLayoutBox(box, 50, 500);

    return pdf;
  }

  public void doGet(HttpServletRequest req, HttpServletResponse res) throws IOException {
    String from = req.getParameter("from");
    String to = req.getParameter("to");
    if (from != null && from.trim().length() == 0) {
      from = null;
    }
    if (to != null && to.trim().length() == 0) {
      to = null;
    }
    PDF pdf = makeCard(to, from);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    pdf.render(out);
    out.close();

    res.setContentLength(out.size());
    res.setContentType("application/pdf");
    out.writeTo(res.getOutputStream());
  }

  public static void main(String[] args) throws Exception {
    String to = args.length > 0 ? args[0] : null;
    String from = args.length > 1 ? args[1] : null;
    PDF pdf = makeCard(to, from);
    // Close the stream explicitly so the file is flushed to disk
    OutputStream out = new FileOutputStream("valentine.pdf");
    pdf.render(out);
    out.close();
  }

}

XFA Forms

A FAQ

The "P" in PDF stands for "Portable", and PDF is now an ISO Specification. So you could be forgiven for being surprised when you learn about XFA. We're asked about it a lot so what follows is a bit of a FAQ.

What is XFA?

XFA stands for "XML Forms Architecture", and it's been part of Acrobat since Acrobat 6. It's an XML syntax which defines the document (the whole document, not just the form fields) and is embedded inside the PDF. While the specification itself is open and available, it's not part of the ISO PDF specification. It's also long (1500+ pages) and complex, having gone through 10 revisions since Acrobat 6.

Why does it exist?

Well, the original forms in PDF are arguably a bit of a flawed design and there are a lot of things that could have been done better, so there was room for improvement. XFA is a dialect of XML, which is a sensible container format, and it separates data from content in much the same way as the W3C XForms specification, which is undeniably a good thing.

So what are the problems with XFA?

Personally I have quite a list, but the main one is that XFA replaces, rather than augments, the PDF specification: the PDF file is now just a container, and the entire document is defined in the XFA layer. It undoubtedly warranted a new XFA file format, so by trying to elbow it in via the existing PDF standard, Adobe ensured a generation of confusion and annoyance for third-party vendors and their customers.

Which tools support it?

For full support, you need Adobe's own products. Our API has limited support as described below, and we expect other third-party products to have support ranging from none to limited.

How do I create an XFA document?

You need an XFA-aware PDF producer, which is likely to be Adobe LiveCycle. When you save your document it will save it as an XFA PDF, and you'll have two options:
  • By default, the XFA-enabled PDF is just a basic shell around the XFA document. The entire document is defined in XFA, and an application that's not aware of XFA simply gets a single page PDF requesting you use a newer version of Acrobat.
  • You can also save your XFA PDF in "compatibility" mode, which will also create the pages, form fields and other content in the normal PDF way - the document is effectively stored twice, once as XFA, once as PDF. An XFA-aware application like Acrobat will read from the XFA layer (and ignore the PDF layer), and a non-XFA aware application will ignore the XFA layer and use the PDF layer. Obviously, subsequent edits should be made with a tool that can keep the two in sync.

What support do BFO tools have for XFA?

  • For PDFs saved without the "compatibility" layer, almost none. You can retrieve or update the XFA object as an XML document, or you can update just the "datasets" object, which is the data model. This effectively allows you to read and write the form values, although you can't see the fields themselves. You can also read/write the document metadata (author, title etc.) but anything else related to document content is unavailable: you can't access the document pages (the pagelist will always return a single dummy page) and you can't view or edit the form fields.
  • PDFs saved with a "compatibility" layer can be accessed for reading in a normal way - the PDF pages are valid so you can display them in our viewer or list the form fields and their content. You can also update the values of the form fields (we synchronize the XFA data to match) but any other changes to the PDF will not be synchronized and so will be ignored by Acrobat - so changes like this should be avoided.

    The final thing you can do with compatibility XFA documents is delete the XFA layer. Once removed, Acrobat will treat the PDF as a normal PDF and pages can be modified, form fields added or removed without problems.
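
Because the "datasets" object mentioned above is ordinary XML, the form values in it can be read with the standard Java XML APIs once you have it as text or as a document. The sketch below is an illustration under several assumptions: it parses a datasets fragment supplied as a string rather than retrieving it from a PDF, the XfaDatasetsReader class is hypothetical, and it simply collects every leaf element below the xfa:data element. Real forms may nest or repeat their data in ways this does not handle.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper: lists the leaf values found in an XFA datasets XML. */
final class XfaDatasetsReader {
    static final String XFA_DATA_NS = "http://www.xfa.org/schema/xfa-data/1.0/";

    /** Maps each leaf element name below xfa:data to its text content. */
    static Map<String, String> readValues(String datasetsXml) {
        Map<String, String> values = new LinkedHashMap<String, String>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(datasetsXml)));
            Node data = doc.getElementsByTagNameNS(XFA_DATA_NS, "data").item(0);
            if (data != null) {
                for (Node n = data.getFirstChild(); n != null; n = n.getNextSibling()) {
                    if (n instanceof Element) {
                        collect((Element) n, values);
                    }
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse datasets XML", e);
        }
        return values;
    }

    /** Recurses through the tree, recording elements with no child elements. */
    private static void collect(Element parent, Map<String, String> values) {
        boolean hasChildElements = false;
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                hasChildElements = true;
                collect((Element) n, values);
            }
        }
        if (!hasChildElements) {
            values.put(parent.getLocalName(), parent.getTextContent().trim());
        }
    }
}
```

Given a fragment whose xfa:data element wraps a form1 element containing Name and Amount children (made-up field names), readValues returns a map with those two entries in document order. Fields that share a name overwrite each other in this sketch, so a real implementation would want to key on the full path.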

How do I know what sort of PDF I have?

  • To identify an XFA document, you can check the XFAForm feature in the PDF OutputProfile:
    boolean xfa = pdf.getBasicOutputProfile().isSet(OutputProfile.Feature.XFAForm);
    
  • Identifying a non-compatibility layer PDF is trickier. Our API will only find a single page and no form fields, and most XFA documents would contain at least one field so this is probably a good test. The only way to know for sure is if you open the PDF with our viewer (or any non-Acrobat viewer) and you see "To view the full contents of this document, you need a later version of the PDF viewer", "If this message is not eventually replaced by the proper contents of the document, your PDF viewer may not be able to display this type of document." or other words to that effect.

How do I delete the XFA layer and what are the consequences?

How is very simple: with our API, just call
pdf.getForm().removeXFA();
There are some XFA features that cannot be supported in PDF. For example, if your form allows you to choose date fields from a date picker then deleting the XFA will remove that functionality, and in general anything related to field validation will probably go as well (although with some effort it's probably possible to reimplement this with regular PDF JavaScript).

For documents that are going through a final stage of processing before being sent out, and where the customer isn't expected to modify the form, removing the XFA layer should be fine.

What is the best practice for using XFA?

  1. If there are any products in the PDF's life cycle that are not produced by Adobe - this includes general tools like ours, PDF viewers (perhaps on your customer's machine), and any archival requirements like those imposed by PDF/A or print service suppliers - then the best practice is to avoid it. Support from third-party vendors is extremely limited and likely to stay that way.
  2. If you have to use XFA, then always save your PDF with a "compatibility" layer. This will allow basic modifications as described above, and will give you the option of deleting the XFA layer if necessary.

How to print with "Comments Summary"

This article shows how you can create a custom viewer feature that duplicates the functionality of Acrobat's "Print with Comments Summary" feature.

Read more...

New features in PDF Library 2.12

What have we been up to?

Yesterday we released our first PDF Library for a few months, version 2.12, so it's a good time to give a bit of a summary of the changes.

Read more...

Client Customizes PDF Viewer Using Source Code

Client adopts BFO's customizable Java PDF Viewer for their project.

Read more...

New features in the PDF Library 2.11.25

A quick summary

We've recently released version 2.11.25 - here's a quick summary of some of the features.

Read more...

Barcode Fields

Taking the data-entry out of printed forms

Acrobat added dynamically updated barcode fields in Acrobat 7, but they haven't been documented until now, in the upcoming PDF 2.0 specification. This article shows you how to use them to make data extraction from printed forms a lot easier.

Read more...

Handwritten Digital Signatures

A new feature in 2.11.25 of the PDF Library is the ability to capture handwritten signatures from an iPad, iPhone or Android tablet. Useful? Maybe not, but it is kinda neat as you can see in the video.

Read more...

Roll your own applet: the definitive guide

A step-by-step guide to applet deployment.

We've covered them before, but Applets - a technology launched in 1996 with Java 1.0 - just keep changing. What follows is the definitive, step-by-step guide to compiling our viewer as an Applet, current as of mid-2012.

Read more...

Create PDF Tax Receipts in Real-Time Using BFO's Report Generator

Benevity, a micro-donation platform, implements BFO's Java Report Generator to create PDF receipts and invoices on the fly.

Read more...

Reader Extensions, and why they break

Usually on this blog we cover details of things you can do with our PDF API, so it's a bit of a departure to cover something we can't. Read on to learn about Reader Extensions and the limitations they imply.

Read more...
