Gotchas when reading and writing PDF to files

When we added support for editing existing PDFs back in 2001, things were simple: the PDF was read entirely into memory, and any subsequent changes to the file it was read from didn't matter. Life was simple but customers would occasionally run out of memory, so we had to investigate alternatives.

Enter 2008. Releases since 2.10 no longer require the entire PDF to be read into memory. Instead, sections of the PDF may optionally be read in as needed from the original file, using the java.nio package. For certain customers this has been a huge help, but it's not without problems. This article describes those problems and the solutions for them added in release 2.11.2.

Avoiding NIO completely

First, you don't have to use NIO. If you create a PDFReader with an InputStream rather than a File, the entire PDF will be read into RAM as before. This is faster and much less complicated, so when memory isn't an issue this is the best option.

Alternatively, passing a File into the PDFReader constructor will cause the original file to be referenced throughout the life of the PDF object. Large binary objects like images, fonts and page contents are loaded as needed from the file, reducing memory use. As a further bonus, if you save that PDF to disk the java.nio package can copy data directly from source to destination file, without having to move the data through the JVM. On many operating systems this is done at the OS level and can be much faster than using the Java I/O package.
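
To make the "copy directly from source to destination" point concrete, here is a minimal sketch using only the standard java.nio classes. It is not the library's own code: the class and method names are made up for illustration, and it simply copies a byte range from one file to another with FileChannel.transferTo, which lets the operating system move the data where it can.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ChannelCopySketch {

    // Copies up to `count` bytes, starting at `position` in `source`, into `dest`.
    // Any existing content of `dest` is replaced. Returns the number of bytes copied.
    static long copyRange(Path source, long position, long count, Path dest) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dest, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long done = 0;
            // transferTo may move fewer bytes than requested, so loop until finished.
            while (done < count) {
                long n = in.transferTo(position + done, count - done, out);
                if (n <= 0) {
                    break; // nothing more to read from the source
                }
                done += n;
            }
            return done;
        }
    }

    // Small self-contained demonstration using temporary files.
    static String demo() {
        try {
            Path source = Files.createTempFile("source", ".bin");
            Path dest = Files.createTempFile("dest", ".bin");
            try {
                Files.write(source, "0123456789".getBytes(StandardCharsets.US_ASCII));
                long copied = copyRange(source, 2, 5, dest);
                String result = new String(Files.readAllBytes(dest), StandardCharsets.US_ASCII);
                return copied + " bytes copied: " + result;
            } finally {
                Files.deleteIfExists(source);
                Files.deleteIfExists(dest);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Whether the copy really bypasses the JVM depends on the operating system and filesystem; where it can't, the same call falls back to an ordinary copy.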

The down-side of NIO

These benefits do not come for free. When using this approach the most important rule is: do not alter the original file. The following code will quickly lead to corruption, because you're writing to the same file you're reading from:
File file = new File("file.pdf");
PDF pdf = new PDF(new PDFReader(file));

// Changes to the PDF go here

OutputStream out = new FileOutputStream(file);
pdf.render(out);
out.close();

// Don't do this! Resulting PDF is possibly corrupt

The other rules are more generic. First, if your PDF file is on a network share this approach will be slower as there's more I/O involved. Second, the more PDFs you work with at once the more open file descriptors you'll have. Operating systems define an upper limit to this, so depending on what else your JVM is doing this may be an issue (UNIX systems can increase this using the ulimit command).

Why can't you prevent a file being overwritten in Java?

You might wonder why we don't prevent Java from writing to a file we're currently reading from. The answer is that there's no guaranteed way to do this in Java.

First, given a FileOutputStream or FileInputStream, there's no way to identify the original file it references. We feel this is an odd omission from the java.io package, but without it there's no way to tell whether two streams refer to the same file. Even if there were, wrapping the output in a BufferedOutputStream would mask this.
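
If your own code still holds the source and destination as paths rather than streams, a check is possible before any stream is opened. The sketch below assumes Java 7 or later and uses the standard Files.isSameFile method. The class and method names are ours, purely for illustration, and none of this helps once all you have is a stream.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class SameFileCheckSketch {

    // True if both paths resolve to the same file on disk, even when spelled differently.
    // A target that does not exist yet cannot be the source, so that case returns false.
    static boolean refersToSameFile(Path source, Path target) {
        try {
            return Files.exists(target) && Files.isSameFile(source, target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Small self-contained demonstration using a temporary directory.
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("samefile");
            Path a = Files.createFile(dir.resolve("a.pdf"));
            Path b = Files.createFile(dir.resolve("b.pdf"));
            try {
                // Same file reached through a differently spelled path.
                Path alias = dir.resolve(".").resolve("a.pdf");
                return refersToSameFile(a, alias) + " " + refersToSameFile(a, b);
            } finally {
                Files.deleteIfExists(a);
                Files.deleteIfExists(b);
                Files.deleteIfExists(dir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```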

The java.nio package introduces FileLock, and at first glance this should help. However, there's a problem.

File file = new File("myfile");
FileInputStream in = new FileInputStream(file);
FileLock rlock = in.getChannel().lock(0, Long.MAX_VALUE, true);

FileOutputStream out = new FileOutputStream(file);
FileLock wlock = out.getChannel().lock();
// Locking exception is thrown

Looks fine? Look closer. You can't obtain a write lock without opening the file for writing, and opening a FileOutputStream zeroes the file before the lock is even attempted. The solution is to ensure you always write with a RandomAccessFile, which is less than ideal. FileLocks are useful in some situations, but not this one. Caveat Programmer.
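
For completeness, here is a rough sketch of the RandomAccessFile route, using only standard classes (the class name is ours). Opening the file in "rw" mode gives a writable channel without truncating it, so the lock can be taken while the contents are still intact. The sketch also shows that a second overlapping lock requested from the same JVM is refused with an OverlappingFileLockException.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class LockWithoutTruncateSketch {

    // Small self-contained demonstration using a temporary file.
    static String demo() {
        try {
            Path path = Files.createTempFile("locktest", ".bin");
            try {
                Files.write(path, "original".getBytes(StandardCharsets.US_ASCII));
                long lengthWhileLocked;
                boolean secondLockRefused = false;
                // "rw" opens for writing but, unlike FileOutputStream, does not truncate.
                try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "rw");
                     FileChannel channel = raf.getChannel();
                     FileLock lock = channel.lock()) {
                    lengthWhileLocked = channel.size();
                    // A second lock on the same region from this JVM is rejected.
                    try (FileChannel other = FileChannel.open(path, StandardOpenOption.WRITE)) {
                        other.tryLock();
                    } catch (OverlappingFileLockException e) {
                        secondLockRefused = true;
                    }
                }
                return "length=" + lengthWhileLocked + " secondLockRefused=" + secondLockRefused;
            } finally {
                Files.deleteIfExists(path);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

This only protects you if every writer cooperates by opening the file the same way, which is exactly why we can't rely on it inside the library.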

But I really need to write to the same file!

What if you have to write to the same file? Luckily there are a couple of solutions: the first applies if you can save then immediately discard the PDF, the second if you need to continue to work on the PDF after saving.

Save then rename

The correct approach when saving to the same file is to save to a temporary file, then rename. This applies well beyond the PDF library - any software with this requirement should take this approach. In Java it's quite simple: here's the above example again, this time rendering to a temporary file in the same directory:
File file = new File("file.pdf");
PDF pdf = new PDF(new PDFReader(file));

// Changes to the PDF go here

File dir = file.getParentFile();
File temp = File.createTempFile("pdftemp", null, dir);
OutputStream out = new FileOutputStream(temp);
pdf.render(out);
out.close();
pdf.close();
file.delete();
temp.renameTo(file);

// PDF is saved correctly but the "pdf" object is now invalid

This will overwrite the original PDF file safely, but leaves the PDF object in a state where it can no longer be used. Some points:

  1. Creating the temp file in the same directory is a good idea, as it ensures renaming will not require a copy between filesystems.
  2. In 2.11.2 and later, the PDF.close() method will ensure the original file is closed. Without it you'd have to wait until the PDF was finalized before you can delete the file it was read from.
  3. Why delete it first? On Windows, File.renameTo will fail if the destination file exists. On UNIX it's not necessary.
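
The same pattern can be written as a small reusable helper. The sketch below is not part of the PDF library: it assumes Java 7 or later, and the SafeReplaceSketch class and its Writer callback are hypothetical names of ours. It uses Files.move with REPLACE_EXISTING, which avoids the separate delete step and reports failure with an exception rather than the boolean that File.renameTo returns.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class SafeReplaceSketch {

    // Hypothetical callback: whatever produces the new contents,
    // for example a call to pdf.render(out).
    interface Writer {
        void writeTo(OutputStream out) throws IOException;
    }

    // Writes the new contents to a temp file next to `target`, then moves it into place.
    static void replace(Path target, Writer writer) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        // Same directory, so the move is a rename rather than a cross-filesystem copy.
        Path temp = Files.createTempFile(dir, "pdftemp", null);
        try {
            try (OutputStream out = Files.newOutputStream(temp)) {
                writer.writeTo(out);
            }
            Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // No-op after a successful move; cleans up if writing or moving failed.
            Files.deleteIfExists(temp);
        }
    }

    // Small self-contained demonstration using a temporary file.
    static String demo() {
        try {
            Path target = Files.createTempFile("target", ".txt");
            try {
                Files.write(target, "old".getBytes(StandardCharsets.US_ASCII));
                replace(target, out -> out.write("new".getBytes(StandardCharsets.US_ASCII)));
                return new String(Files.readAllBytes(target), StandardCharsets.US_ASCII);
            } finally {
                Files.deleteIfExists(target);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If the original is still held open by a PDF object, it must be closed before the move, just as in the example above. On Windows the move will otherwise fail.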

For many situations that's enough, but what about when we need to keep editing the PDF after saving? This is common practice for GUI applications, for example.

Multiple Revisions

One nice aspect of the PDF file format is the ability to update a PDF by appending to the file. This creates slightly larger files, so it's normally only done by the PDF library when the document is digitally signed. However, in this situation it's useful because updating by appending a revision leaves existing objects in the same place in the file: any references to file positions will remain valid. Here's an updated example; the new lines are the two that set up the OutputProfile.

File file = new File("file.pdf");
PDF pdf = new PDF(new PDFReader(file));

// Changes to the PDF go here

OutputProfile profile = pdf.getBasicOutputProfile();
profile.setRequired(OutputProfile.Feature.MultipleRevisions);

File dir = file.getParentFile();
File temp = File.createTempFile("pdftemp", null, dir);
OutputStream out = new FileOutputStream(temp);
pdf.render(out);
out.close();
pdf.close();
file.delete();
temp.renameTo(file);

// Both PDF file and the "pdf" object are valid

The resulting file will be larger, although how much larger depends on what changes you make to the PDF before saving. However, if you really need to save back to the same file you opened from, it's a guaranteed way of avoiding data corruption.
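
To see why appending keeps file positions valid, here is a toy illustration using only standard classes. It does not produce a real PDF and has nothing to do with the library's internals. It records the byte offset of an "object" in a file, appends a second one, and reads the first back from the same offset.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class AppendKeepsOffsetsSketch {

    // Reads `length` bytes starting at `offset`, as ASCII text.
    static String readAt(Path file, long offset, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] buf = new byte[length];
            raf.seek(offset);
            raf.readFully(buf);
            return new String(buf, StandardCharsets.US_ASCII);
        }
    }

    // Small self-contained demonstration using a temporary file.
    static String demo() {
        try {
            Path file = Files.createTempFile("revisions", ".txt");
            try {
                Files.write(file, "header\nobject-1\n".getBytes(StandardCharsets.US_ASCII));
                long offset = 7; // "object-1" starts right after the 7 bytes of "header\n"
                String before = readAt(file, offset, 8);

                // The "new revision" is appended; nothing already in the file moves.
                Files.write(file, "object-2\n".getBytes(StandardCharsets.US_ASCII),
                        StandardOpenOption.APPEND);
                String after = readAt(file, offset, 8);
                return before + " " + after;
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```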