When we added support for editing existing PDFs back in 2001, things were simple: the PDF was read entirely into memory, and any subsequent changes to the file it was read from didn't matter. Life was simple but customers would occasionally run out of memory, so we had to investigate alternatives.
Enter 2008. Releases since 2.10 no longer require the entire PDF to be read into memory: instead, sections of the PDF may optionally be read in as needed from the original file, using the java.nio package. While for certain customers this has been a huge help, it's not without problems, and this article is intended to describe those problems and the solutions for them added in release 2.11.2.
Avoiding I/O completely
First, you don't have to use NIO. If you create a PDFReader with an InputStream rather than a File, the entire PDF will be read into RAM as before. This is faster and much less complicated, so when memory isn't an issue this is the best option.
Alternatively, passing a File into the PDFReader constructor will cause the original file to be referenced throughout the life of the PDF object. Large binary objects like images, fonts and page contents are loaded as needed from the file, reducing memory use. As a further bonus, if you save that PDF to disk the java.nio package can copy data directly from source to destination file, without having to move the data through the JVM. On many operating systems this is done at the OS level and can be much faster than using the Java I/O package.
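The direct file-to-file copy mentioned above can be illustrated with plain JDK classes. This is a minimal sketch, not the library's own code: the class name ChannelCopySketch and its copy method are made up for illustration, and it assumes Java 7 or later for try-with-resources and FileChannel.open.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, for illustration only.
final class ChannelCopySketch {

    /** Copies src to dest with FileChannel.transferTo and returns the number of bytes copied. */
    static long copy(Path src, Path dest) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dest,
                     StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            long size = in.size();
            long position = 0;
            // transferTo may move fewer bytes than requested, so loop until done.
            while (position < size) {
                long moved = in.transferTo(position, size - position, out);
                if (moved <= 0) {
                    break; // source shrank or nothing more can be transferred
                }
                position += moved;
            }
            return position;
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("nio-src", ".bin");
        Path dest = Files.createTempFile("nio-dest", ".bin");
        try {
            Files.write(src, new byte[] {1, 2, 3, 4, 5});
            System.out.println("bytes copied: " + copy(src, dest));
        } finally {
            Files.deleteIfExists(src);
            Files.deleteIfExists(dest);
        }
    }
}
```

Whether the transfer actually bypasses the JVM heap depends on the operating system and JVM; the API only allows for that optimisation.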
The down-side of I/O
These benefits do not come for free. When using this approach the most important rule is: do not alter the original file. The following code will quickly lead to corruption, because you're writing to the same file you're reading from:

    File file = new File("file.pdf");
    PDF pdf = new PDF(new PDFReader(file));
    // Changes to the PDF go here
    OutputStream out = new FileOutputStream(file);
    pdf.render(out);
    out.close();
    // Don't do this! Resulting PDF is possibly corrupt

The other rules are more generic. First, if your PDF file is on a network share this
approach will be slower as there's more I/O involved. Second, the more PDFs you work
with at once the more open file descriptors you'll have. Operating systems define
an upper limit to this, so depending on what else your JVM is doing this may be an
issue (on UNIX systems the limit can be raised using the ulimit command).
Why can't you prevent a file being overwritten in Java?
You might wonder why we don't prevent Java from writing to a file we're currently reading from. The answer is that there's no guaranteed way to do this in Java.
First, given a FileOutputStream or FileInputStream, there's no way to identify the original file it references. We feel this is an odd omission from the java.io package, but without it there's no way to identify if two streams refer to the same file. Even if there was, wrapping the output in a BufferedOutputStream would mask this.
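The limitation above concerns streams. When the calling code still holds the File objects themselves, a comparison can be sketched as follows. This assumes Java 7 or later, whose Files.isSameFile method post-dates the releases discussed here, and the class name SameFileSketch is made up for illustration. It does nothing for the stream-only case.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hypothetical helper, for illustration only.
final class SameFileSketch {

    /** Returns true if both File objects locate the same file on disk. */
    static boolean sameFile(File a, File b) throws IOException {
        // Paths that differ textually ("./x.pdf" vs "x.pdf") are checked against the
        // file system; in that case an IOException is thrown if a file does not exist.
        return Files.isSameFile(a.toPath(), b.toPath());
    }

    public static void main(String[] args) throws IOException {
        File original = File.createTempFile("same-file", ".pdf");
        File alias = new File(original.getParentFile(), "." + File.separator + original.getName());
        try {
            System.out.println("same file: " + sameFile(original, alias));
        } finally {
            original.delete();
        }
    }
}
```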
The java.nio package introduces FileLock, and at first glance this should help. However, there's a problem.

    File file = new File("myfile");
    FileInputStream in = new FileInputStream(file);
    // Shared (read) lock over the whole file
    FileLock rlock = in.getChannel().lock(0, Long.MAX_VALUE, true);
    // Opening the file for writing truncates it before any lock is requested
    FileOutputStream out = new FileOutputStream(file);
    FileLock wlock = out.getChannel().lock();   // Locking exception is thrown

Looks fine? Look closer. You can't obtain a write lock without opening a file for writing - which zeroes the file. The solution is to ensure you always write with a RandomAccessFile, which is less than ideal. FileLocks are useful in some situations, but not this one. Caveat Programmer.
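For reference, here is roughly what the RandomAccessFile route looks like. It is a minimal sketch using only JDK classes; the class and method names are made up for illustration, and it assumes Java 7 or later so that the lock can be released by try-with-resources.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;

// Hypothetical helper, for illustration only.
final class LockWithoutTruncateSketch {

    /** Takes an exclusive lock on the file and returns the file length seen while it is held. */
    static long lengthUnderExclusiveLock(File file) throws IOException {
        // "rw" opens the file for writing without truncating it,
        // unlike new FileOutputStream(file).
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw");
             FileChannel channel = raf.getChannel();
             FileLock lock = channel.lock()) { // exclusive lock over the whole file
            return channel.size();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("lock-demo", ".bin");
        try {
            Files.write(file.toPath(), new byte[] {1, 2, 3});
            System.out.println("length while locked: " + lengthUnderExclusiveLock(file));
        } finally {
            file.delete();
        }
    }
}
```

Bear in mind that file locks are held on behalf of the whole JVM, and whether they stop other programs from touching the file is system-dependent.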
But I really need to write to the same file!
What if you have to write to the same file? Luckily there are a couple of solutions: the first applies if you can save then immediately discard the PDF, the second if you need to continue to work on the PDF after saving.
Save then rename
The correct approach when saving to the same file is to save to a temporary file, then rename. This applies well beyond the PDF library - any software with this requirement should take this approach. In Java it's quite simple: here's the above example again with the new lines marked:

    File file = new File("file.pdf");
    PDF pdf = new PDF(new PDFReader(file));
    // Changes to the PDF go here
    File dir = file.getParentFile();                       // new
    File temp = File.createTempFile("pdftemp", null, dir); // new
    OutputStream out = new FileOutputStream(temp);         // new: write to the temp file
    pdf.render(out);
    out.close();
    pdf.close();                                           // new
    file.delete();                                         // new
    temp.renameTo(file);                                   // new
    // PDF is saved correctly but the "pdf" object is now invalid

This will overwrite the original PDF file safely, but leaves the PDF object in a state where it can no longer be used. Some points:
- Creating the temp file in the same directory is a good idea, as it ensures renaming will not require a copy between filesystems.
- In 2.11.2 and later, the PDF.close() method will ensure the original file is closed. Without it you'd have to wait until the PDF was finalized before you can delete the file it was read from.
- Why delete it first? On Windows, File.renameTo will fail if the destination file exists. On UNIX it's not necessary.
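The same save-then-rename pattern can be sketched without the PDF library, which may help when applying it elsewhere. The sketch below assumes Java 7 or later and uses Files.move with REPLACE_EXISTING in place of the delete-then-rename pair; the class name, the method name and the byte array standing in for pdf.render(out) are all made up for illustration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, for illustration only.
final class SaveThenRenameSketch {

    /** Replaces the contents of target by writing a sibling temp file and moving it into place. */
    static void replaceContents(Path target, byte[] newContents) throws IOException {
        // Same directory as the target, so the move does not cross filesystems.
        Path dir = target.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "pdftemp", null);
        try {
            try (OutputStream out = Files.newOutputStream(temp)) {
                out.write(newContents); // stand-in for pdf.render(out)
            }
            // Replaces an existing target, so no separate delete call is needed.
            Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            Files.deleteIfExists(temp); // only removes anything if the move failed
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("save-demo", ".txt");
        try {
            replaceContents(target, "updated".getBytes(StandardCharsets.UTF_8));
            System.out.println(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
        } finally {
            Files.deleteIfExists(target);
        }
    }
}
```

The original file must still be closed first: on Windows the move is likely to fail while another handle has the destination open.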
For many situations that's enough, but what about where we need to keep editing the PDF after save? This is common practice for GUI applications, for example.
Multiple Revisions
One nice aspect of the PDF file format is the ability to update a PDF by appending to the file. This creates slightly larger files, so it's normally only done by the PDF library when the document is digitally signed. However in this situation it's useful because updating by appending a revision leaves existing objects in the same place in the file: any references to file positions will remain valid. Here's an updated example with new lines marked.

    File file = new File("file.pdf");
    PDF pdf = new PDF(new PDFReader(file));
    // Changes to the PDF go here
    OutputProfile profile = pdf.getBasicOutputProfile();               // new
    profile.setRequired(OutputProfile.Feature.MultipleRevisions);      // new
    File dir = file.getParentFile();
    File temp = File.createTempFile("pdftemp", null, dir);
    OutputStream out = new FileOutputStream(temp);
    pdf.render(out);
    out.close();
    pdf.close();
    file.delete();
    temp.renameTo(file);
    // Both PDF file and the "pdf" object are valid

The resulting file will be larger, although how much really depends on what changes you make to the PDF before saving. However if you really need to save back to the same file you opened from, it's a guaranteed way of avoiding data corruption.