When we added support for editing existing PDFs back in 2001, things were simple: the PDF was read entirely into memory, and any subsequent changes to the file it was read from didn't matter. Life was simple but customers would occasionally run out of memory, so we had to investigate alternatives.
Enter 2008. Releases since 2.10 no longer require the entire PDF to be read into memory. Instead, sections of the PDF may optionally be read in as needed from the original file, using the java.nio package. While for certain customers this has been a huge help, it's not without problems, and this article is intended to describe those problems and the solutions for them added in release 2.11.2.
Avoiding I/O completely
First, you don't have to use NIO. If you create a PDFReader with an InputStream rather than a File, the entire PDF will be read into RAM as before. This is faster and much less complicated, so when memory isn't an issue this is the best option.
Alternatively, passing a File into the PDFReader constructor will cause the original file to be referenced throughout the life of the PDF object. Large binary objects like images, fonts and page contents are loaded as needed from the file, reducing memory use. As a further bonus, if you save that PDF to disk the java.nio package can copy data directly from source to destination file, without having to move the data through the JVM. On many operating systems this is done at the OS level and can be much faster than using the Java I/O package.
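To illustrate the kind of direct file-to-file copy described above, here is a minimal sketch using FileChannel.transferTo from the standard java.nio package. It is not the PDF library's own code: the ChannelCopy class and its copy method are names made up for this example, and it uses Java 7 conveniences (try-with-resources, java.nio.file.Path) that postdate the releases discussed here.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, for illustration only.
public class ChannelCopy {

    /** Copies src to dst channel-to-channel and returns the number of bytes copied. */
    public static long copy(Path src, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst,
                     StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            long size = in.size();
            long position = 0;
            // transferTo may move fewer bytes than requested, so loop until done.
            while (position < size) {
                long moved = in.transferTo(position, size - position, out);
                if (moved <= 0) {
                    break; // source shrank underneath us; stop rather than spin
                }
                position += moved;
            }
            return position;
        }
    }
}
```

Whether the transfer actually bypasses the JVM heap depends on the operating system and filesystem; where it can't, the call still works and simply falls back to an ordinary copy.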
The down-side of I/O
These benefits do not come for free. When using this approach the most important rule is: do not alter the original file. The following code will quickly lead to corruption, because you're writing to the same file you're reading from:
    File file = new File("file.pdf");
    PDF pdf = new PDFReader(file);
    // Changes to the PDF go here
    OutputStream out = new FileOutputStream(file);
    pdf.render(out);
    out.close();
    // Don't do this! Resulting PDF is possibly corrupt
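One reason this goes wrong so quickly is that opening a FileOutputStream on an existing file truncates it immediately, before a single byte is written, so anything still waiting to be loaded from that file is already gone. The following sketch shows just that effect using only the standard library; the TruncateDemo class and its method name are invented for this example and have nothing to do with the PDF library.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;

// Hypothetical demonstration class, for illustration only.
public class TruncateDemo {

    /** Opens the file for writing and returns its length before anything is written. */
    public static long lengthAfterOpeningForWrite(File file) throws IOException {
        try (OutputStream out = new FileOutputStream(file)) {
            // Nothing has been written yet, but the old contents are already gone.
            return file.length();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("demo", ".bin");
        Files.write(file.toPath(), new byte[1024]);
        System.out.println("before: " + file.length());
        System.out.println("after:  " + lengthAfterOpeningForWrite(file));
        file.delete();
    }
}
```

The second length reported is zero, even though the stream was opened and closed without writing anything.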
The other rules are more generic. First, if your PDF file is on a network share this
approach will be slower as there's more I/O involved. Second, the more PDFs you work
with at once the more open file descriptors you'll have. Operating systems define
an upper limit to this, so depending on what else your JVM is doing this may be an
issue (UNIX systems can increase this using the
But I really need to write to the same file!
What if you have to write to the same file? Luckily there are a couple of solutions: the first applies if you can save then immediately discard the PDF, the second if you need to continue to work on the PDF after saving.
Save then rename
The correct approach when saving to the same file is to save to a temporary file, then rename. This applies well beyond the PDF library: any software with this requirement should take this approach. In Java it's quite simple. Here's the above example again, changed to save by way of a temporary file:
    File file = new File("file.pdf");
    PDF pdf = new PDFReader(file);
    // Changes to the PDF go here
    // Create the temporary file alongside the original
    File dir = file.getParentFile();
    File temp = File.createTempFile("pdftemp", null, dir);
    OutputStream out = new FileOutputStream(temp);
    pdf.render(out);
    out.close();
    // Release the original file, then replace it with the new one
    pdf.close();
    file.delete();
    temp.renameTo(file);
    // PDF is saved correctly but the "pdf" object is now invalid
This will overwrite the original PDF file safely, but leaves the PDF object in a state where it can no longer be used. Some points:
- Creating the temp file in the same directory is a good idea, as it ensures renaming will not require a copy between filesystems.
- In 2.11.2 and later, the PDF.close() method will ensure the original file is closed. Without it you'd have to wait until the PDF was finalized before you can delete the file it was read from.
- Why delete it first? On Windows, File.renameTo will fail if the destination file exists. On UNIX it's not necessary.
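Since save-then-rename applies to any software, here is a minimal sketch of it as a reusable helper. It swaps in java.nio.file.Files.move (Java 7 and later) for the File.delete and File.renameTo pair, because the REPLACE_EXISTING option removes the need for a separate delete step. The SafeSave class and its Writer callback are hypothetical names for this example, not part of the PDF library.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, for illustration only.
public class SafeSave {

    /** Hypothetical callback that writes the document to the given stream. */
    public interface Writer {
        void writeTo(OutputStream out) throws IOException;
    }

    /** Writes to a temporary file beside the target, then moves it into place. */
    public static void save(Path target, Writer writer) throws IOException {
        // Same directory as the target, so the move never crosses filesystems.
        Path dir = target.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "pdftemp", null);
        try {
            try (OutputStream out = Files.newOutputStream(temp)) {
                writer.writeTo(out);
            }
            // REPLACE_EXISTING overwrites the target if it is already there.
            Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // Only has an effect if the write or the move failed.
            Files.deleteIfExists(temp);
        }
    }
}
```

With the library, the callback would be something like out -> pdf.render(out). The original file still has to be released before the move, as PDF.close() does in the example above, because replacing a file that is still open may fail on Windows.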
For many situations that's enough, but what about when we need to keep editing the PDF after saving? This is common practice for GUI applications, for example.
One nice aspect of the PDF file format is the ability to update a PDF by appending to the file. This creates slightly larger files, so it's normally only done by the PDF library when the document is digitally signed. However, in this situation it's useful because updating by appending a revision leaves existing objects in the same place in the file: any references to file positions will remain valid. Here's the example again, updated to save the changes as a new revision:
    File file = new File("file.pdf");
    PDF pdf = new PDFReader(file);
    // Changes to the PDF go here
    OutputProfile profile = pdf.getBasicOutputProfile();
    profile.setRequired(OutputProfile.Feature.MultipleRevisions);
    File dir = file.getParentFile();
    File temp = File.createTempFile("pdftemp", null, dir);
    OutputStream out = new FileOutputStream(temp);
    pdf.render(out);
    out.close();
    pdf.close();
    file.delete();
    temp.renameTo(file);
    // Both PDF file and the "pdf" object are valid
The resulting file will be larger, although how much larger depends on what changes you make to the PDF before saving. However, if you really need to save back to the same file you opened from, it's a guaranteed way of avoiding data corruption.
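The reason appending is safe comes down to byte offsets: anything already in the file stays exactly where it was. The toy sketch below shows this with a plain text file standing in for a PDF. It is not real PDF structure and doesn't use the PDF library; the AppendDemo class, its readAt method and the file contents are all invented for this example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demonstration class, for illustration only.
public class AppendDemo {

    /** Reads length bytes starting at offset, much as a lazy loader would. */
    public static String readAt(Path file, long offset, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] buf = new byte[length];
            raf.seek(offset);
            raf.readFully(buf);
            return new String(buf, StandardCharsets.US_ASCII);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("revisions", ".txt");
        Files.write(file, "HEADER object-one object-two".getBytes(StandardCharsets.US_ASCII));

        long offset = 7; // where "object-one" starts in the first revision
        String before = readAt(file, offset, 10);

        // Append a "revision" instead of rewriting the file from scratch.
        Files.write(file, " object-three".getBytes(StandardCharsets.US_ASCII),
                StandardOpenOption.APPEND);

        String after = readAt(file, offset, 10);
        System.out.println(before.equals(after));
        Files.delete(file);
    }
}
```

A full rewrite gives no such guarantee, as objects may be reordered or resized, which is why the PDF object in the plain save-then-rename example can't be used after saving.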