A few weeks ago Firefox released a free PDF viewer plugin, and a few days after that we started getting emails about it. The emails usually read:
We have created a PDF with your software, loaded it in Firefox and it doesn't look correct. What's up?
PDF is complex. It was first released in 1993 and has gone through about 9 revisions since then. It was also a specification that (until very recently) was developed by one company to match the latest features in their product, and while I'm not privy to Adobe's internal policy, I'd bet good money that the specification was written to match the functionality, not the other way around.
As anyone who has subsequently implemented a specification written this way knows, that makes for bad specifications. The logic behind a definition isn't normally clear, and there are always undocumented requirements - undocumented because the author thought they were too obvious to document, or because no-one thought to ask the question.
What this means for anyone developing PDF software is that following the specification will get you half-way there. The rest is testing, and after 11 years we are still getting documents from customers that don't work quite the way they should. "pdf.js" is new software, and it has some way to go - although its arrival indicates the continuing health of the PDF ecosystem, so we're glad to have them about.
Question: Who is correct?
If you create a PDF with our software, display it in Firefox (or any other viewer), and it looks incorrect, how do you know who's at fault? We're pointing the finger at the other product, but then we would, wouldn't we? Ultimately you just need to decide which support team to email.
With a specification that had been designed from the start with verification in mind, this would be an easy question - for example, an XML document can be validated against an XML schema to confirm it is formally correct.
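To make the comparison concrete, here's a minimal sketch of that kind of mechanical check in Python, using the third-party lxml library; the schema and document file names are hypothetical:

```python
# A minimal sketch of formal validation in the XML world. The file
# names are hypothetical; the point is that a single mechanical check
# settles the question of who is correct.
from lxml import etree

schema = etree.XMLSchema(etree.parse("invoice.xsd"))   # hypothetical schema
document = etree.parse("invoice.xml")                  # hypothetical document

if schema.validate(document):
    print("formally valid")
else:
    for error in schema.error_log:
        print(error.line, error.message)
```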
That's not currently possible with PDF, although there has been some interest in the industry in changing this. And for good reason: as I've said before, the "P" in PDF stands for portable, so every PDF that can't be exchanged is a cost to industry. Multiply that by the billions of documents in existence and you'll see that being able to categorically point the finger of blame at someone else has huge financial implications.
Formal Answer: We'll never know
In my opinion, formal verification of PDF is a nice idea but isn't going to happen, for two reasons. First, here's a partial list of specifications that need to be referenced to implement it:
- ISO PDF-32000
- Adobe TN5004 - AFM
- Adobe TN5014 - CID font CMap
- Adobe TN5015 - Type 1
- Adobe TN5096 - Japan 1-6 character collection
- Adobe TN5079 - GB 1-4 character collection
- Adobe TN5080 - CNS 1-4 character collection
- Adobe TN5092 - CID overview
- Adobe TN5093 - Korea 1-2 character collection
- Adobe TN5094 - CJKV collections
- Adobe TN5116 - Discrete Cosine decompression
- Adobe TN5176 - CFF specification
- Adobe TN5177 - Type 2 charstring
- Adobe TN5411 - ToUnicode mapping
- Adobe TN5604 - DeviceN colorspace mapping
- Adobe TN5641 - Embedding CID fonts
- Adobe Signature Build Dictionary specification
- Acrobat JavaScript Reference
- Adobe Digital Signature User Guide
- Adobe Standard Glyph List
- ITU-T X.680 ASN.1
- ITU-T X.690 BER encoding rules
- ITU-T T.0-T.63 and T.4 for CCITT G3
- ITU-T T.6 for CCITT G4
- ISO 639 - Language codes
- ISO 3166 - Country codes
- ISO 10918 - JPEG standard (see also TN5116 above)
- ISO 15444 - JPEG2000
- ISO 11544 - JBIG2
- IEC/3WD 61966-2.1 - sRGB ColorSpace
- ISO 15076 - ICC ColorSpace file
- ISO 10646 - Unicode
- ANSI X3.4-1986 - ASCII
- Acrobat 3D JavaScript reference
- Open Prepress Specification 1.3
- Adobe XML Forms Architecture specification (XFA) - all 8 revisions
- Adobe XML Data Package specification
- Adobe XMP specification
- TIFF Revision 6
- Adobe TN5087 - Multiple Master fonts
- Adobe TN5088 - Font naming issues
- Adobe TN5620 - Portable Job Ticket Format specification
- Adobe TN5660 - Open Prepress specification 2.0
- FIPS PUB 186-2 - Digital Signature Standard
- FIPS PUB 197 - AES standard
- RFC1321 - MD5 hash algorithm
- RFC1738 and 1808 - URLs
- RFC1950 and 1951 - Zlib compression
- RFC2045 - MIME
- RFC2083 - PNG
- RFC2315 - PKCS#7
- RFC2560 - X.509 OCSP
- RFC2616 - HTTP
- RFC2898 - PKCS#5
- RFC3066 - language codes (see also ISO3166)
- RFC3161 - Time stamp protocol
- RFC3174 - SHA1
- RFC3280 - CRLs
- RFC5702 - SHA2
- CSS 2 specification
- OpenType font specification
- TrueType specification
- ECMA-363 Universal 3D specification
- PANOSE font metrics guide
- ICC color registry
- ECMAScript specification
- Unicode annex 9 - bidirectional algorithm
- Unicode annex 14 - line breaking
- Unicode annex 29 - text boundaries
- XML specification
- RFC3629 - UTF-8
- GB 2312-80 character set
- GB 18030-2000
- Big Five character set
- Hong Kong special character set
- ETen extensions to Big Five
- CNS 11643-1992
- JIS X 0208
- Shift-JIS
- JIS C 6226
- ISO-2022
- JIS X 0213
- KS X 1001
- Dublin Core
- RDF
- XML schema
- PostScript specification
- PKCS #1 - RSA
- JFIF JPEG file interchange specification
- ETSI TS 102 293 PAdES
- ETSI TS 101 733 CAdES
- ETSI TS 102 778-1 and -2 PAdES for PDF
- ISO-15438 PDF417 Barcode specification
- ISO-19005 PDF/A
- PDF/A technote 0001 - namespaces
- PDF/A technote 0002 - color
- PDF/A technote 0003 - metadata
- PDF/A technote 0006 - predefined XMP property list
- PDF/A technote 0008 - XMP extension schema
- ISO-15930 PDF/X
- ISO-18004 QR-Codes
- HTML 3.2
- ISO-16022 Datamatrix
- SWF file format specification
We've probably missed a few (PDF/E and PDF/UA spring to mind), and there are a few aspects (e.g. the variations in CMYK colorspace handling in DCT-encoded images with Photoshop markers, the mapping of glyph to character code when an embedded TrueType font is missing the required tables, the correct behaviour when widget and form fields mismatch) which are, as best we can determine, undocumented.
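To give a flavour of the first of those, here's a minimal sketch (hypothetical file name, and very much not production code) of the kind of low-level digging involved: it scans a JPEG's marker segments for Adobe's APP14 marker, whose final "transform" byte governs colour interpretation. When a four-component JPEG carries this marker, the CMYK values are conventionally stored inverted - behaviour you learn from testing, not from a specification:

```python
# Sketch: locate the Adobe APP14 marker in a JPEG and return its
# colour-transform byte. File name is hypothetical.
import struct

def adobe_app14_transform(path):
    """Return the APP14 transform byte (0, 1 or 2), or None if absent."""
    with open(path, "rb") as f:
        data = f.read()
    if data[:2] != b"\xff\xd8":                     # no SOI: not a JPEG
        return None
    i = 2
    while i + 4 <= len(data) and data[i] == 0xFF:
        marker = data[i + 1]
        if marker == 0x01 or 0xD0 <= marker <= 0xD9:
            i += 2                                  # standalone markers: no payload
            continue
        (length,) = struct.unpack(">H", data[i + 2:i + 4])
        if marker == 0xEE:                          # APP14
            payload = data[i + 4:i + 2 + length]
            if payload[:5] == b"Adobe":
                return payload[-1]                  # 0=unknown, 1=YCbCr, 2=YCCK
        if marker == 0xDA:                          # SOS: image data follows
            break
        i += 2 + length
    return None
```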
Do you need all of the above specifications to implement PDF correctly? No, not all of them - we reference only about three-quarters of these in our library, and some are used only when creating a PDF, not when parsing one. But if you want to verify that a PDF is correct, you'll need to check it against many of the above. For this reason, determining whether a PDF is formally "correct" is an extremely difficult task.
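For a sense of scale, the sketch below (hypothetical file name) shows how far a cheap structural check gets you - and how little that says about formal correctness:

```python
# A deliberately superficial check: everything below can be verified in
# a few lines, while almost nothing in the specification list above can.
# Passing this test says very little about whether a PDF is correct.
def looks_like_pdf(path):
    with open(path, "rb") as f:
        data = f.read()
    return (data.startswith(b"%PDF-")        # header, e.g. %PDF-1.7
            and b"startxref" in data         # cross-reference offset keyword
            and b"%%EOF" in data[-1024:])    # end-of-file marker near the end

print(looks_like_pdf("suspect.pdf"))         # hypothetical file
```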
The second reason why verification of PDF probably isn't going to happen is that it would immediately identify half the world's PDF documents as invalid in some way. The fallout from this is in no-one's interest, so we (as an industry) make do with "good enough".
Informal Answer: Ask Acrobat
So given the above, how do we know if a document is displaying correctly? Very simple: we compare it with Acrobat. Although it's been argued that even Acrobat implements its own specification incorrectly at times, in general that's the benchmark we use. If you have a document that's rendering incorrectly in Firefox, or any other viewer, we recommend you use that benchmark too - if it looks correct in Acrobat, drop the Firefox developers a line. Like us, I'm sure they'll be pleased to have another test case to work with, as it will help them build a better product. Although they may groan when your email arrives...
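If you want to automate that comparison, one common approach is to render the same page to a bitmap in each viewer and diff the pixels. Here's a minimal sketch using the third-party Pillow library; the file names are hypothetical, the two images are assumed to be the same size and resolution, and a real harness would tolerate small anti-aliasing differences rather than demanding an exact match:

```python
# Pixel-level comparison of two renderings of the same page.
from PIL import Image, ImageChops

reference = Image.open("page1-acrobat.png").convert("RGB")   # hypothetical
candidate = Image.open("page1-viewer.png").convert("RGB")    # hypothetical

diff = ImageChops.difference(reference, candidate)
if diff.getbbox() is None:       # None means the difference is all zero
    print("pages render identically")
else:
    print("renderings differ within", diff.getbbox())
```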