The Firefox pdf.js Viewer

A few weeks ago Firefox released a free PDF viewer plugin, and a few days after that we started getting emails about it. The emails usually read:

We have created a PDF with your software, loaded it in Firefox and it doesn't look correct. What's up?

PDF is complex. It was first released in 1993 and has gone through about 9 revisions since then. It was also a specification that (until very recently) was developed by one company to match the latest features in their product, and while I'm not privy to Adobe's internal policy I'd bet good money that the specification was written to match the functionality, not the other way around.

As anyone who has subsequently implemented a specification written this way knows, that makes for bad specifications. The logic behind a definition isn't normally clear, and there are always undocumented requirements - undocumented because the author thought they were too obvious to document, or because no-one thought to ask the question.

What this means for anyone developing PDF software is that following the specification will get you half-way there. The rest is testing, and after 11 years we are still getting documents from customers that don't work quite the way they should. "pdf.js" is new software, and it has some way to go - although its arrival indicates the continuing health of the PDF ecosystem, so we're glad to have them about.

Question: Who is correct?

If you create a PDF with our software, display it with Firefox (or any other viewer) and it looks incorrect, how do you know who's at fault? We're pointing the finger at the other product, but then we would, wouldn't we? Ultimately you just need to decide which support team to email.

With a specification that had been designed from the start with verification in mind, this would be an easy question - for example, an XML schema can verify an XML document as being formally correct.

That's not currently possible with PDF, although there has been some interest in the industry in changing this. And for good reason: as I've said before, the "P" in PDF stands for portable, so every PDF that can't be exchanged is a cost to industry. Multiply that by the billions of documents in existence and you'll see that being able to categorically point the finger of blame at someone else has huge financial implications.

Formal Answer: We'll never know

In my opinion, formal verification of PDF is a nice idea but isn't going to happen, for two reasons. First, here's a partial list of specifications that need to be referenced to implement it:

ISO PDF-32000
Adobe TN5004 - AFM
Adobe TN5014 - CID font CMap
Adobe TN5015 - Type 1
Adobe TN5096 - Japan 1-6 character collection
Adobe TN5079 - GB 1-4 character collection
Adobe TN5080 - CNS 1-4 character collection
Adobe TN5092 - CID overview
Adobe TN5093 - Korea 1-2 character collection
Adobe TN5094 - CJKV collections
Adobe TN5116 - Discrete Cosine decompression
Adobe TN5176 - CFF specification
Adobe TN5177 - Type 2 charstring
Adobe TN5411 - ToUnicode mapping
Adobe TN5604 - DeviceN colorspace mapping
Adobe TN5641 - Embedding CID fonts
Adobe Signature Build Dictionary specification
Acrobat JavaScript Reference
Adobe Digital Signature User Guide
Adobe Standard Glyph List
ITU-T X.680 ASN.1
ITU-T X.690 BER encoding rules
ITU-T T.0-T.63 and T.4 for CCITT G3
ITU-T T.6 for CCITT G4
ISO 639 - Language codes
ISO 3166 - Country codes
ISO 10918 - JPEG standard (see also TN5116 above)
ISO 15444 - JPEG2000
ISO 11544 - JBIG2
IEC/3WD 61966-2.1 - sRGB ColorSpace
ISO 15076 - ICC ColorSpace file
ISO 10646 - Unicode
ANSI X3.4-1986 - ASCII
Acrobat 3D JavaScript reference
Open Prepress Specification 1.3
Adobe XML Forms Architecture specification (XFA) - all 8 revisions
Adobe XML Data Package specification
Adobe XMP specification
TIFF Revision 6
Adobe TN5087 - Multiple Master fonts
Adobe TN5088 - Font naming issues
Adobe TN5062 - Portable Job Format specification
Adobe TN5660 - Open Prepress specification 2.0
FIPS PUB 186-2 - Digital Signature Standard
FIPS PUB 197 - AES standard
RFC1321 - MD5 hash algorithm
RFC1738 and 1808 - URLs
RFC1950 and 1951 - Zlib compression
RFC2045 - MIME
RFC2083 - PNG
RFC2315 - PKCS#7
RFC2560 - X.509
RFC2616 - HTTP
RFC2898 - PKCS#5
RFC3066 - language codes (see also ISO3166)
RFC3161 - Time stamp protocol
RFC3174 - SHA1
RFC3280 - CRLs
RFC5702 - SHA2
CSS 2 specification
OpenType font specification
TrueType specification
ECMA-363 Universal 3D specification
PANOSE font metrics guide
ICC color registry
ECMAScript specification
Unicode annex 9 - bidirectional algorithm
Unicode annex 14 - line breaking
Unicode annex 29 - text boundaries
XML specification
RFC3629 - UTF-8
GB 2312-80 character set
GB 18030-2000
Big Five character set
Hong Kong special character set
ETen extensions to Big Five
CNS 11643-1992
JIS X 0208
Shift-JIS
JIS C 6226
ISO-2022
JIS X 0213
KS X 1001
Dublin Core
RDF
XML schema
PostScript specification
PKCS #1 - RSA
JFIF JPEG file interchange specification
ETSI TS 102 293 PAdES
ETSI TS 101 733 CAdES
ETSI TS 102 788-1 and -2 PAdES for PDF
ISO-15438 PDF417 Barcode specification
ISO-19005 PDF/A
PDF/A technote 0001 - namespaces
PDF/A technote 0002 - color
PDF/A technote 0003 - metadata
PDF/A technote 0006 - predefined XMP property list
PDF/A technote 0008 - XMP extension schema
ISO-15930 PDF/X
ISO-18004 QR-Codes
HTML 3.2
ISO-16022 Datamatrix
SWF file format specification

We've probably missed a few (PDF/E and PDF/UA spring to mind), and there are a few aspects (e.g. the variations in CMYK colorspace handling in DCT encoded images with Photoshop markers, the mapping of glyph to character code when the embedded TrueType font is missing the required tables, the correct behaviour when widget and form fields mismatch) which are, as best as we can determine, undocumented.

Do you need all of the above specifications to implement PDF correctly? No, not all of them - we have only about 3/4 of these in our library, and some are used only when creating a PDF, not when parsing. But if you want to verify a PDF is correct then you'll need to verify against many of the above. For this reason determining if a PDF is formally "correct" is an extremely difficult task.
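To give a feel for the gap, here is a rough sketch in Python of about the only check that fits in a few lines: whether the file starts with a %PDF header and ends with a startxref offset and an %%EOF marker. The function name surface_check is made up for this illustration, and this is not how our software or pdf.js works. A file that passes has told you nothing about its fonts, images, colorspaces or signatures, which is where the specifications listed above come in.

```python
import re
from typing import List


def surface_check(data: bytes) -> List[str]:
    """Return problems found in the outermost file structure only.

    Hypothetical helper for illustration. It does not parse objects,
    streams, fonts or anything else inside the file.
    """
    problems = []

    # A PDF is expected to begin with a header such as %PDF-1.7
    if not re.match(rb"%PDF-\d\.\d", data):
        problems.append("missing %PDF-x.y header")

    # The last lines should be: startxref, a byte offset, then %%EOF
    tail = data[-1024:]
    if not re.search(rb"startxref\s+\d+\s+%%EOF\s*$", tail):
        problems.append("missing startxref/%%EOF trailer")

    return problems
```

Calling surface_check(open("example.pdf", "rb").read()) on a file (the filename here is a placeholder) returns an empty list when both markers are present. Everything beyond that point needs an actual parser, and then the rest of the list above.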

The second reason why verification of PDF probably isn't going to happen is it will immediately identify half the world's PDF documents as invalid in some way. The fallout from this is in no-one's interest, so we (as an industry) make do with good enough.

Informal answer: ask Acrobat

So given the above, how do we know if a document is displaying correctly? Very simple: we compare it with Acrobat. Although it's been argued that even Acrobat implements its own specification incorrectly at times, in general that's the benchmark we use. If you have a document that's rendering incorrectly in Firefox, or any other viewer, we recommend that's the benchmark you use too - if it's correct in Acrobat, I suggest you drop the Firefox developers a line. Like us, I'm sure they'll be pleased to have another test-case to work with, as it will help them build a better product. Although they may groan when your email arrives...

Tags: pdflibrary pdfviewer firefox

Posted by Mike Bremford on 14 Mar 2013 at 13:00
