When is a PDF not a PDF?

Yesterday I was doing some experiments with Adobe Illustrator. According to some web sites, the CS5 version saves its files as PDF, though with the extension .AI. When you save a file, though, the options dialog has a checkbox labeled “Create PDF Compatible File.” I unchecked it and saved the file, then opened it in JHOVE. JHOVE says it’s perfectly good PDF; indeed, PDF/A. Then I tried opening it in Preview, and this is what it looked like:

[Image: File says over and over that it was saved without PDF content]

If you don’t actually look at the file but trust the mere fact that it’s a PDF, you might put it into a repository and find out later on that it’s worthless as a PDF. What’s happening is that PDF can embed any kind of content, and this one embeds its native PGF data. Any PDF reader can open the file, but only an application that understands PGF can use its actual content. Anyone putting PDF into a repository should be aware of this risk.

It’s outside the scope of JHOVE to check whether embedded content is acceptable to PDF/A, so the claim that it’s correct PDF/A is probably spurious. It is, however, definitely legal PDF.

This type of situation helps to show why PDF/A-3 is a bad idea.

10 responses to “When is a PDF not a PDF?”

  1. Great post, but I’m not quite sure if the final comment about PDF/A-3 really applies here. The kind of “embedding” that is allowed there only refers to file attachments. These are not part of a PDF’s content stream, and are never actually (meant to be) rendered in a PDF viewer anyway.

    The confusing thing about embedded objects in PDF is that they can be used for 2 different things:

    1. The embedded object is part of the file’s content stream and is supposed to be rendered by the viewer (e.g. movies, 3-D objects, sounds etc.). In that case the PDF will contain a movie, sound, etc. annotation that references the embedded object.

    2. The embedded object is *not* part of the content stream and is *not* supposed to be rendered by the viewer. In this case the embedded object is referenced by a file attachment annotation.

    Although I haven’t yet seen the actual PDF/A-3 spec, from what I understand it only allows the second type (i.e. the object is embedded as a file attachment, and its contents are not meant to be rendered by the viewer).

    Your Illustrator example appears to be of the first type. My best guess would be (but you can check this) that in your file the PGF content is referenced by something like a “PGF Annotation” or whatever it may be called, which of course isn’t defined by the PDF filespec. If you have Acrobat Professional you might try analysing the file against the PDF/A-1b profile in Preflight. That will probably give some error message about an unknown or not-allowed annotation type. Or so I hope!

    • Something more obscure than I thought is happening. The PDF/A-1b profile check says that it isn’t compliant, but the messages are “PDF/A entry missing” and “XMP property is not predefined and no extension schema present.” There aren’t any embedded files, so you’re right on that count.
      This is in fact a different case from PDF/A-3, but I just can’t figure out where it’s stowing the PGF content.

  2. Wow, there is a lot of incorrect stuff here…

    Gary –
    First, if you have a .ai file, then it is an Adobe Illustrator file. The fact that it may use PDF as a wrapper/container is simply a container choice, but it should not be treated as a PDF, especially if you’ve turned off the “compatibility” option. This is no different from both OOXML and ODF using ZIP as their container: you can’t use an ODF viewer on an OOXML file. Sure, you could use a ZIP viewer, but you may not be able to use the contents.

    Second, even in the case where you are creating a PDF compatible file, what you are doing is ensuring that the page content can be viewed by a PDF viewer, BUT it is still not the actual document. The native content is stored in a special location in the PDF (see PieceInfo, ISO 32000-1:2008, 14.5); it does NOT create an embedded file. Why? Two reasons: 1) Illustrator was doing this long before PDF supported attachments and 2) we don’t want users extracting out the private info.

    All of this, therefore, has NOTHING to do with PDF/A, and certainly not PDF/A-3, which (as I believe I wrote on the other blog entry) is an excellent standard for the use cases for which it was designed.

    Johan –

    There is no such thing as “embedded objects in the content stream”. A content stream can only contain defined operators and their operands. Other types of objects (movies, 3D, etc.) are all addressed using Annotations and their specific dictionary and data types. Because they are annotations, they are rendered by the viewer according to the rules for annotations.

    Embedded files (aka attachments) are handled in a completely different manner in the PDF, as they are NOT for display but strictly for use much as you would use attachments in email (or for using a PDF as a ZIP-like container, aka Portfolios). And you are correct, this is what is permitted by PDF/A-3 because (as you note) their contents are not to be rendered (unless they, themselves, are PDF/A compliant).

    • That’s the information I was looking for. Thanks.
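      For anyone who wants a quick first look at their own files, here is a minimal sketch of a raw byte scan. It only checks for the PDF signature and counts a few name markers mentioned in this thread. The function name `scan_pdf_markers` and the hand-made `sample` bytes are hypothetical, made up for illustration. A scan like this is not a PDF parser: it will miss dictionaries stored inside compressed object streams, and a PieceInfo entry on its own does not prove that a file came from Illustrator.

```python
PDF_SIGNATURE = b"%PDF-"

# Name markers discussed in this thread; the selection is an assumption, not a standard list.
MARKERS = (b"/PieceInfo", b"/EmbeddedFiles", b"/FileAttachment")


def scan_pdf_markers(data: bytes) -> dict:
    """Report whether the bytes start with a PDF signature and count each marker."""
    report = {"pdf_signature": data.startswith(PDF_SIGNATURE)}
    for marker in MARKERS:
        report[marker.decode("ascii")] = data.count(marker)
    return report


# Hypothetical, hand-made bytes standing in for a file; NOT a real Illustrator file.
sample = b"%PDF-1.5\n1 0 obj\n<</Type/Catalog/PieceInfo 2 0 R>>\nendobj\n"
report = scan_pdf_markers(sample)
print(report)
```

      To try it on a real file, read the file in binary mode and pass the bytes to `scan_pdf_markers`. Treat a hit as a reason to look more closely, not as a verdict.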

    • I finally had some time to take a detailed look at Leonard’s comments above. First of all he’s right that I misused the term “content stream”; I really meant everything that’s supposed to be rendered (which is something else). I may have created some unnecessary confusion there (I was working off the top of my head, without any access to the filespec).

      However, I’m not too sure about some of his observations on the differences between renderable objects and file attachments in PDF. In particular, he states that:

      Objects such as movies, 3D etc. all use Annotations, and are rendered by the viewer according to the rules for annotations.
      Embedded files (which you suggest are identical to attachments) are handled in a completely different manner: they are not used for display/rendering, but used in a way that is similar to attachments in an e-mail message.

      After some digging around in the PDF filespec and hacking away at some sample files, I think the differences between the two are a bit subtler than that.

      Multimedia sample file

      First of all have a look at this sample file that I created some time ago:

      http://www.opf-labs.org/format-corpus/pdfCabinetOfHorrors/embedded_video_quicktime.pdf

      This PDF 1.7 file was created in Acrobat 9, and if you open it you will see it contains a short Quicktime movie that will play after clicking on it. The movie is (supposed to be) rendered by the viewer, so in principle the observations under 1. above should apply.

      PDF file specification on Multimedia

      So what does the PDF Specification (ISO 32000) have to say about this? Section 13.2.1 (Multimedia) says (emphases added by me):

      Rendition actions (see 12.6.4.13, “Rendition Actions”) shall be used to begin the playing of multimedia content.
      A rendition action associates a screen annotation (see 12.5.6.18, “Screen Annotations”) with a rendition (see 13.2.3, “Renditions”).
      Renditions are of two varieties: media renditions (see 13.2.3.2, “Media Renditions”) that define the characteristics of the media to be played, and selector renditions (see 13.2.3.3, “Selector Renditions”) that enables choosing which of a set of media renditions should be played.
      Media renditions contain entries that specify what should be played (see 13.2.4, “Media Clip Objects”), how it should be played (see 13.2.5, “Media Play Parameters”), and where it should be played (see 13.2.6, “Media Screen Parameters”).

      These entities can all be found in my sample file, but more about that later. The actual data for a media object are defined by Media Clip Objects, and more specifically by the media clip data dictionary, which is described in Section 13.2.4.2. This section contains a note, saying that this dictionary “may reference a URL to a streaming video presentation or a movie embedded in the PDF file”. This is consistent with the description of the /D entry of the media clip data dictionary that follows in Table 274, which states that its value is either a full file specification or a form XObject.

      Analysis of multimedia sample file

      With this in mind I opened my sample file in a text editor to have a look what’s really happening here under the hood. First of all I found this Screen Annotation:

      35 0 obj
      <</A 36 0 R/AP<</N 39 0 R>>/BS<</S/S/Type/Border/W 1>>/Border[0 0 1]/F 4/MK<</I 37 0 R>>/P 25 0 R/Rect[73.1804 360.647 393.18 600.647]/Subtype/Screen/T(Annotation from animation.mov)/Type/Annot>>
      endobj

      It is immediately followed by a Rendition Action (which is referenced by the Screen Annotation above):

      36 0 obj
      <</AN 35 0 R/OP 0/R 40 0 R/S/Rendition>>
      endobj

      Then there’s the rendition object (referenced by the Rendition Action above):

      40 0 obj
      <</C 41 0 R/N(Rendition from animation.mov)/S/MR>>
      endobj

      A Media clip data dictionary (referenced by the rendition object above):

      41 0 obj
      <</CT(video/quicktime)/D 42 0 R/N(Media clip from animation.mov)/P<</TF(TEMPACCESS)>>/S/MCD>>
      endobj

      This is then followed by a File specification dictionary (explained in Section 7.11.3 of the filespec) (referenced by the media clip data dictionary above):

      42 0 obj
      <</EF<</F 43 0 R>>/F()/Type/Filespec/UF()>>
      endobj

      (Finally object 43 0 is a stream object that contains the actual movie data.)

      What’s particularly interesting here is the /EF entry, which means we’re dealing with an embedded file stream here. But according to Leonard embedded files and renderable objects are two completely different things altogether.
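      The chain of references described above can be followed mechanically. The following is only a sketch: the object bodies are copied from the excerpts above, while the `objects` table and the `ref` helper are hypothetical names introduced for illustration. A regular expression is not a real PDF parser, and it only works here because these dictionaries are small and uncompressed.

```python
import re

# Object bodies copied from the sample file excerpts quoted above.
objects = {
    35: "<</A 36 0 R/AP<</N 39 0 R>>/BS<</S/S/Type/Border/W 1>>/Border[0 0 1]/F 4"
        "/MK<</I 37 0 R>>/P 25 0 R/Rect[73.1804 360.647 393.18 600.647]"
        "/Subtype/Screen/T(Annotation from animation.mov)/Type/Annot>>",
    36: "<</AN 35 0 R/OP 0/R 40 0 R/S/Rendition>>",
    40: "<</C 41 0 R/N(Rendition from animation.mov)/S/MR>>",
    41: "<</CT(video/quicktime)/D 42 0 R/N(Media clip from animation.mov)"
        "/P<</TF(TEMPACCESS)>>/S/MCD>>",
    42: "<</EF<</F 43 0 R>>/F()/Type/Filespec/UF()>>",
}


def ref(body: str, key: str) -> int:
    """Return the object number of the indirect reference stored directly under /key."""
    match = re.search(r"/" + re.escape(key) + r"\s+(\d+)\s+\d+\s+R", body)
    if match is None:
        raise KeyError(key)
    return int(match.group(1))


# Screen annotation -> rendition action -> rendition -> media clip data -> file spec
chain = [35]
for key in ("A", "R", "C", "D"):
    chain.append(ref(objects[chain[-1]], key))

# The file specification keeps the stream reference one level down, inside /EF.
ef = re.search(r"/EF\s*<<\s*/F\s+(\d+)\s+\d+\s+R", objects[chain[-1]])
chain.append(int(ef.group(1)))

print(" -> ".join(str(n) for n in chain))  # 35 -> 36 -> 40 -> 41 -> 42 -> 43
```

      The walk ends at object 43, the stream that holds the movie data, and the last hop goes through the /EF entry.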

      File attachments

      So how are file attachments any different from this? Judging from the specification: not by that much! Section 12.5.6.15 describes the File Attachment Annotation. Just like the Screen Annotation shown above, it references a File specification dictionary which points to a stream object that contains the embedded file data. Section 7.11.4.1 (Embedded File Streams) also describes another method, which uses an EmbeddedFiles entry in the PDF document’s name dictionary.

      Incidentally, I also have a sample file that contains a file attachment:

      http://www.opf-labs.org/format-corpus/pdfCabinetOfHorrors/fileAttachment.pdf

      I cannot find any File Attachment Annotation here, but the document does contain an EmbeddedFiles entry:

      33 0 obj
      <</EmbeddedFiles 34 0 R/JavaScript 35 0 R>>
      endobj

      Nevertheless, the attachment is embedded in exactly the same way as the Quicktime movie in our earlier example:

      37 0 obj
      <</Desc()/EF<</F 38 0 R>>/F(KSBASE.WQ2)/Type/Filespec/UF(KSBASE.WQ2)>>
      endobj
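      To make the comparison concrete, here is a small sketch that applies the same test to both file specification dictionaries quoted above. The helper name `embedded_stream` is hypothetical, and the pattern matching is an illustration only, not a substitute for a real PDF parser.

```python
import re


def embedded_stream(body: str):
    """Return the embedded stream's object number if body is a Filespec with /EF /F, else None."""
    if re.search(r"/Type\s*/Filespec", body) is None:
        return None
    match = re.search(r"/EF\s*<<\s*/F\s+(\d+)\s+\d+\s+R", body)
    return int(match.group(1)) if match else None


# Object 42 from the movie sample and object 37 from the attachment sample.
movie = "<</EF<</F 43 0 R>>/F()/Type/Filespec/UF()>>"
attachment = "<</Desc()/EF<</F 38 0 R>>/F(KSBASE.WQ2)/Type/Filespec/UF(KSBASE.WQ2)>>"

print(embedded_stream(movie), embedded_stream(attachment))  # 43 38
```

      Both dictionaries pass the same check. Only the objects that point at them (a media clip data dictionary in one case, the EmbeddedFiles entry in the other) tell the two uses apart.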

      Conclusions?

      Based on the PDF Specification (ISO 32000) and the analysis of 2 sample files, file attachments and multimedia content in PDF appear to be a lot more similar than Leonard is suggesting. In particular:

      At the lowest level, file attachments use a File specification dictionary with an Embedded file stream entry. Multimedia content can be embedded in exactly the same manner!
      Objects that are meant to be rendered by the viewer use Annotations. File attachments can use these as well (File Attachment Annotations).

      Things may be different for (some of the) other annotation types, but I haven’t looked at those in any detail.

  3. Great info, and we will probably pop for your KS Campaign. How can we best ensure the accuracy of a PDF of a painting for digital preservation? I was naive enough to think all PDF was hi-fidelity. BB

    • Bill,
      In all fairness, Adobe Illustrator files don’t advertise themselves as PDFs. They use the .AI extension. It’s just when you get too clever and use software to analyze the contents of the file that it impersonates a PDF. Letting .AI files have the signature of a PDF was a bad decision on Adobe’s part, at least from a preservation standpoint.

      As for best preservation practices, I’d ask why you need to use PDFs at all for preserving a digital image of a painting. An image format such as TIFF would have fewer complications. If you need to use PDF, make sure to set the highest possible JPEG quality and, on general principles, enable PDF/A. If you use JPEG compression, either in a JPEG or a PDF file, there’s going to be some loss of quality, though you can minimize this by using the highest quality setting.

  4. Henk Vanstappen

    Thanks for this post! DROID seems to make the same ‘mistake’ as JHOVE; it reported an extension mismatch when it encountered .ai files in an archive I was working on. I almost believed it ;)