Format conformity

By design JHOVE measures strict conformity to file format specifications. I’ve never been convinced this is the best way to measure a file’s viability or even correctness, but it’s what JHOVE does, and I’d just create confusion if I changed it now.

In general, the published specification is the best measure of a file’s correctness, but there are clearly exceptions, and correctness isn’t the same as viability for preservation. Let’s look at the rather extreme case of TIFF.

The current official specification of TIFF is Revision 6.0, dated June 3, 1992. The format hasn’t changed a byte in over 20 years — except that it has.

The specification says about value offsets in IFDs: “The Value is expected to begin on a word boundary; the corresponding Value Offset will thus be an even number.” This is a dead letter today. Much TIFF generation software freely writes values on any byte boundary, and just about all currently used readers accept them. JHOVE initially didn’t accept files with odd byte alignment as well-formed, but after numerous complaints it added a configuration option to allow them.
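To make the word-boundary rule concrete, here is a minimal Python sketch that walks the main IFD chain of a TIFF file and lists the values stored at odd offsets. It is only an illustration, not how JHOVE implements its check. The function name `odd_value_offsets` is made up for this example. It assumes classic (non-BigTIFF) files, does not follow SubIFDs, and only looks at values too large to fit inline in the 4-byte Value Offset field, since those are the only ones for which the field holds an offset.

```python
import struct

# Byte sizes of the TIFF 6.0 field types (1 = BYTE ... 12 = DOUBLE).
TYPE_SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 6: 1,
              7: 1, 8: 2, 9: 4, 10: 8, 11: 4, 12: 8}


def odd_value_offsets(data: bytes):
    """Return (tag, offset) pairs for out-of-line values at odd offsets.

    Hypothetical helper for illustration: walks only the main IFD chain.
    """
    if data[:2] == b"II":
        endian = "<"   # little-endian file
    elif data[:2] == b"MM":
        endian = ">"   # big-endian file
    else:
        raise ValueError("missing TIFF byte-order mark")
    magic, ifd_offset = struct.unpack_from(endian + "HI", data, 2)
    if magic != 42:
        raise ValueError("missing TIFF magic number 42")

    findings = []
    seen = set()  # guards against a looping IFD chain
    while ifd_offset and ifd_offset not in seen:
        seen.add(ifd_offset)
        (count,) = struct.unpack_from(endian + "H", data, ifd_offset)
        for i in range(count):
            entry = ifd_offset + 2 + 12 * i
            tag, ftype, n, value = struct.unpack_from(
                endian + "HHII", data, entry)
            size = TYPE_SIZES.get(ftype)
            if size is None:
                continue  # unknown field type: size can't be judged
            # Values of 4 bytes or fewer are stored inline, so the
            # field is an offset only when the value is larger.
            if size * n > 4 and value % 2:
                findings.append((tag, value))
        # The 4 bytes after the last entry point to the next IFD (0 = none).
        (ifd_offset,) = struct.unpack_from(
            endian + "I", data, ifd_offset + 2 + 12 * count)
    return findings
```

Passing it the bytes of a file, for example `odd_value_offsets(open(path, "rb").read())`, gives an empty list for a file that follows the rule to the letter. A non-empty list points to the tags that a strict reading of TIFF 6.0 would object to, even though nearly every reader in use today will open the file anyway.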

Over the years a body of apocrypha has grown around TIFF. Some comes from Adobe, some not. The titles of the ones from Adobe don’t clearly mark them as revisions to TIFF, but they are. The “Adobe PageMaker® 6.0 TIFF Technical Notes,” September 14, 1995, define the important concept of SubIFD, among other changes. The “Adobe Photoshop® TIFF Technical Notes,” March 22, 2002, define new tags and forms of compression. The “Adobe Photoshop® TIFF Technical Note 3,” April 8, 2005, adds new floating point types. The last one isn’t available, as far as I can tell, on Adobe’s own website, but it’s canonical.

Then there’s material without official Adobe approval. The JPEG compression defined in the 2002 tech notes officially adopts a 1995 draft note that had already gained wide acceptance.

What’s the best measure of a TIFF file? That it corresponds strictly to TIFF 6.0? To 6.0 plus a scattered set of tech notes? Or that it’s processed correctly by LibTiff, a freely available and very widely used C library? To answer the question, we have to specify: Best for what? If we’re talking about the best chance of preservation, what scenarios are we envisioning?

One scenario amounts to a desert-island situation in which you have a specification, some files that you need to render, and a computer. You don’t have any software to go by. In this case, conformity to the spec is what you need, but it’s a rather unlikely scenario. If all existing TIFF readers disappear, things have probably gone so far that no one will be motivated to write a new one.

It’s more likely that people a few decades in the future will scramble to find software or entire old computers that will read obsolete formats. This doesn’t necessarily mean today’s software, but what we can read today can be a pretty good guide to what will be readable in the future. Insisting on conformity to the spec may be erring on the safe side, but if it excludes a large body of valuable files, it’s not a good choice.

Rather than insisting solely on conformity to a published standard, we need to judge whether files are preservation-worthy by balancing two risks: accepting files that will cause reading problems down the road, and rejecting files that won’t. Multiple factors come into consideration, of which the spec is just one.

One response to “Format conformity”

  1. It also depends if you’re in control of the creation of the files as well as ingesting them. If you’re creating them (or having them created for you under some contractual arrangement), you might want to be more specific about the outputs you’ll accept (say in JPEG2000, lossless or lossy compression, what degree of compression) depending on the purpose of the files. If you have no choice about accepting the files, it might be useful to know in what ways they are non-conformant, but you probably can’t not ingest them into your preservation system.