File identification tools, part 1

This is the start of a series on software for file identification. I’ll be exploring as broad a range as I reasonably can within the blog format, covering a variety of uses. I’m most familiar with the tools for preservation and archiving, but I’ll also look at tools for the end user and at digital forensics (in the proper sense of the word, the resolution of controversies).

We have to start with what constitutes “identifying” a file. For our purposes here, it means at least identifying its type. It can also include determining its subtype and telling you whether it’s a valid instance of the type. There are several ways to go about it. The simplest approach is to look at the file’s extension and hope it isn’t a lie. A little better is to use software that looks for a “magic number,” a fixed sequence of bytes at or near the start of the file that marks its format. This gives a better clue but doesn’t tell you if the file is actually usable. Many tools are available that will look more rigorously at the file. Generally the more thorough a tool is, the narrower the range of files it can identify.
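To make the difference between the first two approaches concrete, here is a minimal Python sketch. The signature table is a small hand-picked sample of well-known magic numbers, not a complete registry, and the function names are my own invention for illustration. A match says only that the file starts the way that format should; it says nothing about whether the rest is valid.

```python
from pathlib import Path

# A few well-known signatures; a real tool would carry a far larger table.
MAGIC_NUMBERS = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"II*\x00", "tiff"),   # little-endian TIFF
    (b"MM\x00*", "tiff"),   # big-endian TIFF
]

# Hypothetical extension table, for illustration only.
EXTENSIONS = {
    ".pdf": "pdf", ".png": "png", ".jpg": "jpeg", ".jpeg": "jpeg",
    ".gif": "gif", ".tif": "tiff", ".tiff": "tiff",
}


def type_from_extension(filename):
    """Guess the type from the file name alone (trusting, easily fooled)."""
    return EXTENSIONS.get(Path(filename).suffix.lower())


def type_from_magic(data):
    """Guess the type from the leading bytes of the file's content."""
    for signature, file_type in MAGIC_NUMBERS:
        if data.startswith(signature):
            return file_type
    return None


def extension_is_a_lie(filename, data):
    """True when both guesses exist and disagree with each other."""
    by_name = type_from_extension(filename)
    by_content = type_from_magic(data)
    return by_name is not None and by_content is not None and by_name != by_content
```

Calling extension_is_a_lie("scan.png", data) on content that begins with the JPEG signature would flag the mismatch, which is about as far as this level of checking can take you.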

Identification software can be too lax or too strict. If it’s too lax, it can give broken files, perhaps even malicious ones, its stamp of approval. If it’s too strict, it can reject files that deviate from the spec in harmless and commonly accepted ways. Some specifications are ambiguous, and an excessively strict checker might rely on an interpretation which others don’t follow. A format can have “dialects” which aren’t part of the official definition but are widely used. TIFF, to name one example, is open to all of these problems.

Some files can be ambiguous, corresponding to more than one format. Here’s a video with some head-exploding examples. It’s long but worth watching if you’re a format junkie.

The examples in the video may seem far-fetched, but there’s at least one commonly used format that has a dual identity: Adobe Illustrator files. Illustrator knows how to open a .ai file and get the application-specific data, but most non-Adobe applications will see it as a PDF file. Ambiguity can be a real problem when file readers are intentionally lax and try to “repair” a file. Different applications may read entirely different file types and content from the same file, or the same file may have different content on the screen and when printed. So even if an identification tool tells you correctly what the format is, that may not be the whole story. I don’t know of any tool that tries to identify multiple formats for the same file.
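As a rough illustration of what reporting more than one format might look like, here is a Python sketch that collects every signature match instead of stopping at the first. It isn't modeled on any existing tool. It assumes two conventions that make dual-identity files possible: lenient PDF readers accept the "%PDF-" header anywhere in the first 1024 bytes, and ZIP is located by an end-of-central-directory record near the end of the file rather than by its first bytes. The function name and the short list of formats are my own choices.

```python
# Maximum distance of the ZIP end-of-central-directory record from the end
# of the file: 22 bytes of fixed fields plus a comment of up to 65535 bytes.
ZIP_EOCD_WINDOW = 22 + 65535


def candidate_formats(data):
    """Return every format whose signature appears where a reader would look.

    This is a sketch: a signature match is only a hint, not proof that the
    file is a valid instance of the format.
    """
    found = []
    # Lenient PDF readers tolerate junk before the header.
    if b"%PDF-" in data[:1024]:
        found.append("pdf")
    # ZIP readers scan backward from the end for this record.
    if b"PK\x05\x06" in data[-ZIP_EOCD_WINDOW:]:
        found.append("zip")
    # GIF must start at byte 0.
    if data.startswith((b"GIF87a", b"GIF89a")):
        found.append("gif")
    return found
```

A file that begins with a PDF header and ends with a ZIP directory would come back as ["pdf", "zip"], which is at least an honest answer about what different readers might make of it.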

Knowing the version and subtype of a file can be important. When an application reads a file in a newer version than it was written for, it may fail unpredictably, and it’s likely to lose some information. Some applications limit their backward compatibility and may be unable to read old versions of a format. Subtypes can indicate a file’s suitability for purposes such as archiving and prepress.
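For a simple case of version detection, take PDF, which states its version in the header line, as in "%PDF-1.4". The Python sketch below pulls that number out. The function name is my own, and the result is only the header's claim: from PDF 1.4 onward a Version entry in the document catalog can override it, and subtypes such as PDF/A are declared elsewhere in the file, so a thorough tool has to look further.

```python
import re

# Matches headers such as "%PDF-1.4" or "%PDF-2.0".
_PDF_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_header_version(data):
    """Return (major, minor) from the PDF header, or None if there isn't one.

    Only the first 1024 bytes are searched, matching what lenient readers
    accept. This reports what the header claims, nothing more.
    """
    match = _PDF_HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

An application written for PDF 1.4 could compare the returned tuple against (1, 4) and warn before opening something newer, rather than failing unpredictably partway through.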

I’ll use the tag “fident” for all posts in this series, to make it easy to grab them together.

Next: The shell’s “file” command-line tool.
