File identification tools, part 2: file

A widely available file identification tool is simply called file. It comes with nearly all Linux and Unix systems, including Macintosh computers running OS X. Detailed “man page” documentation is available. It requires using the command line shell, but its basic usage is simple:

file [filename]

file runs its tests in a fixed order. First, it checks for some special cases, such as directories, empty files, and “special files” that aren’t really files but ways of referring to devices. Second, it checks for “magic numbers,” byte sequences near the beginning of the file that are (hopefully) unique to the format. If it doesn’t find a “magic” match, it checks whether the file looks like a text file, trying a variety of character encodings including the ancient and obscure EBCDIC. Finally, if it looks like a text file, file will attempt to determine if it’s in a known computer language (such as Java) or natural language (such as English). The identification of file types is generally good, but the language identification is very erratic.
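To make that ordering concrete, here is a minimal Python sketch of the same idea. It is not how file is actually implemented: the function name identify, the handful of signatures, and the two encodings are all my own illustrative choices, and the real tool consults a far larger magic database and handles many more special cases.

```python
import sys

# Simplified sketch of the test order described above; NOT file's real code.
# This signature table is a tiny hand-picked sample, not file's magic database.
MAGIC_SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image data"),
    (b"\xff\xd8\xff", "JPEG image data"),
    (b"GIF87a", "GIF image data"),
    (b"GIF89a", "GIF image data"),
    (b"PK\x03\x04", "Zip archive data"),
]

# Encodings to try, in order, when no signature matches (assumed subset).
TEXT_ENCODINGS = [("ascii", "ASCII text"), ("utf-8", "UTF-8 Unicode text")]


def identify(data: bytes) -> str:
    """Classify a byte string using special case, magic, then text tests."""
    # Step 1: special cases (only the empty file is modeled here).
    if not data:
        return "empty"
    # Step 2: magic numbers at the very start of the content.
    for signature, label in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return label
    # Step 3: does it decode cleanly as text, with no stray control characters?
    for encoding, label in TEXT_ENCODINGS:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue
        if all(ch.isprintable() or ch in "\t\r\n" for ch in text):
            return label
    # Fallback: unidentified binary content, which file reports as "data".
    return "data"


if __name__ == "__main__":
    # Usage: python identify.py somefile [otherfile ...]
    for path in sys.argv[1:]:
        with open(path, "rb") as handle:
            print(f"{path}: {identify(handle.read())}")
```

Even this toy version shows why the tool can be fooled: anything that starts with the right few bytes is reported as that format, whether or not the rest of the file is valid.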

The identification of magic numbers uses a set of magic files, and these vary among installations, so running the same version of file on different computers may produce different results. You can specify a custom set of magic files with the -m flag. If you want a file’s MIME type, you can specify --mime, --mime-type, or --mime-encoding. For example:

file --mime xyz.pdf

will tell you the MIME type of xyz.pdf. If it really is a PDF file, the output will be something like

xyz.pdf: application/pdf; charset=binary

If instead you enter

file --mime-type xyz.pdf

you’ll get

xyz.pdf: application/pdf
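If you’re collecting this output in a script, the lines are easy to pick apart. The following Python sketch assumes the output has exactly the shape shown in the two examples above; the names MimeResult and parse_mime_line are hypothetical helpers of mine, not part of file or of any library.

```python
from typing import NamedTuple, Optional


class MimeResult(NamedTuple):
    filename: str
    mime_type: str
    charset: Optional[str]  # None when the line has no charset parameter


def parse_mime_line(line: str) -> MimeResult:
    """Split one line of `file --mime` or `file --mime-type` output."""
    # The filename may itself contain ": ", so split at the last occurrence.
    filename, _, description = line.strip().rpartition(": ")
    # Everything before the first ";" is the MIME type; the rest is parameters.
    mime_type, _, params = description.partition(";")
    params = params.strip()
    charset = None
    if params.startswith("charset="):
        charset = params[len("charset="):]
    return MimeResult(filename, mime_type.strip(), charset)


if __name__ == "__main__":
    print(parse_mime_line("xyz.pdf: application/pdf; charset=binary"))
    print(parse_mime_line("xyz.pdf: application/pdf"))
```

Splitting at the last colon-and-space is a judgment call: it copes with odd filenames, on the assumption that a MIME type never contains that sequence.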

If some tests aren’t working reliably on your files, you can use the -e option to exclude them by name. For example, if you don’t trust the magic files, you can skip the tests that consult them by entering

file -e soft xyz.pdf

But then you’ll get the uninformative

xyz.pdf: data

The -k option tells file not to stop with the first match but to apply additional tests. I haven’t found any cases where this is useful, but it might help to identify some weird files. It can slow down processing if you’re running it on a large number of files.

As with many other shell commands, you can type file --help to see all the options.

file can easily be fooled and won’t tell you if a file is defective, but it’s a very convenient quick way to query the type of a file.

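Windows has a command line tool with a similar-sounding name, FTYPE, but it works quite differently: rather than examining a file’s contents, it shows or changes the command associated with a registered file type, and its syntax is completely different.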

Next: DROID and PRONOM. To read this series from the beginning, start here.

One response to “File identification tools, part 2: file”

  1. Nice write up. Here are a few more details in case they are of interest.

    The file tool also has some container identification code, e.g. for the binary Microsoft Office formats: cdf.h. You can see where this appears in the sequence of tests here.

    Also, I’m pretty sure FTYPE doesn’t actually identify bytestreams at all. It exposes the OS format registry, but in terms of internal format identifiers. So, if you try to run it on a file, it does nothing:

    C:\>ftype example.jpg
    File type 'example.jpg' not found or no open command associated with it.

    But if you know the internal format names, it shows you the application that can open it:

    C:\>ftype jpegfile
    jpegfile=%SystemRoot%\System32\rundll32.exe "%ProgramFiles%\Windows Photo Viewer\PhotoViewer.dll", ImageView_Fullscreen %1

    The ASSOC command can be used to look up the association between file extensions and internal format names.