After many years of waiting, the California Digital Library has announced the availability of the Unified Digital Format Registry (UDFR). This is intended as a single, comprehensive source for information about file formats, with particular attention to requirements for archiving and preservation. It draws on the well-regarded PRONOM from the UK’s National Archives and the ill-fated Global Digital Format Registry (GDFR) from a number of institutions, including OCLC and Harvard. (Full disclosure: I worked on the latter.)
The site isn’t intuitive. The first thing you should do is get the manual (PDF) and read at least the “Getting started” section. Even that’s interspersed with information for advanced users, so you may prefer to follow along with my attempts to dig into it:
First, you have to choose a knowledge base from the “Select knowledge base” pane. The only one that’s currently useful is “UDFR – Start Here”. Clicking it will bring up a Navigation pane (titled “Navigate:classes”). If you’re interested in looking for information on a particular format, click on “Abstract base.” This will bring up a “Comments, Descriptions, and Notes” pane in the center with a statement that there aren’t any, followed by some mysterious information. Don’t worry about it. Go to the search box in the top left pane and enter a search string, such as “jpeg”. This will get you a Resource List with a bunch (currently eleven) of named resources that relate to JPEG. You’ll notice none of them relate to JPEG2000; evidently the search matches whole words only. A search for “JPEG2000” works if that’s what you’re looking for.
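To make the whole-word behavior concrete, here is a minimal Python sketch. It is not UDFR’s actual search code, which I haven’t seen; it just reproduces what I observed. The function name `whole_word_search` and the second resource name in the list are made up for illustration.

```python
import re

def whole_word_search(query, names):
    """Return the names that contain `query` as a whole word, ignoring case."""
    # \b only matches at a word boundary, so "jpeg" will not match inside
    # "JPEG2000", where the G is followed directly by a digit.
    pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
    return [name for name in names if pattern.search(name)]

# The first name comes from the UDFR result list; the second is hypothetical.
resources = [
    "Acrobat 5.0 creation process for JPEG File Interchange Format 1.01",
    "JPEG2000 compression",
]

print(whole_word_search("jpeg", resources))
print(whole_word_search("JPEG2000", resources))
```

The first call returns only the Acrobat entry and the second returns only the JPEG2000 entry, which matches what the search box did for me.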
None of these results is actually a format. Ten are marked as “process” and one as “compression.” So let’s click on the first one (“Acrobat 5.0 creation process for JPEG File Interchange Format 1.01”). This gives you a “Review” which tells you, among other things, that the process’s output format is “JPEG File Interchange Format, version 1.01”. Click on that, and hurrah! You get a description of that format. This includes lots of interesting fields, including byte order (big/little endian), form (binary/text), primary genre (still image, in this case), release date, signatures, version, previous version, and more.
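If it helps to picture what such a description holds, here is a rough Python sketch of a record carrying the fields listed above, plus a check that uses the signatures. The class name `FormatRecord`, its attribute names, and the `matches` method are my own inventions, not UDFR’s data model. The byte order and signature values are what I know about JFIF files in general (they start with FF D8 FF E0 and carry “JFIF” followed by a null byte at offset 6), not values copied from the registry. I’ve left the release date and previous version empty rather than guess.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class FormatRecord:
    """Hypothetical container for the fields a format description shows."""
    name: str
    version: str
    byte_order: str                  # "big-endian" or "little-endian"
    form: str                        # "binary" or "text"
    primary_genre: str
    release_date: Optional[str] = None
    previous_version: Optional[str] = None
    # Each signature is an (offset, bytes) pair expected at that offset.
    signatures: List[Tuple[int, bytes]] = field(default_factory=list)

    def matches(self, data: bytes) -> bool:
        """True if every signature is found at its offset in `data`."""
        return bool(self.signatures) and all(
            data[offset:offset + len(magic)] == magic
            for offset, magic in self.signatures
        )

jfif = FormatRecord(
    name="JPEG File Interchange Format",
    version="1.01",
    byte_order="big-endian",
    form="binary",
    primary_genre="still image",
    signatures=[(0, b"\xff\xd8\xff\xe0"), (6, b"JFIF\x00")],
)

# First bytes of a JFIF 1.01 file: SOI, APP0 marker, segment length,
# the "JFIF" identifier, then the version bytes.
sample = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01"
print(jfif.matches(sample))
```

A registry that records signatures this way is what lets identification tools go from “here are some bytes” to “this is JFIF 1.01,” which is a large part of why these fields matter for preservation.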
Not all formats are equally well documented at this point, which isn’t surprising given that the registry has only just been announced. I tried searching for “ascii” and this time did find some format references without an extra step. I clicked on “8-bit ASCII Text,” since we all know there’s no such thing, and got a very minimal set of information. The real thing, “7-bit ASCII Text” (which is more properly known as US-ASCII), doesn’t have any more information.
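For anyone wondering why “8-bit ASCII” gets the raised eyebrow: ASCII only defines code points 0 through 127, so any byte with the high bit set belongs to some other encoding. A quick sketch of the test, with a function name (`is_7bit_ascii`) of my own choosing:

```python
def is_7bit_ascii(data: bytes) -> bool:
    """True if every byte is in the ASCII range 0x00-0x7F."""
    return all(byte < 0x80 for byte in data)

print(is_7bit_ascii(b"plain text"))
print(is_7bit_ascii("caf\u00e9".encode("latin-1")))
```

The first call prints `True`. The second prints `False`, because Latin-1 encodes the accented e as the single byte 0xE9, which is outside the ASCII range.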
I suppose I should really call up Stephen Abrams and volunteer to fill some of these in. But not today; it’s the 4th of July. In the meantime, I hope this makes some people’s initial plunge into UDFR a little easier.