The state of file format registries

Looking through UDFR is like walking through a ghost town that still shows many signs of its former promise. The UDFR Final Report (PDF) helps to explain this; it’s a very sad story of a brilliant idea that encountered tons of problems with deadlines and staffing. What’s there is hard to use and, as far as I can tell, isn’t getting used much. I don’t see any signs of recent updates.

The website is challenging for the inexperienced user, but this wouldn’t matter so much if it exposed its raw information so developers could write front ends for specific needs. Chris Prom wrote that “it is a great day for practical approaches to electronic records because all kinds of useful tools and services can and will be developed from the UDFR knowledge base.” But I just can’t see how. I wrote to Stephen Abrams a while back about problems I was encountering (including my inability to log in with Firefox; I’ve since found I can log in with Safari), and his reply gave the sense that the project team had exhausted its resources and funding just in putting the repository up on the Web.

The source code is supposed to be on GitHub, but all I see there are four projects: three are forks of third-party code, and the fourth is just some OWL ontology files.

If it were possible to access the raw data by RESTful URLs, even that would be something. So far I haven’t found a way to do that.

In fairness, I have to admit I was part of the failure of UDFR’s predecessor GDFR. The scope of the project was too ambitious, and communication between the Harvard and OCLC developers was a problem.

The most successful format registry out there is PRONOM. Programmatic access to its data is provided with DROID. GDFR and UDFR, with “global” and “unified” in their names, both grew from a desire to have a registry that everyone could participate in. PRONOM accepts contributions, but it’s owned by the UK National Archives, which bothers some people. Even so, it’s the most useful registry there is. The PRONOM site itself expresses the hope that UDFR “will support the requirements of a larger digital preservation community,” and it would still be great if that could happen.

Occasionally some people have discussed the idea of an open wiki for file format information. This would allow more free-form updates than the registries, and if combined with the concept of the semantic wiki, could also be a source of formalized data. I’m inclined to believe that’s the best way to implement an open repository.
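
To give a rough idea of how wiki content could double as formalized data, here is a minimal sketch that pulls key/value fields out of an infobox-style template in wiki markup. The sample text, the template name and the field names are all assumptions made up for illustration, and the parser is deliberately naive; it is not a description of how any existing wiki or registry actually exports its data.

```python
import re

# Hypothetical sample of wiki markup. The template name and field names
# are assumptions for illustration, not copied from a live page.
SAMPLE_WIKITEXT = """
{{Infobox file format
| name = Portable Document Format
| extension = .pdf
| mime = application/pdf
| developer = [[Adobe Systems]]
| standard = ISO 32000-1
}}
'''PDF''' is a file format ...
"""


def parse_infobox(wikitext, template="Infobox file format"):
    """Return the template's 'key = value' fields as a dict.

    Deliberately simple: it does not handle nested templates or
    pipes inside link syntax such as [[Target|label]].
    """
    start = wikitext.find("{{" + template)
    if start == -1:
        return {}
    end = wikitext.find("}}", start)
    if end == -1:
        return {}
    # Everything between the template name and the closing braces.
    body = wikitext[start + 2 + len(template):end]
    fields = {}
    for part in body.split("|"):
        key, sep, value = part.partition("=")
        if not sep:
            continue  # skip fragments that are not 'key = value'
        # Drop the [[ ]] brackets around simple wiki links.
        value = re.sub(r"\[\[|\]\]", "", value).strip()
        fields[key.strip()] = value
    return fields


record = parse_infobox(SAMPLE_WIKITEXT)
for key, value in record.items():
    print(f"{key}: {value}")
```

A record like this could then be loaded into anyone’s own registry with whatever quality control they choose. Real wiki markup is messier than the sample, so a serious tool would want a proper wikitext parser rather than string splitting.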

7 responses to “The state of file format registries”

  1. I share the pain. UDFR is nothing like what I’d hoped, either in its execution or its sustainability plan, of which there appears to be nothing. Like you, I think PRONOM is the best hope, but unfortunately TNA see PRONOM as mainly a place to hold DROID signatures. They are not interested in (or perhaps have no mechanism for) external input (I offered some corrections to the DOCX entries including links to the ISO standards that have not been acted on in over a year).

    If we could persuade TNA to release the content of PRONOM into some kind of supranational endeavour, we might get somewhere.

    Meanwhile, I am hopeful that the iconoclastic Jason Scott from the Archive Team will actually do something useful in November (his declared action month on file formats), although since he came up with the idea, we have discussed what might be done but not yet learned what Jason thinks of it all. And his action rather depends on him!

    Personally, I think some kind of data box on Wikipedia entries on file formats would be the thing; something that can be harvested across into anyone’s private registry, with whatever quality control they choose. I have heard (but don’t know for sure) that Jason doesn’t like Wikipedia, so this may not fly for November. We’ll see.

    I’m occasionally reminded of the fantastic work that Antony Williams did with Chemspider and curating chemistry data on Wikipedia. I’d like to think we can do something similar that would engage those outside the archival community who know about some of these weird formats, and harvest the results into functional services. But Antony had the InChI, the International Chemical Identifier, that links Wikipedia entries (and other data sources) to the corresponding entries in Chemspider. Unfortunately that appears to be one of the missing links in the File Format game!

  2. David Clipsham

    Chris,

    Thank you for your comments.

    It is certainly true that The National Archives is currently primarily concerned with developing DROID signatures, for reasons I explained in my recent blog post (http://blog.nationalarchives.gov.uk/blog/an-introduction-to-the-pronom-contribution-model-and-the-signature-developer-role/); my research is necessarily narrow and focused. However, I would like to direct you to the PRONOM release notes (http://www.nationalarchives.gov.uk/aboutapps/pronom/release-notes.xml). As you will see our releases this week include contributions from NARA, MoL, NLNZ, and an independent contributor. All of these contributions have provided very specific information, including file format examples and suggested signature information. We are very open to external contributors, however as my blog post explains, the contributions that are most successful and quicker to be integrated are those that either directly provide signature information, or that make signature information easier to validate.

    Last year you provided a link to the ISO specification for OOXML. I am sure this may be useful, but to illustrate the balance and what we have to consider – the link that you provided was to ISO/IEC 29500-1:2008. This has since been superseded twice, by ISO/IEC 29500-1:2011, and now ISO/IEC 29500-1:2012, so our dilemma is whether to invest time attempting to keep this information up to date, or to focus on what we do best and allow the ephemera to be generated via other means, such as Wikipedia, which interested parties like yourself can easily edit at will. Indeed a Google search for ‘docx specification’ right now provides in the top five: two Microsoft owned specification pages, the Wikipedia article for OOXML, the ECMA standards page, and an ehow.com discussion of the specification. Would a potentially out-of-date link on PRONOM truly add value?

    We spend our time researching signatures and making these available. There is certainly value in seeing this information connected with the other data out there, but this isn’t the role of PRONOM at the minute. Linked data initiatives may better enable this in future, and there is certainly room to see mash-ups connecting this data for the community in the meantime.

    David

  3. David, thanks for the comment. Believe me, I am sympathetic to your predicament. Maybe it’s about managing expectations; if PRONOM is simply about the management of file format signatures, it should at least say so. Instead we have “The online registry of technical information. PRONOM is a resource for anyone requiring impartial and definitive information about the file formats, software products and other technical components required to support long-term access to electronic records and other digital objects of cultural, historical or business value.” And “PRONOM holds information about software products, and the file formats which each product can read and write. A full description of the individual fields used by PRONOM is available in the system documentation.” And “We actively invite the submission of new information for inclusion on PRONOM.” Now you may not be able to keep that up to date, but the crowd could help you.

    At the moment, in terms of a registry for “representation information”, PRONOM is the best in the world. Sadly, it is also woeful.

  4. I’ve been looking at the infoboxes on Wikipedia and plan to write an experimental tool to work with them, but their problem is that they aren’t very granular and really can’t be without reworking some basic concepts. There’s an entry for PDF, for example, but no way to include specific information about PDF 1.4, 1.7, etc.

    With the Linked Data concept, it wouldn’t take a huge software effort for experts in a set of formats to put together machine-accessible information about their area. Some institution with the necessary clout needs to push a consistent ontology and someone needs to list the best data sources, but that would be the extent of the centralized effort. Never mind these grandiose dreams of unified and global registries; what we need is ODFR, the Open Digital Formats Registry!

  5. The way to change the granularity of the Wikipedia infoboxes is simply to add new pages for each version, and put the information on those pages. PDF would be an interesting example to try to take forward. This would require careful engagement with the current Wikipedians interested in those pages, and it would be necessary to get them on board with the general idea of minting pages at this more fine-grained level. The tension between identifying the conceptual entities and writing accessible encyclopaedia pages might mean that the data is better held somewhere else, like WikiData, but it would be good to try.

    I think that, given the way PRONOM is used by TNA, the pragmatic focus on signatures is perfectly understandable. I believe there are some risks associated with that approach (which I’ll try to blog about at some point), but over the short term these risks are probably negligible. I agree that it would be nice if the PRONOM web pages reflected this focus, instead of promising so much.

    If the rest of the community thinks what TNA is doing is not sufficient, then I think we need to stop telling them what to do and start showing everyone what we need and start getting it done. All the PRONOM data is freely available – they’ve already ‘released the content’. If the enriched information can be crowdsourced, and is as important as we keep saying it is, then we need to prove it, make it great and *then* lobby for merging some or all of it into PRONOM.

    I’m willing to spend some time making this happen, but I think it will only work if we have a really clear idea of what we want out of it and start fairly small so we can focus on the social side first rather than the technical solution. Why is PRONOM-the-signature-registry not enough? Who needs this extra information and what will it be used for? Who are we? What should we build together? And what’s the first step? What is the one field you’d like to see added to or filled out in PRONOM?

    • I just came across this post by Chris, with your comments on it. I haven’t fully assimilated the discussion yet, but I’m adding this comment just so it can be cross-referenced.

      • Thanks for linking that in, although I just want to emphasise the point in my last comment on that page: we do not have to obsess over getting the data model right before we start working out how to build something together. I’d prefer we found ways to co-author weakly structured data first, and then let our data lead us to the model we need.