Update on JHOVE

Yesterday the Open Preservation Foundation held a webinar on JHOVE, presented by Carl Wilson. I was really impressed by the progress he’s made there, and any rumors of JHOVE’s death (including ones I may have contributed to) have been greatly exaggerated.

The big changes include reorganizing the code under Maven and making installation more straightforward. These are both badly needed changes. I never had the opportunity to do them at Harvard, and when I took the code over for a while after leaving there, I focused on fixing bugs rather than fixing the design.

In my comments during the webinar, I pointed out the importance of Stephen Abrams’ contribution, which a lot of people don’t remember. I didn’t create JHOVE; he did. The core application and design principles were already in place when I entered the project. OPF will, I’m sure, give him the credit he deserves.

Possible book on digital preservation tools

Update: It’s clear from the small response that the necessary level of interest isn’t there. Oh, well, that’s what testing the waters is for.

I’m getting the urge to write another book, going the crowdfunding route that has worked twice for me and my readers. My earlier Files that Last got a good response, though the “digital preservation for everygeek” audience proved not to be huge. Tomorrow’s Songs Today, a non-tech book, got more recognition and provided further confirmation that book crowdfunding works. This time I’m aiming squarely at the institutions that engage in preservation — libraries, archives, and universities — and proposing a reference work on preservation software tools. The series I’ve been running on file identification tools was an initial exploration of the idea.

In the book, I’ll significantly expand these articles and broaden their scope. Areas to cover will include:

  • File identification
  • Metadata formats
  • Detection of problems in files
  • Provenance management
  • The OAIS reference model
  • Repository creation and management
  • Keeping obsolescent formats usable

The end of Flash?

There’s a growing call to dump Adobe Flash. With alternatives based on HTML5 becoming standardized, many tech experts think a plugin that has often been a source of security holes is a liability.

Security reporter Brian Krebs has written several articles on Flash:

Browser plugins are favorite targets for malware and miscreants because they are generally full of unpatched or undocumented security holes that cybercrooks can use to seize complete control over vulnerable systems. The Flash Player plugin is a stellar example of this: It is among the most widely used browser plugins, and it requires monthly patching (if not more frequently).

It’s also not uncommon for Adobe to release emergency fixes for the software to patch flaws that bad guys started exploiting before Adobe even knew about the bugs.

In 2010, Steve Jobs wrote an open letter explaining why Apple wouldn’t support Flash on iOS:

Adobe’s Flash products are 100% proprietary. They are only available from Adobe, and Adobe has sole authority as to their future enhancement, pricing, etc. While Adobe’s Flash products are widely available, this does not mean they are open, since they are controlled entirely by Adobe and available only from Adobe. By almost any definition, Flash is a closed system.

Apple has many proprietary products too. Though the operating system for the iPhone, iPod and iPad is proprietary, we strongly believe that all standards pertaining to the web should be open. Rather than use Flash, Apple has adopted HTML5, CSS and JavaScript – all open standards. Apple’s mobile devices all ship with high performance, low power implementations of these open standards. HTML5, the new web standard that has been adopted by Apple, Google and many others, lets web developers create advanced graphics, typography, animations and transitions without relying on third party browser plug-ins (like Flash). HTML5 is completely open and controlled by a standards committee, of which Apple is a member.

Photographic forensics

The FotoForensics site can be a valuable tool in checking the authenticity of an image. It’s easy to alter images with software and try to fool people with them. FotoForensics uses a technique called Error Level Analysis (ELA) to identify suspicious areas and highlight them visually. Playing with it a bit shows me that it takes practice to know what you’re seeing, but it’s worth knowing about if you ever have suspicions about an image.
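
For the curious, here’s a minimal sketch of the core idea in Python, using the Pillow imaging library. The file names and the quality and scale values are placeholder choices of mine, and FotoForensics’ actual implementation is certainly more refined; this only shows what an error-level image is:

    import io
    from PIL import Image, ImageChops

    def ela(path, quality=90, scale=20):
        """Return an amplified error-level image for the file at path."""
        original = Image.open(path).convert("RGB")
        # Resave the picture through JPEG compression at a fixed quality.
        buf = io.BytesIO()
        original.save(buf, "JPEG", quality=quality)
        buf.seek(0)
        resaved = Image.open(buf)
        # The per-pixel difference is the "error level"; it's usually
        # faint, so amplify it to make it visible.
        diff = ImageChops.difference(original, resaved)
        return diff.point(lambda value: min(255, value * scale))

    # Hypothetical usage:
    # ela("suspect.jpg").save("suspect-ela.png")

Regions that were edited after the photo’s last save tend to recompress differently from their surroundings, which is why they stand out in the result.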

Let’s start with an obvious fake, the iconic New Horizons image of Pluto with the equally iconic Disney dog superimposed on it. The ELA result shows a light-colored boundary around most of the cartoon, and the interior has very dark patches. The edge of the dwarf planet has a relatively dim boundary. According to the ELA tutorial, “Similar edges should have similar brightness in the ELA result. All high-contrast edges should look similar to each other, and all low-contrast edges should look similar. With an original photo, low-contrast edges should be almost as bright as high-contrast edges.” So that’s a confirmation that the New Horizons picture has been subtly altered.

Let’s compare that to an analysis of the unaltered image. The “heart” stands out as a dark spot on the ELA image, but its edges aren’t noticeably brighter than the edges of the planet’s (OK, “dwarf planet”) image. The tutorial says that “similar textures should have similar coloring under ELA. Areas with more surface detail, such as a close-up of a basketball, will likely have a higher ELA result than a smooth surface,” so it seems to make sense that the smooth heart (which is something like an ice plain) looks different.

PDF 2.0

As most people who read this blog know, the development of PDF didn’t end with the ISO 32000 (aka PDF 1.7) specification. Adobe has published three extensions to it. They aren’t called PDF 1.8, but they amount to a post-ISO version.

The ISO TC 171/SC 2 technical committee is working on what will be called PDF 2.0. The jump in major revision number reflects the change in how releases are being managed but doesn’t seem to portend huge changes in the format. PDF is no longer just an Adobe product, though the company is still heavily involved in the spec’s continued development. According to the PDF Association, the biggest task right now is removing ambiguities. The specification’s language will shift from describing conforming readers and writers to describing a valid file. This certainly sounds like an improvement. The article mentions that several sections have been completely rewritten and reorganized. What’s interesting is that their chapter numbers have all been incremented by 4 over the PDF 1.7 specification. We can wonder what the four new chapters are.

Leonard Rosenthol gave a presentation on PDF 2.0 in 2013.

As with many complicated projects, PDF 2.0 has fallen behind its original schedule, which called for publication in 2013. The current target for publication is the middle of 2016.

veraPDF validator

The veraPDF Consortium has announced a public prototype of its PDF validation software.

It’s ultimately intended to be “the definitive open source, file-format validator for all parts and conformance levels of ISO 19005 (PDF/A)”; however, it’s “currently more a proof of concept than a usable file format validator.”

New developments in JPEG

A report from the 69th meeting of the JPEG Committee, held in Warsaw in June, mentions several recent initiatives. The descriptions have a rather high buzzword-to-content ratio, but here’s my best interpretation of what they mean. What’s usually called “JPEG” is one of several file formats supported by the Joint Photographic Experts Group, and JFIF would be a more precise name. Not every format name that starts with JPEG refers to “JPEG” files, but if I refer to JPEG without further qualification here, it means the familiar format.

File identification tools, part 9: JHOVE2

The story of JHOVE2 is a rather sad one, but I need to include it in this series. As the name suggests, it was supposed to be the next generation of JHOVE. Stephen Abrams, the creator of JHOVE (I only implemented the code), was still at Harvard, and so was I. I would have enjoyed working on it, getting things right that the first version got wrong. However, Stephen accepted a position with the California Digital Library (CDL), and that put an end to Harvard’s participation in the project. I thought about applying for a position in California but decided I didn’t want to move west. I was on the advisory board but didn’t really do much, and I had no involvement in the programming. I’m not saying I could have written JHOVE2 better, just explaining my relationship to the project.

The institutions that did work on it were CDL, Portico, and Stanford University. There were two problems with the project. The big one was insufficient funding; the money ran out before JHOVE2 could boast a set of modules comparable to JHOVE’s. A secondary problem was usability: JHOVE2 is complex and difficult to work with. I think if I’d been working on the project, I could have helped to mitigate this. I did, after all, add a GUI to JHOVE when Stephen wasn’t looking.

JHOVE has some problems that need fixing. It quits its analysis on the first error. It’s unforgiving about identification; a TIFF file with a validation error simply isn’t a TIFF file, as far as it’s concerned. Its architecture doesn’t readily accommodate multi-file documents. It deals with embedded formats only on a special-case basis (e.g., Exif metadata in non-TIFF files). Its profile identification is an afterthought. JHOVE2 provided better ways to deal with these issues. The developers wrote it from scratch, and it didn’t aim for any kind of compatibility with JHOVE.
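
To make the first of those complaints concrete, here’s a schematic sketch of the difference between quitting on the first error and accumulating errors before reporting. It’s in Python for brevity (JHOVE and JHOVE2 are both Java), and the check functions are hypothetical stand-ins, not anything from either codebase:

    def validate_fail_fast(checks, data):
        # JHOVE's style: the first error ends the analysis.
        for check in checks:
            error = check(data)
            if error is not None:
                return [error]
        return []

    def validate_accumulating(checks, data):
        # The friendlier style: note every error, then report them all.
        errors = []
        for check in checks:
            error = check(data)
            if error is not None:
                errors.append(error)
        return errors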

Redecoration in progress

I’m working on some changes in the appearance of this blog. You may see weird things while this is happening. Sorry for the inconvenience.