A new home for JHOVE

Over a decade ago, the Harvard University Libraries took me on as a contractor to start work on JHOVE. Later I became an employee, and JHOVE formed an important part of my work. When I left Harvard, I asked for continued “custody” of JHOVE so I could keep maintaining it, and got it. Over time it became less of a priority for me; there’s only so much time you can devote to something when no one’s paying you to do it.

After a long period of discussion, the Open Preservation Foundation (formerly the Open Planets Foundation) has taken up support of JHOVE. In addition to picking up the open source software, it’s resolved copyright issues in the documentation with Harvard. These were really over boilerplate that no one intended to enforce, but it was still an issue that had to be cleared.

Stephen Abrams, who was the real father of JHOVE, said, “We’re very pleased to see this transfer of stewardship responsibility for JHOVE to the OPF. It will ensure the continuity of maintenance, enhancement, and availability between the original JHOVE system and its successor JHOVE2, both key infrastructural components in wide use throughout the digital library community.”

JHOVE2 was originally supposed to be the successor to JHOVE, but it didn’t get enough funding to cover all the formats that JHOVE covers, so both are used, and the confusion of names is unfortunate. OPF has both in its portfolio. It doesn’t appear to have forked JHOVE to its GitHub repository yet, but I’m sure that’s coming soon.

My own Github repository for JHOVE should now be considered archival. Go forth and prosper, JHOVE.

Pono’s file format

I’ve been seeing weirdly intense hostility to the Pono music player and service. A Business Insider article implies that it’s a scheme by Apple to make you buy your music all over again at higher prices. Another article complains that it will hold “only” 1,872 tracks and protests that “the Average person” (their capitalization) doesn’t hear any improvement. I wonder if some of these people are outraged because they’re confusing Pono with Bono and thinking this is the new copy-proof file format which he said Apple is working on.

In fact, Pono isn’t using any new format and isn’t introducing DRM. Its files are in the well-known FLAC format. FLAC stands for “Free Lossless Audio Codec.” The term technically refers only to the codec, not the container, but it’s usually delivered in a “Native FLAC” container. It can also be delivered in an Ogg container, providing better metadata support and slightly larger files.
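The two containers can be told apart by their first four bytes: a native FLAC stream opens with the ASCII marker "fLaC", and an Ogg stream opens with "OggS". Here's a minimal Python sketch of that check. The function name is made up for illustration, and it looks only at the leading marker; it doesn't validate the rest of the file or confirm which codec an Ogg stream carries.

```python
def sniff_flac_container(header: bytes) -> str:
    """Guess the container type from the first bytes of an audio file."""
    if header.startswith(b"fLaC"):
        # Stream marker that begins a native FLAC file.
        return "native FLAC"
    if header.startswith(b"OggS"):
        # Ogg page marker; the stream inside may be FLAC, Vorbis, or
        # something else, which this sketch does not inspect.
        return "Ogg"
    return "unknown"


if __name__ == "__main__":
    print(sniff_flac_container(b"fLaC\x00\x00\x00\x22"))
    print(sniff_flac_container(b"OggS\x00\x02"))
```

In practice you'd read the first four bytes of the file in binary mode and pass them in.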

The “lossless” part of the name refers to FLAC’s compression. MP3 uses lossy compression, which removes some information, sacrificing a little audio quality to make the file smaller. FLAC discards nothing, so it gives better quality at the cost of a larger file for the same sampling rate and bit resolution. According to CNET, “Pono’s recordings will range from CD-quality 16-bit/44.1kHz to 24-bit/192kHz ‘ultra-high resolution.’” A 192 kHz sampling rate can represent frequencies up to 96 kilohertz (dividing 192 by 2 per the Nyquist theorem), which is way beyond the threshold of human hearing, so it’s understandable that people are skeptical about whether it offers any benefit over a lower sampling rate. Frequencies that high are normally filtered out.
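To put numbers on the two ends of that range, here's a short Python sketch of the arithmetic. It computes the Nyquist limit and the raw, uncompressed PCM data rate; the two-channel (stereo) figure is my assumption, and FLAC's actual file sizes will be smaller than these raw rates by an amount that depends on the music.

```python
def nyquist_limit_hz(sample_rate_hz: int) -> float:
    """Highest frequency a given sampling rate can represent."""
    return sample_rate_hz / 2


def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int,
                     channels: int = 2) -> float:
    """Raw PCM data rate before any compression (stereo assumed by default)."""
    return sample_rate_hz * bits_per_sample * channels / 1000


if __name__ == "__main__":
    for rate, bits in [(44_100, 16), (192_000, 24)]:
        print(rate, bits, nyquist_limit_hz(rate), pcm_bitrate_kbps(rate, bits))
```

The high-resolution end carries about six and a half times as much raw data per second as CD quality, which is where the complaints about track capacity come from.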

FLAC is non-proprietary and DRM-free, and it has an open source reference implementation. Someone could put FLAC into a DRM container, but then why not use a proprietary encoding? Using FLAC is a step forward from MP3, which is patent-encumbered and carries license requirements that effectively lock out free software.

iTunes doesn’t support FLAC files, so the Business Insider claim that Pono is Apple’s way of making you buy music over again is idiotic. It’s like saying Windows 8 is an Apple scheme to make you buy new software.

As the number of gigabytes you can stick in your pocket keeps growing, the need for compression decreases. For many people, the amount of music they can store takes priority over improved sound quality, but some will pay for a high-end player that gives them the best sound possible. I don’t get why this infuriates so many critics. At any rate, the file format shouldn’t scare anyone.

For more discussion of FLAC as it relates to Pono, see “What is FLAC? The high-def MP3 explained” on CNET’s site; the headline is totally wrong, but the article itself is good.

Article on PDF/A validation with JHOVE

An article by Yvonne Friese does a good job of explaining the limitations of JHOVE in validating PDF/A. At the time that I wrote JHOVE, I wasn’t aware how few people had managed to write a PDF validator independent of Adobe’s code base; if I’d known, I might have been more intimidated. It’s a complex job, and adding PDF/A validation as an afterthought added to the problems. JHOVE validates only the file structure, not the content streams, so it can miss errors that make a file unusable. Finally, I’ve never updated JHOVE to PDF 1.7, so it doesn’t address PDF/A-2 or PDF/A-3.

I do find the article flattering; it’s nice to know that even after all these years, “many memory institutions use JHOVE’s PDF module on a daily basis for digital long term archiving.” The Open Preservation Foundation is picking up JHOVE, and perhaps it will provide some badly needed updates.

The misuses of HTML frames

HTML framesets have some good uses, such as including third-party content. They also have misuses, such as disguising third-party involvement.

Recently I needed to set up domain forwarding for a subdomain registered with Godaddy. (The choice of registrar wasn’t my fault.) A couple of options were available, including one that claimed to guarantee that the subdomain would persist through navigation in the address bar. That sounded like a good thing, so I picked it.

At first it seemed to work fine; but when I tried to use the URL of an image on the site, there were weird errors. I soon found out what was going on: Godaddy was wrapping every page referenced by the subdomain in a frameset! This looks like a duck and clicks like a duck, but it isn’t one, and anything that tries to treat HTML as a JPEG file isn’t going to work very well.
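A quick way to catch this kind of thing is to look at the bytes that actually come back rather than trusting the URL's extension. Here's a rough Python sketch; the function name and the sample frameset markup are mine, not the registrar's actual output. It relies on the fact that a JPEG file starts with the bytes FF D8 FF, while a frame wrapper is HTML text containing a frameset element.

```python
def classify_payload(payload: bytes) -> str:
    """Roughly classify a response body as a JPEG, a frameset page, or other."""
    if payload.startswith(b"\xff\xd8\xff"):
        # Start-of-image marker found at the beginning of every JPEG file.
        return "jpeg"
    head = payload[:2048].lower()
    if b"<frameset" in head:
        return "html-frameset"
    if b"<html" in head or b"<!doctype html" in head:
        return "html"
    return "unknown"


if __name__ == "__main__":
    # Hypothetical wrapper page, for illustration only.
    wrapper = b'<html><frameset rows="100%"><frame src="http://example.com/a.jpg"></frameset></html>'
    print(classify_payload(wrapper))
    print(classify_payload(b"\xff\xd8\xff\xe0\x00\x10JFIF"))
```

If a URL ending in .jpg comes back as "html-frameset", something between you and the image is wrapping it.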

Stack Overflow has several reports of people being bitten by this.

Frame wrapping is a good-enough solution for some cases, but when you aren’t told it’s happening, that’s a seriously wrong way to do it. It’s also a security concern, since your domain points at an IP address that you don’t control, and only indirectly at your own site.

This is a blog on file formats, not on irresponsible domain registrars, so the moral here is to realize that framesets aren’t a completely transparent way to provide third-party content. It’s fine to use them, but only if you’re aware that the frameset host and the frame provider are active partners.

Open Planets Foundation is now Open Preservation Foundation

The Open Planets Foundation is now the Open Preservation Foundation. This name change reflects its function; the old name grew out of the Planets project and never really made sense.

For the present, it’s still found on the Internet as openplanetsfoundation.org.

The return of music DRM?

U2, already the most hated band in the world thanks to its invading millions of iOS devices with unsolicited files, isn’t stopping. An article on Time’s website tells us, in vague terms, that

Bono, Edge, Adam Clayton and Larry Mullen Jr believe so strongly that artists should be compensated for their work that they have embarked on a secret project with Apple to try to make that happen, no easy task when free-to-access music is everywhere (no) thanks to piracy and legitimate websites such as YouTube. Bono tells TIME he hopes that a new digital music format in the works will prove so irresistibly exciting to music fans that it will tempt them again into buying music—whole albums as well as individual tracks.

It’s hard to read this as anything but an attempt to bring digital rights management (DRM) back to online music distribution. Users emphatically rejected it years ago, and Apple was among the first to drop it. You haven’t really “bought” anything with DRM on it; you’ve merely leased it for as long as the vendor chooses to support it. People will continue to break DRM, if only to avoid the risk of loss. The illegal copies will offer greater value than legal ones.

It would be nice to think that what U2 and Apple really mean is just that the new format will offer so much better quality that people will gladly pay for it, but that’s unlikely. Higher-quality formats such as AAC have been around for a long time, and they haven’t pushed the old standby MP3 out of the picture. Existing levels of quality are good enough for most buyers, and vendors know it.

Time implies that YouTube doesn’t compensate artists for their work. This is false. YouTube often doesn’t bother with small independent musicians, though it will if it’s reminded hard enough (as Heather Dale found out), but it’s hard to believe that groups with powerful lawyers, such as U2, aren’t being compensated for every view.

DRM and force-feeding of albums are two sides of the same coin of vendor control over our choices. This new move shouldn’t be a surprise.

Best viewed with a big-name browser

A few websites refuse to present content if you use a browser other than one of the four or so big-name ones.

An "unsupported browser" message from Apple's support website

The example shown is what I got when I accessed Apple’s support site with iCab, a relatively obscure browser which I often use. Many of Google’s pages also refuse to deliver content to iCab.

There is a real problem that JavaScript isn’t standardized, and it’s necessary to test with each browser to be confident that a page will work properly. However, if a page sticks with the basics of JavaScript and isn’t trying to do animations, video, or other cutting-edge effects, then any reasonably up-to-date implementation of JavaScript should be able to handle it. It’s reasonable to display a warning if the browser is an untested one, but there’s no reason to block it.

Browsers can impersonate other browsers by setting the User-Agent header, and small-name browsers usually provide that option for getting around these problems. After a couple of tries with iCab, I was able to get through by impersonating Safari. Doing this also has an advantage for privacy; identifying yourself with a little-used browser can greatly contribute to unique identification when you may want anonymity. From the standpoint of good website practices, though, a site shouldn’t be locking browsers out unless there’s an unusual need. Web pages should follow standards so that they’re as widely readable as possible. This is especially important with a “contact support” page.
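The same trick works from a script, since User-Agent is just a request header the client fills in however it likes. Here's a minimal Python sketch using the standard library. The URL is a placeholder, and the Safari-like string is illustrative only, not the exact string of any real Safari release.

```python
from urllib.request import Request

# Illustrative value only; a real browser sends a longer, version-specific string.
SAFARI_LIKE_UA = (
    "Mozilla/5.0 (Macintosh) AppleWebKit/600.1 "
    "(KHTML, like Gecko) Version/8.0 Safari/600.1"
)


def request_as(url: str, user_agent: str) -> Request:
    """Build (but do not send) a request that claims to be another browser."""
    return Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    req = request_as("https://example.com/support", SAFARI_LIKE_UA)
    # urllib stores header names with only the first letter capitalized.
    print(req.get_header("User-agent"))
```

Passing the request to urllib.request.urlopen would send it with that header, and the server has no way to tell the claim is false.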

Apple and Google are both browser vendors. Might we look at this as a way to make entry by new browsers more difficult?

The animated GIF is the new blink tag

In the early days of HTML, the most hated tag was the <blink> tag, which made text under it blink. There were hardly any sensible uses for it, and a lot of browsers now disable it. I just tested it in this post, and WordPress actually deleted the tag from my draft when I tried to save it. (I approve!)

Today, though, the <blink> tag isn’t annoying enough. Now we have the animated GIF. It’s been around since the eighties, but for some reason it’s become much more prevalent recently. It’s the equivalent of waving a picture in your face while you’re trying to read something.

I can halfway understand it when it’s done in ads. Advertisers want to get your attention away from the page you’re reading and click on the link to theirs. What I don’t understand is why people use it in their own pages and user icons. It must be a desire to yell “Look how clever I am!!!” over and over again as the animation cycles.

Fortunately, some browsers provide an option to disable it. Firefox used to let you stop it with the ESC key, but last year removed this feature.

If you think that your web page is boring and adding some animated GIFs is just what’s needed to bring back the excitement — Don’t. Just don’t.

Update: I just discovered that a page that was driving me crazy because even disabling animated GIFs wouldn’t stop it was actually using the <marquee> tag. I believe that tag is banned by the Geneva Convention.

Canvas fingerprinting, the technical stuff

The ability of websites to bypass privacy settings with “canvas fingerprinting” has caused quite a bit of concern, and it’s become a hot topic on the Code4lib mailing list. Let’s take a quick look at it from a technical standpoint. It is genuinely disturbing, but it’s not the unstoppable form of scrutiny some people are hyping it as.

The best article to learn about it from is “Pixel Perfect: Fingerprinting Canvas in HTML5,” by Keaton Mowery and Hovav Shacham at UCSD. It describes the basic technique and some implementation details.

Canvas fingerprinting is based on the <canvas> HTML element. It’s been around for a decade but was standardized for HTML5. In itself, <canvas> does nothing but define a blank drawing area with a specified width and height. It isn’t even like the <div> element, which you can put interesting stuff inside; if all you use is unscripted HTML, all you get is some blank space. To draw anything on it, you have to use JavaScript. There are two APIs available for this: the 2D DOM Canvas API and the 3D WebGL API. The DOM API is part of the HTML5 specification; WebGL relies on hardware acceleration and is less widely supported.

Either API lets you draw objects, not just pixels, to a browser. These include geometric shapes, color gradients, and text. The details of drawing are left to the client, so they will be drawn slightly differently depending on the browser, operating system, and hardware. This wouldn’t be too exciting, except that the API can read the pixels back. The getImageData method of the 2D context returns an ImageData object, which is a pixel map. This can be serialized (e.g., as a PNG image) and sent back to the server from which the page originated. For a given set of drawing commands and hardware and software configuration, the pixels are consistent.

Drawing text is one way to use a canvas fingerprint. Modern browsers use a programmatic description of a font rather than a bitmap, so that characters will scale nicely. The fine details of how edges are smoothed and pixels interpolated will vary, perhaps not enough for any user to notice, but enough so that reading back the pixels will show a difference.
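Once the pixels are read back, turning them into an identifier is simple. Here's a Python sketch of one way a server could do it: hash the raw pixel bytes. The choice of SHA-256 and the two tiny RGBA buffers are my assumptions for illustration; nothing here says how any particular site does it.

```python
import hashlib


def canvas_fingerprint(pixels: bytes) -> str:
    """Reduce a buffer of RGBA pixel bytes to a short, comparable identifier."""
    return hashlib.sha256(pixels).hexdigest()


if __name__ == "__main__":
    # Two hypothetical 2-pixel renderings of the same drawing commands.
    # They differ by one level in one color channel, as different
    # anti-aliasing might produce.
    machine_a = bytes([0, 0, 0, 255, 128, 128, 128, 255])
    machine_b = bytes([0, 0, 0, 255, 127, 128, 128, 255])
    print(canvas_fingerprint(machine_a))
    print(canvas_fingerprint(machine_b))
    print(canvas_fingerprint(machine_a) == canvas_fingerprint(machine_b))
```

A difference far too small to see gives a completely different hash, while identical renderings always give the same one.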

However, the technique isn’t as frightening as the worst hype suggests. First, it doesn’t uniquely identify a computer. Two machines that have the same model and come from the same shipment, if their preinstalled software hasn’t been modified, should have the same fingerprint. It has to be used together with other identifying markers to narrow down to one machine. There are several ways for software to stop it, including blocking JavaScript from offending domains and disabling part or all of the Canvas API. What gets people upset is that neither blocking cookies nor using a proxy will stop it.

Was including getImageData in the spec a mistake? This can be argued both ways. Its obvious use is to draw a complex canvas once and then rubber-stamp it if you want it to appear multiple times; this can be faster than repeatedly drawing from scratch. It’s unlikely, though, that the designers of the spec thought about its privacy implications.

