Category Archives: commentary

“High-res audio”

We hear a lot about “high-res audio” these days. Sound digitized at 192,000 samples per second must be a lot better than the usual 44,100, right? Well, maybe not.

We can hear sounds only in a certain frequency range. The popular rule of thumb is 20 to 20,000 Hertz, though there’s a lot of variation among people, and very few can hear anything above 20,000. By the Nyquist theorem, a digital recording can capture frequencies only up to half its sampling rate, so 44,100 samples per second already covers everything up to 22,050 Hz.
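The arithmetic is simple enough to sketch in a few lines of Python (nothing assumed beyond the Nyquist relation above):

```python
# Nyquist: the highest frequency a sampling rate can represent
# is half that rate.
def nyquist_limit_hz(sample_rate_hz):
    return sample_rate_hz / 2

for rate in (44_100, 96_000, 192_000):
    print(f"{rate:>7} samples/s -> up to {nyquist_limit_hz(rate):>8.0f} Hz")

# Output:
#   44100 samples/s -> up to    22050 Hz
#   96000 samples/s -> up to    48000 Hz
#  192000 samples/s -> up to    96000 Hz
# Human hearing tops out around 20,000 Hz, so even 44,100
# already covers the audible range with room to spare.
```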
Continue reading

Which way to personal digital preservation?

Today I came across a video from the Library of Congress on “Why digital preservation is important for you.” Anyone following its advice will certainly have a better chance of keeping their files alive and organized for a long time. The only question is: Who’s going to follow that advice?
Continue reading

File fuzzing

Recently I came across the term “fuzzing” for intentionally damaging files to test the software that reads them. Most of the material I’ve found doesn’t provide a useful introduction; it assumes that if you know the term, you already understand something about it. One good article is “Fuzzing — Mutation vs. Generation” on the Infosec website. According to that article, fuzzing denotes the response to file changes rather than the changes themselves, but I’m seeing the term used mostly in the latter sense.
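For the mutation flavor, the core idea fits in a few lines: take a valid file, randomly corrupt a few bytes, and watch how the reader responds. A minimal sketch (the parser argument stands in for whatever reader you want to test):

```python
import random

def mutate(data, n_flips=8, seed=None):
    """Return a copy of data with n_flips randomly chosen bytes replaced."""
    rng = random.Random(seed)
    buf = bytearray(data)
    for _ in range(n_flips):
        buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)

def fuzz(path, parser, trials=100):
    """Feed randomly damaged copies of a file to a parser, reporting crashes."""
    original = open(path, "rb").read()
    for trial in range(trials):
        damaged = mutate(original, seed=trial)  # seeded, so failures reproduce
        try:
            parser(damaged)
        except Exception as exc:  # an uncaught exception is what we're hunting
            print(f"trial {trial}: {type(exc).__name__}: {exc}")

# Example, if Pillow is installed: hammer on its PNG reader.
#   from PIL import Image; import io
#   fuzz("test.png", lambda b: Image.open(io.BytesIO(b)).load())
```

A generation fuzzer, by contrast, builds malformed files from scratch using knowledge of the format, rather than mutating real samples.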
Continue reading

A personal case study in digital obsolescence

The nineties saw huge changes in personal computing, as operating systems became more complex, Internet connections became common, and the World Wide Web appeared. This meant a lot of instability as formats came and went.

This past weekend I discovered a CD-ROM in my closet with the production files for a small-run songbook, The Pegasus Winners (optimistically called “Volume 1”), that I produced in 1994. The good news is that the CD is still readable. The bad news is that I can’t read most of the files. The not-so-bad news is that I could probably recover them with moderate effort.
Continue reading

Unicode security mechanisms

Unicode is a great thing, but sometimes its thoroughness poses problems. Many scripts include characters that look exactly like common ASCII characters in most fonts, and these can be used to spoof domain names; this is sometimes called a homograph attack or script spoofing. For instance, someone might register the domain gοοgle.com, which looks a lot like “google.com” but actually uses the Greek letter omicron instead of the Latin letter o. (Search this page in your browser for “google” if you don’t believe me.) Such tricks can lure unwary users into a phishing site. A real-life example, which didn’t even require more than ASCII, was a site called paypaI.com — that’s a capital I instead of a lower-case L, and they look the same in some fonts. That was way back in 2000.
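One line of defense is to look at which Unicode scripts a name draws its characters from. Python’s unicodedata module is enough for a rough check (a sketch; real mixed-script detection, as browsers do it, is more involved):

```python
import unicodedata

def suspicious_chars(domain):
    """List the non-ASCII characters in a name, with their Unicode names.

    A crude heuristic, but the character name alone is enough to expose
    an omicron masquerading as an 'o'.
    """
    return [(ch, unicodedata.name(ch, "UNKNOWN"))
            for ch in domain if ord(ch) > 127]

print(suspicious_chars("gοοgle.com"))
# [('ο', 'GREEK SMALL LETTER OMICRON'), ('ο', 'GREEK SMALL LETTER OMICRON')]
```

Modern browsers apply checks along these lines and fall back to displaying suspicious mixed-script names in their Punycode (xn--) form.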
Continue reading

DOTS: Almost a datalith

A lot of people in digital preservation are convinced a “digital dark age” is nothing to worry about. I’ve consistently disagreed with this. The notion that archivists will replace outdated digital media every decade or two through the centuries is a pipe dream. Records have always gone through periods of neglect, and they will in the future. Periods of unrest will happen; authorities will try to suppress inconvenient history; groups like Daesh will set out to destroy everything that doesn’t match their worldview; natural disasters will disrupt archiving.

I’ve proposed the idea of a “datalith,” a data record made out of rock or equivalent material, optically readable and self-explanatory as long as a common language survives. DOTS, the Digital Optical Technology System, is burned on tape rather than engraved in stone, but in every other respect it matches my vision of a datalith. It can store digital images in any format, but it can also record them as directly visible pictures. The Long Now Foundation explains:
Continue reading

Iterating a directory in command line Tika

Apache Tika is best used as a library, with your own code wrapped around it. Its GUI application is a toy, and its command-line version isn’t all that great either. The command line can be improved with a little scripting, though.
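For instance, a small Python wrapper can run the command-line app over every file in a directory (a sketch, assuming a local copy of tika-app.jar; the -t flag extracts plain text):

```python
import pathlib
import subprocess

TIKA_JAR = "tika-app.jar"  # adjust to wherever your copy lives

def extract_all(directory):
    """Run Tika's command line (-t: plain text) over each file in a directory."""
    for path in pathlib.Path(directory).iterdir():
        if not path.is_file():
            continue
        result = subprocess.run(
            ["java", "-jar", TIKA_JAR, "-t", str(path)],
            capture_output=True, text=True)
        if result.returncode != 0:
            print(f"Tika failed on {path}: {result.stderr.strip()}")
            continue
        path.with_name(path.name + ".txt").write_text(result.stdout)

extract_all("documents")  # hypothetical directory name
```

Swapping -t for -m would collect metadata instead of extracted text.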
Continue reading

The FLIF format

New image file formats keep turning up, taking advantage of advances in compression technology. One of the latest is FLIF, the Free Lossless Image Format. Its developers claim it outcompresses PNG, lossless JPEG 2000, lossless WebP, and lossless BPG. Though it has only a lossless mode, they claim that “FLIF works well on any kind of image, so the end-user does not need to try different algorithms and parameters.”
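Claims like these are easy to spot-check for the formats your tools already support. FLIF itself isn’t readable by the usual Python imaging libraries, but the same kind of comparison works with Pillow’s encoders, with lossless WebP standing in for the newer formats (a sketch):

```python
import io
from PIL import Image  # pip install Pillow

def lossless_sizes(path):
    """Compare the compressed size of one image across lossless encoders."""
    img = Image.open(path).convert("RGBA")  # normalize so both encoders accept it
    sizes = {}
    for fmt, kwargs in [("PNG", {}), ("WEBP", {"lossless": True})]:
        buf = io.BytesIO()
        img.save(buf, fmt, **kwargs)
        sizes[fmt] = buf.tell()
    return sizes

print(lossless_sizes("photo.png"))  # hypothetical file; exact sizes vary by image
```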
Continue reading

The coming of WebP (or not)

The WebP image format has been around for about five years, but till recently it’s been mostly a curiosity. I last blogged about it in 2013, when it didn’t have very wide support. Since then most browsers have adopted it, and now Google+ is making more use of it (no surprise, since Google is the format’s principal backer). It promises smarter lossy compression than JPEG and smaller file sizes for the same image quality.
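The size-versus-quality tradeoff is easy to explore, since Pillow has supported WebP for some time (a sketch; the quality values are arbitrary sample points):

```python
import io
from PIL import Image  # pip install Pillow

img = Image.open("photo.jpg")  # hypothetical test image
for quality in (50, 75, 90):
    buf = io.BytesIO()
    img.save(buf, "WEBP", quality=quality)  # lossy WebP at this quality setting
    print(f"quality {quality}: {buf.tell():,} bytes")
# Lower settings shrink the file; whether the result still looks acceptable
# is a judgment you have to make by eye.
```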
Continue reading

A sock puppet mystery

The SourceForge repository for JHOVE (which is, by the way, obsolete; here’s the active repository) includes three short reviews, each giving it five stars and making the same generic comment, dated on three successive days. Those are clear signs of sock-puppet accounts.

I can understand why people post glowing but fake reviews of their own projects, but I’m not responsible for these, and since I was the only person working on JHOVE at the time, I can’t imagine who else had an incentive to promote it. Checking on one of these accounts, “rusik1978,” I found similar reviews on many other SourceForge projects. If they linked back to something, it would make sense, but they don’t.

I’ve learned from this that sock-puppet reviews don’t necessarily prove the project owner is faking praise. Maybe that’s the point: to make it harder to identify the actual paid reviews.