
Photographic forensics

The FotoForensics site can be a valuable tool for checking the authenticity of an image. It’s easy to alter images with software and try to fool people with them. FotoForensics uses a technique called Error Level Analysis (ELA) to identify suspicious areas and highlight them visually. Playing with it a bit shows me that it takes practice to interpret what you’re seeing, but it’s a tool worth knowing about if you ever have suspicions about an image.
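
The underlying idea is simple enough to sketch in a few lines of code. Here’s a rough illustration, using Python and the Pillow library, of how an ELA image can be produced; it’s my own simplification rather than FotoForensics’ actual code, and the file name, quality level, and brightness factor are arbitrary choices.

```python
import io
from PIL import Image, ImageChops, ImageEnhance

def ela(path, quality=95, scale=20):
    original = Image.open(path).convert("RGB")
    # Resave the image as a JPEG at a known quality level, in memory.
    buf = io.BytesIO()
    original.save(buf, "JPEG", quality=quality)
    buf.seek(0)
    resaved = Image.open(buf)
    # The per-pixel difference shows how much each region changes on resaving.
    # Regions last saved at a different quality -- pasted-in content, for
    # example -- tend to stand out from the rest of the image.
    diff = ImageChops.difference(original, resaved)
    # The raw difference is nearly black, so brighten it to make it visible.
    return ImageEnhance.Brightness(diff).enhance(scale)

# ela("suspect.jpg").show()  # hypothetical file name
```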

[Image: New Horizons Pluto picture with cartoon Pluto]
Let’s start with an obvious fake, the iconic New Horizons image of Pluto with the equally iconic Disney dog superimposed on it. The ELA analysis shows a light-colored boundary around most of the cartoon, and the interior has very dark patches. The edge of the dwarf planet has a relatively dim boundary. According to the ELA tutorial, “Similar edges should have similar brightness in the ELA result. All high-contrast edges should look similar to each other, and all low-contrast edges should look similar. With an original photo, low-contrast edges should be almost as bright as high-contrast edges.” So that’s a confirmation that the New Horizons picture has been subtly altered.

Let’s compare that to an analysis of the unaltered image. The “heart” stands out as a dark spot on the ELA image, but its edges aren’t noticeably brighter than the edges of the planet’s (OK, “dwarf planet”) image. The tutorial says that “similar textures should have similar coloring under ELA. Areas with more surface detail, such as a close-up of a basketball, will likely have a higher ELA result than a smooth surface,” so it seems to make sense that the smooth heart (which is something like an ice plain) looks different.

The general color effect seems to be what the tutorial calls “rainbowing,” described as “a visible separation between the luminance and chrominance channels as a blue/purple/red coloring.” It says this often indicates an Adobe product was used to save the image, but doesn’t tell us anything about whether the image has been altered.

Let’s look now at the analysis of a picture from my camera. It has objects in sharp contrast with the background and a multicolored carpet pattern. The edges look uniformly light; the carpet shows a certain amount of rainbowing. If I had reason to be suspicious about the picture, I don’t know whether this would increase my confidence or not. If you looked at a lot of fakes and real pictures, you could probably start to tell after a while. The tutorial page on mistakes mentions some ways that people can misread the results.

This could be a useful tool for people who manage images from uncertain sources.

TIFF/A

TIFF has been around for a long time. Its latest official specification, TIFF 6.0, dates from 1992. The format hasn’t held still for 23 years, though. Adobe has issued several “technical notes” describing important changes and clarifications. Software developers, by general consensus, have ignored the requirement that value offsets have to be on a word boundary, since it’s a pointless restriction with modern computers. Private tags are allowed, and lots of different sources have defined new tags. Some of them have achieved wide acceptance, such as the TIFFTAG_ICCPROFILE tag (34675), which fills the need to associate ICC color profiles with images. Many applications use the EXIF tag set to specify metadata, but this isn’t part of the “standard” either.
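
To make the word-boundary point concrete, here’s a rough sketch of how a checker might flag offending value offsets. It’s my own illustration, not code from any existing tool; it assumes a classic (non-Big) TIFF, looks only at the first IFD, and skips sanity checks such as the magic number.

```python
import struct

# Byte sizes of the TIFF 6.0 field types (BYTE, ASCII, SHORT, LONG, RATIONAL, ...).
TYPE_SIZES = {1: 1, 2: 1, 3: 2, 4: 4, 5: 8, 6: 1, 7: 1, 8: 2, 9: 4, 10: 8, 11: 4, 12: 8}

def odd_value_offsets(path):
    """Yield (tag, offset) pairs from the first IFD whose value offsets are odd."""
    with open(path, "rb") as f:
        data = f.read()
    endian = "<" if data[:2] == b"II" else ">"        # byte-order mark
    ifd = struct.unpack(endian + "I", data[4:8])[0]   # offset of the first IFD
    count = struct.unpack(endian + "H", data[ifd:ifd + 2])[0]
    for i in range(count):
        entry = data[ifd + 2 + 12 * i : ifd + 14 + 12 * i]
        tag, ftype, n = struct.unpack(endian + "HHI", entry[:8])
        if TYPE_SIZES.get(ftype, 1) * n > 4:
            # The value doesn't fit in the entry, so the last four bytes are
            # an offset -- which the spec says must fall on a word boundary.
            offset = struct.unpack(endian + "I", entry[8:12])[0]
            if offset % 2:
                yield tag, offset
```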

In other words, TIFF today is the sum of a lot of unwritten rules.

It’s generally not too hard to deal with the chaos and produce files that all well-known modern applications can handle. On the other hand, it’s easy to produce a perfectly legal TIFF file that only your own custom application will handle as you intended. People putting files into archives need some confidence in their viability. Assumptions which are popular today might shift over a decade or two. Variations in metadata conventions might cause problems.

A restricted subset of PDF, called PDF/A, specifies rules that help to guarantee the long-term readability of PDF files that follow them. A group of academic archivists has begun work on an initiative to do the same for TIFF, calling it TIFF/A by analogy. It’s supported by the PREFORMA project.

So far it’s still in the stage of gathering support. The site gives September 1, 2015 as the date to kick off discussions and March 1, 2016 as the target for an ISO submission. A white paper by Peter Fornaro and Lukas Rosenthaler at the University of Basel discusses the technical issues. It’s obviously just a first shot at the problem. At one point it states that “it is obvious that the TIFF-extensions [anything which isn’t baseline TIFF] should not be used for long term archival,” but then it permits the ICC profile tag (in fact, says it’s mandatory for color images) and the EXIF tag, and allows the IPTC metadata tag though it doesn’t recommend it.

The white paper doesn’t address the word alignment issue. This is something the TIFF/A consortium needs to take a stand on; either it should repudiate the word alignment requirement, or it should affirm it. If it sticks to strict conformance to the spec, a great many files won’t qualify.

TIFF/A presents a different set of challenges from PDF/A. On the one hand, TIFF is a vastly simpler format than PDF. (Trust me; I’ve written validators for both.) On the other hand, PDF is an ISO standard that hasn’t experienced two decades of entropy. I wish the people involved every success.

Funding for preservation software development

The Open Preservation Foundation (formerly the Open Planets Foundation) is launching a new model for funding the development of preservation-related software. Quoting from the announcement:

‘Over the last year the OPF has established a solid foundation for ensuring the sustainability of digital preservation technology and knowledge,’ explains Dr. Ross King, Chair of the OPF Board. ‘Our new strategic plan was introduced in November 2014 along with community surveys to establish the current state of the art. We developed our annual plan in consultation with our members and added JHOVE to our growing software portfolio. The new membership and software supporter models are the next steps towards realising our vision and mission.’ …

The software supporter model allows organisations to support individual digital preservation software products and ensure their ongoing sustainability and maintenance. We are launching support for JHOVE based on its broad adoption and need for active stewardship. It is also a component in several leading commercial digital preservation solutions. While it remains fully open source, supporters can steer our stewardship and maintenance activities and receive varying levels of technical support and training.

I have a selfish personal interest in spreading the word. At the moment, I’m between contracts, and I wouldn’t mind getting some funding from OPF to resume development work on JHOVE. I know its code base better than anyone else, I worked on it without pay as a hobby for a year or so after leaving Harvard, and I’d enjoy working on it some more if I could just get some compensation. This is possible, but only if there’s support from outside.

US libraries have been rather insular in their approach to software development. They’ll use free software if it’s available, but they aren’t inclined to help fund it. If they could each set aside some money for this purpose, it would help assure the continued creation and maintenance of the open source software which is important to their mission.

How about it, Harvard?

Dataliths vs. the digital dark age

Digital technology has allowed us to store more information at less cost than ever before. At the same time, it’s made this information very fragile in the long term. A book can sit in an abandoned building for centuries and still be readable. Writing carved in stone can last for thousands of years. The chances that your computer’s disk will be readable in a hundred years are poor, though. You’ll have to go to a museum for hardware and software to read it. Once you have all that, it probably won’t even spin up. If it does, the bits may be ruined. In five hundred years, its chance of readability will be essentially zero.

Archivists are aware of this, of course, and they emphasize the need for continual migration. Every couple of decades, at least, stored documents need to be moved to new media and perhaps updated to a new format. Digital copies, if made with reasonable precautions, are perfect. This approach means that documents can be preserved forever, provided the chain never breaks.

Fortunately, there doesn’t have to be just one chain. The LOCKSS (Lots Of Copies Keep Stuff Safe) principle means that the same document can be stored in archives all over the world. As long as just one of them keeps propagating it, the document will survive.

Does this make us really safe from the prospect of a digital dark age? Will a substantial body of today’s knowledge and literature survive until humans evolve into something so different that it doesn’t matter any more? Not necessarily. To be really safe, information needs to be stored in a form that can survive long periods of neglect. We need dataliths.

Several scenarios could lead to neglect of electronic records for a generation or more. A global nuclear war could destroy major institutions, wreck electronic devices with EMPs, and force people to focus on staying alive. An asteroid hit or a supervolcano eruption could have a similar effect. Humanity might survive these things but take a century or more to return to a working technological society.

Less spectacularly, periods of intense international fear or attempts to manage the world economy might create an unfriendly climate for preserving records of the past. The world might go through a period of severe censorship. Lately religious barbarians have been sacking cities and destroying historical records that don’t fit with their doctrines. Barbarians generally burn themselves out quickly, but “enlightened” authorities can also decide that all “unenlightened” ideas should be banished for the good of us all. Prestigious institutions can be especially vulnerable to censorship because of their visibility and dependence on broad support. Even without legal prohibition, archival culture may shift to decide that some ideas aren’t worth preserving. Either way, it won’t be called censorship; it will be called “fair speech,” “fighting oppression,” “the right to be forgotten,” or some other euphemism that hasn’t yet lost credibility.

How great is the risk of these scenarios? Who can say? To calculate odds, you need repeatable causes, and the technological future will be a lot different from the comparatively low-tech past. But if we’re thinking on a span of thousands of years, we can’t dismiss it as negligible. Whatever may happen, the documents of the past are too valuable to be maintained only by their official guardians.

Hard copy will continue to be important. It’s also subject to most of the forms of loss I’ve mentioned, but some of it can survive for many years with no attention. As long as someone can understand the language it’s written in, or as long as its pictures remain recognizable, it has value. However, we can’t back away from digital storage and print everything we want to preserve. The advantages of bits are clear: easy reproduction and high storage density. This isn’t to say that archivists should abandon the strategy of storing documents with the best technology and migrating them regularly. In good times, that’s the most effective approach. But the bigger strategy should include insurance against the bad times, a form of storage that can survive neglect. Ideally it shouldn’t be in the hands of Harvard or the Library of Congress, but of many “guerilla archivists” acting on their own.

This strategy requires a storage medium which is highly durable and relatively simple to read. It doesn’t have to push the highest edges of storage density. It should be the modern equivalent of the stone tablet, a datalith.

There are devices which tend in this direction. Millenniata claims to offer “forever storage” in its M-Disc. Allegedly it has been “proven to last 1,000 years,” though I wonder how they managed to start testing in the Middle Ages. A DVD uses a complicated format, though, so it may not be readable even if it physically lasts that long. Hitachi has been working on quartz glass data storage that could last for millions of years and be read with an optical microscope. This would be the real datalith. As long as some people still know today’s languages, pulling out ASCII data should be a very simple decipherment task. Unfortunately, the medium isn’t commercially available yet. Others have worked on similar ideas, such as the Superman memory crystal. Ironically, that article, which proclaims “the first document which will likely survive the human race,” has a broken link to its primary source less than two years after its publication.

Hopefully datalith writers will be available before too long, and after a few years they won’t be outrageously expensive. The records they create will be an important part of the long-term preservation of knowledge.

Honda MP3 player defect

Recently Eyal Mozes hired me to determine why the sound system in his new Honda Civic wouldn’t play some MP3 files. This was a chance to do some interesting investigative work, and I’ve found what I think is a previously unidentified product defect.

He sent me twenty MP3 files, ten of which would play on his system and ten of which wouldn’t. First I ran some preliminary tests, establishing that iTunes, QuickTime Player, Audacity, and even my older Honda stereo had no trouble with any of the files. Then I ran ExifTool on them and compared the output to see what the difference was.

The first thing I looked for was variable bitrate encoding, which is the most common cause of failure to play MP3 files. None of the files used it. Looking more closely, I saw that all the files had both ID3 V1 and V2 metadata, which is legitimate. In every file he’d flagged as non-playable, though, the length of the ID3 V2 segment was zero, while all the playable ones had ID3 V2 segments with some data fields. I verified with a hex dump that the non-playable files started with a ten-byte empty ID3 V2.3 segment.
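
For anyone who wants to test their own files, here’s a rough sketch of the check. It isn’t the process I actually used (that was ExifTool plus a hex dump), and the file names are placeholders.

```python
def id3v2_tag_size(path):
    """Return the declared ID3v2 tag size in bytes, or None if there's no ID3v2 header."""
    with open(path, "rb") as f:
        header = f.read(10)
    if len(header) < 10 or header[:3] != b"ID3":
        return None
    # Bytes 6-9 hold a 28-bit "syncsafe" integer: four bytes, seven bits each.
    return (header[6] << 21) | (header[7] << 14) | (header[8] << 7) | header[9]

# A result of 0 means a ten-byte header with no frames after it -- the
# pattern the Honda player appears to choke on.
for name in ("playable.mp3", "non_playable.mp3"):  # hypothetical file names
    print(name, id3v2_tag_size(name))
```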

I continued looking for any other systematic differences, but that was the only one I found. It’s highly likely that the MP3 software in Eyal’s car — and, therefore, in many Hondas and maybe even other makes — has a bug that makes a file fail to play if there’s a zero-length ID3 V2 segment. (Update: Just to be clear, this is a legitimate if unusual case, not a violation of the format.)

Eyal had gone to arbitration to get his vehicle returned under the warranty; Honda’s response was unimpressive. Initially, he told me, Honda claimed that the non-playing files were under DRM. This is nonsense; there’s no such thing as DRM on MP3 files. They withdrew this claim but then asserted that “compatibility issues” related to encoding were the problem, without giving any specifics. The ID3 header in an MP3 file is unrelated to the encoding of the file, and I didn’t see any systematic differences in encoding parameters between the playable and non-playable files. Honda claimed to be unable to tell how the files were encoded. They may not have been able to tell what software was used, but the only “how” that’s relevant is the encoding parameters.

This problem won’t make your brakes fail or your wheels fall off, but Honda should still treat it as a product defect, come up with a fix, and offer it to customers for free. If they can upgrade the firmware, that’s great; if not, they’ll have to issue replacement units. The bug sounds like one that’s easy to fix once the programmers are aware of it. The testing just wasn’t thorough enough to catch this case.

If anyone wants to hire me for more file format forensic work, let me know. This was fun to investigate.

Pono’s file format

I’ve been seeing weirdly intense hostility to the Pono music player and service. A Business Insider article implies that it’s a scheme by Apple to make you buy your music all over again at higher prices. Another article complains that it will hold “only” 1,872 tracks and protests that “the Average person” (their capitalization) doesn’t hear any improvement. I wonder if some of these people are outraged because they’re confusing Pono with Bono and thinking this is the new copy-proof file format which he said Apple is working on.

In fact, Pono isn’t using any new format and isn’t introducing DRM. Its files are in the well-known FLAC format. FLAC stands for “Free Lossless Audio Codec.” The term technically refers only to the codec, not the container, but it’s usually delivered in a “Native FLAC” container. It can also be delivered in an Ogg container, providing better metadata support and slightly larger files.
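
The container difference is easy to see in a file’s first four bytes: a native FLAC stream starts with the marker “fLaC”, while an Ogg file starts with “OggS”. A trivial check (the file name is a placeholder):

```python
def flac_container(path):
    with open(path, "rb") as f:
        magic = f.read(4)
    if magic == b"fLaC":
        return "native FLAC"
    if magic == b"OggS":
        return "Ogg container (FLAC or another codec inside)"
    return "not a FLAC or Ogg file"

print(flac_container("track.flac"))
```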

The “lossless” part of the name refers to FLAC’s compression. MP3 uses lossy compression, which removes some information, sacrificing a little audio quality to make the file smaller. FLAC keeps all the audio information, giving better quality at the cost of a larger file for the same sampling rate and bit resolution. According to CNET, Pono’s recordings “will range from CD-quality 16-bit/44.1kHz to 24-bit/192kHz ‘ultra-high resolution.’” 96 kilohertz (dividing 192 by 2 per the Nyquist theorem) is way beyond the threshold of human hearing, so it’s understandable that people are skeptical about whether it offers any benefit over a lower sampling rate. Frequencies that high are normally filtered out.
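
To put rough numbers on that, here’s the standard arithmetic (nothing specific to Pono): the Nyquist limit is half the sampling rate, and the uncompressed data rate is sampling rate times bit depth times channels.

```python
def pcm_stats(sample_rate_hz, bit_depth, channels=2):
    nyquist_khz = sample_rate_hz / 2 / 1000                        # highest representable frequency
    raw_mbps = sample_rate_hz * bit_depth * channels / 1_000_000   # data rate before FLAC compression
    return nyquist_khz, raw_mbps

for rate, depth in ((44_100, 16), (192_000, 24)):
    nyq, mbps = pcm_stats(rate, depth)
    print(f"{depth}-bit/{rate / 1000:g} kHz: up to {nyq:g} kHz, ~{mbps:.1f} Mbit/s uncompressed")
```

That works out to roughly 1.4 Mbit/s for CD quality and 9.2 Mbit/s for 24-bit/192kHz; FLAC compression typically shaves something like half of that off, which still leaves files far larger than MP3s.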

FLAC is non-proprietary and DRM-free, and it has an open source reference implementation. Someone could put FLAC into a DRM container, but then why not use a proprietary encoding? Using FLAC is a step forward from the patent-encumbered MP3, with license requirements that effectively lock out free software.

iTunes doesn’t support FLAC files, so the Business Insider claim that Pono is Apple’s way of making you buy music over again is idiotic. It’s like saying Windows 8 is an Apple scheme to make you buy new software.

As the number of gigabytes you can stick in your pocket keeps growing, the need for compression decreases. For many people, the amount of music they can store takes priority over improved sound quality, but some will pay for a high-end player that gives them the best sound possible. I don’t get why this infuriates so many critics. At any rate, the file format shouldn’t scare anyone.

For more discussion of FLAC as it relates to Pono, see “What is FLAC? The high-def MP3 explained” on CNET’s site; the headline is totally wrong, but the article itself is good.

Article on PDF/A validation with JHOVE

An article by Yvonne Friese does a good job of explaining the limitations of JHOVE in validating PDF/A. At the time that I wrote JHOVE, I wasn’t aware how few people had managed to write a PDF validator independent of Adobe’s code base; if I’d known, I might have been more intimidated. It’s a complex job, and adding PDF/A validation as an afterthought added to the problems. JHOVE validates only the file structure, not the content streams, so it can miss errors that make a file unusable. Finally, I’ve never updated JHOVE to PDF 1.7, so it doesn’t address PDF/A-2 or 3.

I do find the article flattering; it’s nice to know that even after all these years, “many memory institutions use JHOVE’s PDF module on a daily basis for digital long term archiving.” The Open Preservation Foundation is picking up JHOVE, and perhaps it will provide some badly needed updates.

The misuses of HTML frames

HTML framesets have some good uses, such as including third-party content. They also have misuses, such as disguising third-party involvement.

Recently I needed to set up domain forwarding for a subdomain registered with GoDaddy. (The choice of registrar wasn’t my fault.) A couple of options were available, including one that claimed to guarantee that the subdomain would stay in the address bar as visitors navigated the site. That sounded like a good thing, so I picked it.

At first it seemed to work fine, but when I tried to use the URL of an image on the site, there were weird errors. I soon found out what was going on: GoDaddy was wrapping every URL under the subdomain in a frameset! The framed page looks like a duck and clicks like a duck, but it isn’t one; a request for an image URL gets back HTML, and anything that tries to treat HTML as a JPEG file isn’t going to work very well.
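
Here’s a sketch of the kind of check that exposes the problem; the URL is a placeholder, not the actual subdomain involved. Request something that should be an image and look at what actually comes back.

```python
from urllib.request import urlopen

# With masked forwarding in place, a URL that should return a JPEG
# returns an HTML frameset instead.
with urlopen("http://subdomain.example.com/images/photo.jpg") as resp:
    print(resp.headers.get("Content-Type"))  # "text/html" rather than "image/jpeg"
    print(resp.read(120))                    # HTML markup, not JPEG data (0xFFD8...)
```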

Stack Overflow has several reports of people being bitten by this.

Frame wrapping is a good-enough solution for some cases, but doing it without telling the customer is seriously wrong. It’s also a security concern, since your domain points at an IP address you don’t control, and only indirectly at your own site.

This is a blog on file formats, not on irresponsible domain registrars, so the moral here is to realize that framesets aren’t a completely transparent way to provide third-party content. It’s fine to use them, but only if you’re aware that the frameset host and the frame provider are active partners.