New open-source file validation project

The VeraPDF Consortium has announced that it has begun the prototyping phase for a new open-source validator of PDF/A. This is a piece of the PREFORMA (PREservation FORMAts) project; other branches will cover TIFF and audio-visual formats. Participants in VeraPDF are the Open Preservation Foundation, the PDF Association, the Digital Preservation Coalition, Dual Lab, and Keep Solutions.

Documents are available, including a functional and technical specification. It aims at being the “definitive” tool for determining if a PDF document conforms to the ISO 19005 requirements. It will separate the PDF parser from the higher-level validation, so a different parser can be plugged in.

Validating PDF is tough In JHOVE, I designed PDF/A validation as an afterthought to the PDF module. PDF/A requirements affect every level of the implementation, so that approach led to problems that never entirely went away. Making PDF/A validation a primary goal should help greatly, but having it sit on top of and independent from the PDF parser may introduce another form of the same problem.

PDF files can include components which are outside the spec, and PDF/A-3 permits their inclusion. This means that really validating PDF/A-3 is an open-ended task. Even in the earlier version of PDF/A, not everything that can be put into a file is covered by the PDF specification per se. The specification addresses this by providing for extensibility; add-ons can address these aspects as desired. In particular, the core validator won’t attempt thorough validation of fonts.

A Metadata Fixer will not just check documents for conformance, but in some cases will perform the necessary fixes to make a file PDF/A compliant.

JHOVE ignores the content streams, focusing only on the structure, so it could report a thoroughly broken file as well-formed and valid. JHOVE2 doesn’t list PDF in its modules. Analyzing the content stream data is a big task. In general, the project looks hugely ambitious, and not every ambitious digital preservation project has reached a successful end. If this one does, it will be a wonderful accomplishment.

Update on the JHOVE handover

There’s a brief piece by Becky McGuinness in D-Lib Magazine on the handover of JHOVE to the Open Preservation Foundation. It describes upcoming plans:

During March the OPF will be working with Portico and other members to complete the transfer of JHOVE to its new home. The latest code base will move to the OPF GitHub organisation page. All documentation, source code files, and full change history will be publicly available, alongside other OPF supported software projects, including JHOVE2, Fido, jpylyzer, and the SCAPE project tools.

Once the initial transfer is complete the next step will be to set up a continuous integration (CI) build on Travis, an online CI service that’s integrated with GitHub. This will ensure that all new code submissions are built and tested publicly and automatically, including all external pull requests. This will establish a firm foundation for future changes based on agile software development best practises.

With this foundation in place OPF will test and incorporate JHOVE fixes from the community into the new project. Several OPF members have already developed fixes based on their own automated processes, which they will be releasing to the community. Working as a group these fixes will be examined and tested methodically. At the same time the OPF’s priority will be to produce a Debian package that can be downloaded and installed from its apt repository.

Following the transfer OPF will gather requirements from its members and the wider digital preservation community. The OPF aims to establish and oversee a self-sustaining community around JHOVE that will take these requirements forward, carrying out roadmapping exercises for future development and maintenance. The OPF will also assess the need for specific training and support material for JHOVE such as documentation and online or virtual machine demonstrators.

It’s great to know that JHOVE still has a future a decade after its birth, but what boggles my mind is the next sentence:

The transfer of JHOVE is supported by its creators and developers: Harvard Library, Portico, the California Digital Library, and Gary McGath.

I never expected to see my name in a list like that!

Honda MP3 player defect

Recently Eyal Mozes hired me to determine why the sound system in his new Honda Civic wouldn’t play some MP3 files. This was a chance to do some interesting investigative work, and I’ve found what I think is a previously unidentified product defect.

He sent me twenty MP3 files, ten of which would play on his system and ten of which wouldn’t. First I ran some preliminary tests, establishing that iTunes, QuickTime Player, Audacity, and even my older Honda stereo had no trouble with the files. Then I ran Exiftool on them and looked at the output to see what the difference was.

The first thing I looked for was variable bitrate encoding, which is the most common cause of failure to play MP3 files. None of them used a variable bitrate. Looking more closely, I saw that all the files had both ID3 V1 and V2 metadata. This is legitimate. In the files he’d indicated as non-playable, though, the length of the ID3 V2 segment was zero. This was true in all the non-playable files, while all the playable ones had ID3 V2 with some data fields. I verified with a hex dump that the start of the files was a ten-byte empty ID3 V2.3 segment.

I continued looking for any other systematic differences, but that was the only one I found. It’s highly likely that the MP3 software in Eyal’s car — and, therefore, in many Hondas and maybe even other makes — has a bug that makes a file fail to play if there’s a zero-length ID3 V2 segment. (Update: Just to be clear, this is a legitimate if unusual case, not a violation of the format.)

Eyal had gone to arbitration to get his vehicle returned under the warranty; Honda’s response was unimpressive. Initially, he told me, Honda claimed that the non-playing files were under DRM. This is nonsense; there’s no such thing as DRM on MP3 files. They withdrew this claim but then asserted that “compatibility issues” related to encoding were the problem, without giving any specifics. The ID3 header in an MP3 file is unrelated to the encoding of the file, and I didn’t see any systematic differences in encoding parameters between the playable and non-playable files. Honda claimed to be unable to tell how the files were encoded. They may not have been able to tell what software was used, but the only “how” that’s relevant is the encoding parameters.

This problem won’t make your brakes fail or your wheels fall off, but Honda should still treat it as a product defect, come up with a fix, and offer it to customers for free. If they can upgrade the firmware, that’s great; if not, they’ll have to issue replacement units. The bug sounds like one that’s easy to fix once the programmers are aware of it. The testing just wasn’t thorough enough to catch this case.

If anyone wants to hire me for more file format forensic work, let me know. This was fun to investigate.

A new home for JHOVE

Over a decade ago, the Harvard University Libraries took me on as a contractor to start work on JHOVE. Later I became an employee, and JHOVE formed an important part of my work. When I left Harvard, I asked for continued “custody” of JHOVE so I could keep maintaining it, and got it. Over time it became less of a priority for me; there’s only so much time you can devote to something when no one’s paying you to do it.

After a long period of discussion, the Open Preservation Foundation (formerly the Open Planets Foundation) has taken up support of JHOVE. In addition to picking up the open source software, it’s resolved copyright issues in the documentation with Harvard, really over boilerplate that no one intended to enforced, but still an issue that had to be cleared.

Stephen Abrams, who was the real father of JHOVE, said, “We’re very pleased to see this transfer of stewardship responsibility for JHOVE to the OPF. It will ensure the continuity of maintenance, enhancement, and availability between the original JHOVE system and its successor JHOVE2, both key infrastructural components in wide use throughout the digital library community.”

JHOVE2 was originally supposed to be the successor to JHOVE, but it didn’t get enough funding to cover all the formats that JHOVE covers, so both are used, and the confusion of names is unfortunate. OPF has both in its portfolio. It doesn’t appear to have forked JHOVE to its Github repository yet, but I’m sure that’s coming soon.

My own Github repository for JHOVE should now be considered archival. Go forth and prosper, JHOVE.

Posted in News. Tags: , , , , . Comments Off on A new home for JHOVE

Pono’s file format

I’ve been seeing weirdly intense hostility to the Pono music player and service. A Business Insider article implies that it’s a scheme by Apple to make you buy your music all over again at higher prices. Another article complains that it will hold “only” 1,872 tracks and protests that “the Average person” (their capitalization) doesn’t hear any improvement. I wonder if some of these people are outraged because they’re confusing Pono with Bono and thinking this is the new copy-proof file format which he said Apple is working on.

In fact, Pono isn’t using any new format and isn’t introducing DRM. Its files are in the well-known FLAC format. FLAC stands for “Free Lossless Audio Codec.” The term technically refers only to the codec, not the container, but it’s usually delivered in a “Native FLAC” container. It can also be delivered in an Ogg container, providing better metadata support and slightly larger files.

The “lossless” part of the name refers to FLAC’s compression. MP3 uses lossy compression, which removes some information, sacrificing a little audio quality to make the file smaller. FLAC delivers larger files, giving better quality and a larger file size for the same sampling rate and bit resolution. According to CNET, “Pono’s recordings will range from CD-quality 16-bit/44.1kHz to 24-bit/192kHz “ultra-high resolution.” 96 kilohertz (dividing 192 by 2 per the Nyquist theorem) is way beyond the threshold of human hearing, so it’s understandable that people are skeptical about whether it offers any benefit over a lower sampling rate. Frequencies that high are normally filtered out.

FLAC is non-proprietary and DRM-free, and it has an open source reference implementation. Someone could put FLAC into a DRM container, but then why not use a proprietary encoding? Using FLAC is a step forward from the patent-encumbered MP3, with license requirements that effectively lock out free software.

iTunes doesn’t support FLAC files, so the Business Insider claim that Pono is Apple’s way of making you buy music over again is idiotic. It’s like saying Windows 8 is an Apple scheme to make you buy new software.

As the number of gigabytes you can stick in your pocket keeps growing, the need for compression decreases. For many people, amount of music storage takes priority over improved sound quality, but some will pay for a high-end player that gives them the best sound possible. I don’t get why this infuriates so many critics. At any rate, the file format shouldn’t scare anyone.

For more discussion of FLAC as it relates to Pono, see “What is FLAC? The high-def MP3 explained” on CNET’s site; the headline is totally wrong, but the article itself is good.

Posted in commentary. Tags: , , , . Comments Off on Pono’s file format

Article on PDF/A validation with JHOVE

An article by Yvonne Friese does a good job of explaining the limitations of JHOVE in validating PDF/A. At the time that I wrote JHOVE, I wasn’t aware how few people had managed to write a PDF validator independent of Adobe’s code base; if I’d known, I might have been more intimidated. It’s a complex job, and adding PDF/A validation as an afterthought added to the problems. JHOVE validates only the file structure, not the content streams, so it can miss errors that make a file unusable. Finally, I’ve never updated JHOVE to PDF 1.7, so it doesn’t address PDF/A-2 or 3.

I do find the article flattering; it’s nice to know that even after all these years, “many memory institutions use JHOVE’s PDF module on a daily basis for digital long term archiving.” The Open Preservation Foundation is picking up JHOVE, and perhaps it will provide some badly needed updates.

Posted in commentary. Tags: , , , . Comments Off on Article on PDF/A validation with JHOVE

The misuses of HTML frames

HTML framesets have some good uses, such as including third-party content. They also have misuses, such as disguising third-party involvement.

Recently I needed to set up domain forwarding for a subdomain registered with Godaddy. (The choice of registrar wasn’t my fault.) A couple of options were available, including one that claimed to guarantee that the subdomain would persist through navigation in the address bar. That sounded like a good thing, so I picked it.

At first it seemed to work fine; but when I tried to use the URL of an image on the site, there were weird errors. I soon found out what was going on: Godaddy was wrapping every page referenced by the subdomain in a frameset! This looks like a duck and clicks like a duck, but it isn’t one, and anything that tries to treat HTML as a JPEG file isn’t going to work very well.

Stack Overflow has several reports of people being bitten by this:

Frame wrapping is a good-enough solution for some cases, but when you aren’t told it’s happening, that’s a seriously wrong way to do it. It’s also a security concern, since your domain points at an IP address that you don’t control, and only indirectly at your own site.

This is a blog on file formats, not on irresponsible domain registrars, so the moral here is to realize that framesets aren’t a completely transparent way to provide third-party content. It’s fine to use them, but only if you’re aware that the frameset host and the frame provider are active partners.

Posted in commentary. Tags: , . 1 Comment »

Open Planets Foundation is now Open Preservation Foundation

The Open Planets Foundation is now the Open Preservation Foundation. This name change reflects its function; the old name grew out of the Planets project and never really made sense.

For the present, it’s still found on the Internet as openplanetsfoundation.org.

Posted in News. Tags: , , . Comments Off on Open Planets Foundation is now Open Preservation Foundation

The return of music DRM?

U2, already the most hated band in the world thanks to its invading millions of iOS devices with unsolicited files, isn’t stopping. An article on Time‘s website tells us, in vague terms, that

Bono, Edge, Adam Clayton and Larry Mullen Jr believe so strongly that artists should be compensated for their work that they have embarked on a secret project with Apple to try to make that happen, no easy task when free-to-access music is everywhere (no) thanks to piracy and legitimate websites such as YouTube. Bono tells TIME he hopes that a new digital music format in the works will prove so irresistibly exciting to music fans that it will tempt them again into buying music—whole albums as well as individual tracks.

It’s hard to read this as anything but an attempt to bring digital rights management (DRM) back to online music distribution. Users emphatically rejected it years ago, and Apple was among the first to drop it. You haven’t really “bought” anything with DRM on it; you’ve merely leased it for as long as the vendor chooses to support it. People will continue to break DRM, if only to avoid the risk of loss. The illegal copies will offer greater value than legal ones.

It would be nice to think that what U2 and Apple really mean is just that the new format will offer so much better quality that people will gladly pay for it, but that’s unlikely. Higher-quality formats such as AAC have been around for a long time, and they haven’t pushed the old standby MP3 out of the picture. Existing levels of quality are good enough for most buyers, and vendors know it.

Time implies that YouTube doesn’t compensate artists for their work. This is false. They often don’t bother with small independent musicians, though they will if they’re reminded hard enough (as Heather Dale found out), but it’s hard to believe that groups with powerful lawyers, such as U2, aren’t being compensated for every view.

DRM and force-feeding of albums are two sides of the same coin of vendor control over our choices. This new move shouldn’t be a surprise.

Posted in commentary, News. Tags: , , . Comments Off on The return of music DRM?
Follow

Get every new post delivered to your Inbox.

Join 62 other followers