PDF 2.0

As most people who read this blog know, the development of PDF didn’t end with the ISO 32000 (aka PDF 1.7) specification. Adobe has published three extensions to the specification. These aren’t called PDF 1.8, but they amount to a post-ISO version.

The ISO TC 171/SC 2 technical committee is working on what will be called PDF 2.0. The jump in major revision number reflects a change in how releases are being managed but doesn’t seem to portend huge changes in the format. PDF is no longer just an Adobe product, though the company is still heavily involved in the spec’s continued development. According to a PDF Association article, the biggest task right now is removing ambiguities. The specification’s language will shift from describing conforming readers and writers to describing a valid file, which certainly sounds like an improvement. The article mentions that several sections have been completely rewritten and reorganized, and that their chapter numbers have all been incremented by 4 over the PDF 1.7 specification. We can only wonder what the four new chapters are.

Leonard Rosenthol gave a presentation on PDF 2.0 in 2013.

As with many complicated projects, PDF 2.0 has fallen behind its original schedule, which called for publication in 2013. The current target for publication is the middle of 2016.

veraPDF validator

The veraPDF Consortium has announced a public prototype of its PDF validation software.

It’s ultimately intended to be “the definitive open source, file-format validator for all parts and conformance levels of ISO 19005 (PDF/A)”; however, it’s “currently more a proof of concept than a usable file format validator.”

New developments in JPEG

A report from the 69th meeting of the JPEG Committee, held in Warsaw in June, mentions several recent initiatives. The descriptions have a rather high buzzword-to-content ratio, but here’s my best interpretation of what they mean. What’s usually called “JPEG” is just one of several file formats supported by the Joint Photographic Experts Group; JFIF would be a more precise name. Not every format name that starts with JPEG refers to “JPEG” files, but when I refer to JPEG without further qualification here, I mean the familiar format.

JPEG XS is described as a “low-latency lightweight image coding system.” It appears to be intended for transport and buffering purposes, such as sending images to a display device, rather than long-term file storage.

JPEG PLENO deals with “new imaging modalities such as … light-field, point-cloud and holographic imaging.” These are all approaches to 3-D imaging.

JPEG Privacy & Security deals with restricting access and maintaining data integrity. The idea seems to be to let images be displayed only by an authorized host “while maintaining backwards and forward compatibility to existing JPEG legacy solutions.” A Techdirt article has more details, describing the proposal as “adding DRM to images.” I don’t know how forward compatibility can work, since older software won’t have any way to distinguish authorized from unauthorized use.

JPEG XT is a set of extensions to JPEG for various purposes.

JPEG XR is a different image file format from JPEG. It grew out of Microsoft’s Windows Media Photo, aka HD Photo, and has advantages over JPEG but hasn’t achieved as much acceptance.

JPEG 2000 has been around for a while; it’s unrelated to the JPEG format and is rather complicated. It’s suitable for very large images stored at multiple resolutions, with the ability to present areas of interest in greater detail than the rest. It’s been hindered by a lack of good implementations. The announcement says the committee has approved a new version of the OpenJPEG library, which is the reference implementation. When I last worked with OpenJPEG, its performance wasn’t very good; I hope this version improves matters.

File identification tools, part 9: JHOVE2

The story of JHOVE2 is a rather sad one, but I need to include it in this series. As the name suggests, it was supposed to be the next generation of JHOVE. Stephen Abrams, the creator of JHOVE (I only implemented the code), was still at Harvard, and so was I. I would have enjoyed working on it, getting things right that the first version got wrong. However, Stephen accepted a position with the California Digital Library (CDL), and that put an end to Harvard’s participation in the project. I thought about applying for a position in California but decided I didn’t want to move west. I was on the advisory board but didn’t really do much, and I had no involvement in the programming. I’m not saying I could have written JHOVE2 better, just explaining my relationship to the project.

The institutions that did work on it were CDL, Portico, and Stanford University. There were two problems with the project. The big one was insufficient funding; the money ran out before JHOVE2 could boast a set of modules comparable to JHOVE. A secondary problem was usability. It’s complex and difficult to work with. I think if I’d been working on the project, I could have helped to mitigate this. I did, after all, add a GUI to JHOVE when Stephen wasn’t looking.

JHOVE has some problems that need fixing. It quits its analysis on the first error. It’s unforgiving on identification; a TIFF file with a validation error simply isn’t a TIFF file, as far as it’s concerned. Its architecture doesn’t readily accommodate multi-file documents. It deals with embedded formats only on a special-case basis (e.g., Exif metadata in non-TIFF files). Its profile identification is an afterthought. JHOVE2 provided better ways to deal with these issues. The developers wrote it from scratch, and it didn’t aim for any kind of compatibility with JHOVE.

JHOVE2 is available as open-source software under the BSD license. The source code is on Bitbucket. Version 2.1.0 requires Java 6 or higher and, if the SGML module is used, the OpenSP SGML parser. It supports the ARC, GZIP, ICC color profile, SGML, Shapefile, TIFF, UTF-8, WARC, WAVE, and XML formats. NetCDF has “third-party development underway.” Three of these formats (ARC, GZIP, and WARC) are package formats for holding other files, taking advantage of JHOVE2’s design for processing nested content. The Shapefile module is an example of processing multi-file documents. There’s also an “Identifier” module which runs the DROID 6 identifier. PDF was on the schedule but still isn’t supported. (PDF is tough.)

The user guide gives an idea of how difficult JHOVE2 is to use. The installation section is over eleven pages long, and the configuration section runs eight and a half pages. The assessment rule feature is powerful, but the rule language is complex. Getting JHOVE2 to work in a production environment takes a serious commitment.

It’s not clear how widely used JHOVE2 is. I haven’t heard from any libraries or archives that incorporate it into their production workflows. A query on Twitter resulted in several retweets but no responses. With a few more modules and some work on ease of use, it might have eclipsed JHOVE, as it should have.

Update: The Bibliothèque Nationale de France mentions using JHOVE2 for characterizing Internet archive files.

Next: TBA. To read this series from the beginning, start here.

Redecoration in progress

I’m working on some changes in the appearance of this blog. You may see weird things while this is happening. Sorry for the inconvenience.

TIFF/A

TIFF has been around for a long time. Its latest official specification, TIFF 6.0, dates from 1992. The format hasn’t held still for 23 years, though. Adobe has issued several “technical notes” describing important changes and clarifications. Software developers, by general consensus, have ignored the requirement that value offsets have to be on a word boundary, since it’s a pointless restriction with modern computers. Private tags are allowed, and lots of different sources have defined new tags. Some of them have achieved wide acceptance, such as the TIFFTAG_ICCPROFILE tag (34675), which fills the need to associate ICC color profiles with images. Many applications use the EXIF tag set to specify metadata, but this isn’t part of the “standard” either.

In other words, TIFF today is the sum of a lot of unwritten rules.

It’s generally not too hard to deal with the chaos and produce files that all well-known modern applications can handle. On the other hand, it’s easy to produce a perfectly legal TIFF file that only your own custom application will handle as you intended. People putting files into archives need some confidence in their viability. Assumptions which are popular today might shift over a decade or two. Variations in metadata conventions might cause problems.

A restricted subset of PDF, called PDF/A, specifies rules that help to guarantee the long-term readability of PDF files that follow them. A group of academic archivists has begun work on an initiative to do the same for TIFF, calling it TIFF/A by analogy. It’s supported by the PREFORMA project.

So far it’s still in the stage of gathering support. The site gives September 1, 2015 as the date to kick off discussions and March 1, 2016 as the target for an ISO submission. A white paper by Peter Fornaro and Lukas Rosenthaler at the University of Basel discusses the technical issues. It’s obviously just a first shot at the problem. At one point it states that “it is obvious that the TIFF-extensions [anything which isn’t baseline TIFF] should not be used for long term archival,” but then it admits the ICC profile tag (in fact, says it’s mandatory for color images) and the EXIF tag, and allows the IPTC metadata tag though it doesn’t recommend it.

The white paper doesn’t address the word alignment issue. This is something the TIFF/A consortium needs to take a stand on; either it should repudiate the word alignment requirement, or it should affirm it. If it sticks to strict conformance to the spec, a great many files won’t qualify.

TIFF/A presents a different set of challenges from PDF/A. On the one hand, TIFF is a vastly simpler format than PDF. (Trust me; I’ve written validators for both.) On the other hand, PDF is an ISO standard that hasn’t experienced two decades of entropy. I wish the people involved every success.

File identification tools, part 7: Apache Tika

Apache Tika is a Java-based open-source toolkit for identifying files and extracting metadata and text content. I don’t have much personal experience with it, apart from having used it with FITS. The Apache Software Foundation actively maintains it; version 1.9 came out on June 23, 2015. It can identify a wide range of formats and report metadata from a smaller but still impressive set. You can use Tika as a command-line utility, a GUI application, or a Java library. You can find its source code on GitHub, or you can get its many components from the Maven Repository.

Tika isn’t designed to validate files. If it encounters a broken file, it won’t tell you much about how it violates the format’s expectations.

Originally it was a subproject of Lucene; it became a standalone project in 2010. It builds on existing parser libraries for various formats where possible. For some formats it uses its own libraries because nothing suitable was available. In most cases it relies on signatures or “magic numbers” to identify formats. While it identifies lots of formats, it doesn’t distinguish variants in as much detail as some other tools, such as DROID. Andy Jackson has written a document that sheds light on the comparative strengths of Tika and DROID. Developers can add their own plugins for unsupported formats. Solr and Lucene have built-in Tika integration.

Prior to version 1.9, Tika didn’t have support for batch processing. Version 1.9 has a tika-batch module, which is described in the change notes as “experimental.”

The book Tika in Action is available as an e-book (apparently DRM free, though it doesn’t specifically say so) or in a print edition. Anyone interested in using its API or building it should look at the detailed tutorial on tutorialspoint.com. The Tika facade serves basic uses of the API; more adventurous programmers can use the lower-level classes.
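As a quick illustration, here’s a minimal sketch of the facade in use. Tika, detect(), and parseToString() are the basic facade API; the file path is just a placeholder.

```java
import java.io.File;
import org.apache.tika.Tika;

public class TikaFacadeDemo {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();                        // the high-level facade
        File file = new File(args[0]);                 // file to examine
        System.out.println(tika.detect(file));        // media type, e.g. "image/tiff"
        System.out.println(tika.parseToString(file)); // extracted text content
    }
}
```

detect() works from the file’s magic numbers and name, and parseToString() picks an appropriate parser behind the scenes, which is exactly the convenience the facade exists to provide.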

Next: NLNZ Metadata Extraction Tool. To read this series from the beginning, start here.

File identification tools, part 6: FITS

FITS is the File Information Tool Set, a “Swiss army knife” aggregating results from several file identification tools. The Harvard University Libraries created it, and though it was technically open-source from the beginning, it wasn’t very convenient for anyone outside Harvard at first. Other institutions showed interest, its code base moved from Google Code to GitHub, and now it’s used by a number of digital repositories to identify and validate ingested documents. Don’t confuse it with the FITS (Flexible Image Transport System) data format.

It’s a Java-based application requiring Java 7 or higher. Documentation is found on Harvard’s website. It wraps Apache Tika, DROID, ExifTool, FFIdent, JHOVE, the National Library of New Zealand Metadata Extractor, and four Harvard native tools. Work is currently under way to add the MediaInfo tool to enhance video file support. It’s released as open source software under the GNU LGPL license. The release dates show there’s been a burst of activity lately, so make sure you have the latest and best version.

FITS is tailored for ingesting files into a repository. In its normal mode of operation, it processes whole directories, including all nested subdirectories, and produces a single XML output file, which can be in either the FITS schema or other standard schemas such as MIX. You can run it as a standalone application or as a library. It’s possible to add your own tools to FITS.
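For the library case, here’s a rough sketch based on the classes the FITS documentation shows (Fits and FitsOutput); exact signatures may differ between versions, so treat this as illustrative rather than definitive.

```java
import java.io.File;
import edu.harvard.hul.ois.fits.Fits;
import edu.harvard.hul.ois.fits.FitsOutput;

public class FitsLibraryDemo {
    public static void main(String[] args) throws Exception {
        // The no-argument constructor picks up the configuration in FITS_HOME.
        Fits fits = new Fits();
        // Run all configured tools against one file and merge their results.
        FitsOutput result = fits.examine(new File(args[0]));
        // Write the consolidated report as FITS XML.
        result.saveToDisk(args[1]);
    }
}
```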

You run FITS from a command file, fits.bat on Windows and fits.sh on Unix/Linux systems, including the Mac. The user manual provides full information.
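A typical invocation looks something like this, assuming the documented -i (input file or directory) and -o (output file) options; the paths are made up:

```
./fits.sh -i scans/page001.tif -o page001-fits.xml
```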

You configure FITS with the file xml/fits.xml. It lets you select which tools to use and which file extensions each one will process. The <tool> element defines a tool to be used; its class attribute identifies the tool’s main class. If you want a tool to run only on files with certain extensions, specify the include-exts attribute with a comma-separated list of extensions, not including the period. To run it on all extensions except certain ones, specify the exclude-exts attribute with a comma-separated list of excluded extensions. The <output> element is trickier to deal with, and you shouldn’t mess with the <process> element unless you really need to diddle with performance.
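Here’s a hypothetical fragment of xml/fits.xml showing those attributes; the tool class names are illustrative, not necessarily the ones in a current release.

```xml
<!-- Run JHOVE only on TIFF and PDF files. -->
<tool class="edu.harvard.hul.ois.fits.tools.jhove.Jhove"
      include-exts="tif,tiff,pdf" />
<!-- Run ExifTool on everything except plain text files. -->
<tool class="edu.harvard.hul.ois.fits.tools.exiftool.Exiftool"
      exclude-exts="txt" />
```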

FITS runs ExifTool as a separate process, since ExifTool is a Perl program. If your system doesn’t support Perl, ExifTool won’t run but everything else will still work.

I didn’t work directly on FITS when I was at Harvard, aside from my work on JHOVE, but in 2013 I traveled to the University of Leeds, where I joined some others in demonstrating ways FITS could be improved. This led to a SPRUCE grant to implement the proposed improvements, and parts of that work were incorporated into the main line of the application.

The last time I checked, FITS was using an old version of JHOVE because of compatibility issues. I don’t know whether this has been updated.

Next: Apache Tika. To read this series from the beginning, start here.

File identification tools, part 5: JHOVE

In 2004, the Harvard University Libraries engaged me as a contractor to write the code for JHOVE under Stephen Abrams’ direction. I stayed around as an employee for eight more years. I mention this because I might be biased about JHOVE: I know about its bugs, how hard it is to install, what design decisions could have been better, and how spotty my support for it has been. Still, people keep downloading it, using it, and saying good things about it, so I must have done something right. Do any programmers trust the code they wrote ten years ago?

The current home of JHOVE is on GitHub under the Open Preservation Foundation, which has taken over maintenance of it from me. Documentation is on the OPF website. I urge people not to download it from SourceForge; it’s out of date there, and there have been reports of questionable practices by SourceForge’s current management. The latest version as of this writing is 1.11.

JHOVE stands for “JSTOR/Harvard Object Validation Environment,” though neither JSTOR nor Harvard is directly involved with it any longer. It identifies and validates files in a small set of formats, so it’s not a general-purpose identification tool, but it does a fairly thorough job on the formats it knows. Those formats are AIFF, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF, WAV, XML, ASCII, and UTF-8. If it doesn’t recognize a file as any of those formats, it will call it a “Bytestream.” You can use JHOVE as a GUI or command-line application, or as a Java library. If you’re going to use the library or otherwise do complicated things, I recommend downloading my payment-optional e-book, JHOVE Tips for Developers. Installation and configuration are tricky, so follow the instructions carefully and take your time.
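Once everything is set up, a command-line run looks something like this: -m selects a module, -h an output handler, and -o the report file. The file names here are placeholders.

```
jhove -m TIFF-hul -h XML -o report.xml scan001.tif
```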

JHOVE shouldn’t be confused with JHOVE2, which has similar aims to JHOVE but has a completely different code base, API, and user interface. It didn’t get as much funding as its creators hoped, so it doesn’t cover all the formats that JHOVE does.

Key concepts in JHOVE are “well-formed” and “valid.” When allowed to run all modules, it will always report a file is a valid instance of something; it’s a valid bytestream if it’s not anything else. This has confused some people; a valid bytestream is nothing more than a sequence of zero or more bytes. Everything is a valid bytestream.

The concept of well-formed and valid files comes from XML. A well-formed XML file obeys the syntactic rules; a valid one conforms to a schema or DTD. JHOVE applies this concept to other formats, but it’s generally not as good a fit. Roughly, a file which is “well-formed but not valid” has errors, but not ones that should prevent rendering.
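A tiny example of my own makes the distinction concrete. This XML document is well-formed (the syntax is fine) but not valid (it breaks its own DTD):

```xml
<!DOCTYPE doc [
  <!ELEMENT doc (title)>
  <!ELEMENT title (#PCDATA)>
]>
<!-- Well-formed: tags are properly nested and closed.
     Invalid: the DTD requires <doc> to contain a <title>. -->
<doc>
  <subtitle>Wrong element for this DTD</subtitle>
</doc>
```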

JHOVE doesn’t examine all aspects of a file. It doesn’t examine data streams within files or deal with encryption. It focuses on a file’s structure and properties rather than its content. However, it’s very aggressive in what it does examine, so it will sometimes declare a file not valid even though nearly all rendering software will process it correctly. If there’s a conflict between the format specification and generally accepted practice, it usually goes by the specification.

It checks for profiles within a format, such as PDF/A and TIFF/IT. It reports only full conformance to a profile, so if a file is intended to be PDF/A but fails any of the profile’s tests, JHOVE will simply not list PDF/A as a profile. It won’t tell you why the file fell short.

The PDF module has been the biggest adventure; PDF is really complicated, and its complexity has increased with each release. Bugs continue to turn up, and it covers PDF only through version 1.6. It needs to be updated for 1.7, which is equivalent to ISO 32000.

Sorry, I warned you that I’m JHOVE’s toughest critic. But I wouldn’t mind a chance to improve it a bit, through the funding mechanism I mentioned earlier in the blog.

Next: FITS. To read this series from the beginning, start here.