Category Archives: Tutorial

Preserving Google+ content

I recently got an email reminding me that my Google+ account will go away on April 2. My first reaction was a yawn. Google has made the service steadily less attractive over the years. I just checked my feed for the first time in months, and it consists entirely of posts by people I don’t follow, on topics I don’t care about. Posts from this blog and my writing blog get links automatically posted to Google+, but otherwise I haven’t posted in a long time.

One of my posts got two comments from people I know, so it’s not totally dead, but it’s close. Google made the service as unattractive as they could. Posts by strangers keep showing up. Comments appear and disappear as you’re trying to read them. But there was a time when Google+ was somewhat useful. You might have material there which you want to save. Fortunately, Google provides a way to do this.
Continue reading

A screen capture tip using Grab on the Mac

MacOS provides a few different ways to do screen captures. My personal favorite is Grab, which is found in the Applications/Utilities folder. It lets me capture a selection, a window, or the whole screen without having to remember any magic key combinations. I keep it in the Dock for quick access.

Grab has one deficiency, though. It can save screenshots only as TIFF files. If Apple had to pick just one format, that’s hardly the most useful one. But there’s an easy workaround.

After you’ve got your screen shot, press Command-C or choose “Copy” from the Edit menu. Open the Preview application. Press Command-N or select “New from clipboard” from the File menu. You now have the screenshot in Preview.

In Preview, press Command-S or choose “Save…” from the File menu. You’ll get a dialog to save the file, with a choice of formats: JPEG, JPEG2000, OpenEXR, PDF, PNG, or TIFF. Pick whichever one you like. If you’re going to put the image into a Web page, PNG is usually the best choice. Preview will remember your choice for next time. Then save the file.

If you prefer, you can do the equivalent in Photoshop, Gimp, or any other image-processing application, but Preview has the advantage of launching quickly and keeping the process simple.

That’s it. You can now use Grab to save screenshots to a Web-friendly format.

File identification tools, part 10: Siegfried

“Do we really need another PRONOM-based file format identification tool?” That’s what Richard Lehane asked rhetorically last year on the Open Preservation Foundation blog. It was obviously rhetorical, since he’d gone ahead and done just that with a new tool called Siegfried. Siegfried recently turned up in some tweets by Ross Spencer, so it’s worth a mention here.
Continue reading

Video

Video: A short history of graphic file formats

My video for today briefly covers graphic file formats from the forties to the present. I made some interesting discoveries along the way, especially Laposky’s CRT “Oscillons.”
Continue reading

Video

Understanding JPEG DCT

Some of us know that JPEG uses the Discrete Cosine Transform (DCT) to do lossy compression. A few people have more than the vaguest idea of what that means, and I don’t claim to be one of them. Today, though, I found a video on YouTube which gets me closer, even though I’m not very inclined toward advanced mathematics. It’s included in my file formats playlist on YouTube.
Continue reading

File identification tools, part 9: JHOVE2

The story of JHOVE2 is a rather sad one, but I need to include it in this series. As the name suggests, it was supposed to be the next generation of JHOVE. Stephen Abrams, the creator of JHOVE (I only implemented the code), was still at Harvard, and so was I. I would have enjoyed working on it, getting things right that the first version got wrong. However, Stephen accepted a position with the California Digital Library (CDL), and that put an end to Harvard’s participation in the project. I thought about applying for a position in California but decided I didn’t want to move west. I was on the advisory board but didn’t really do much, and I had no involvement in the programming. I’m not saying I could have written JHOVE2 better, just explaining my relationship to the project. JHOVE2 logo

The institutions that did work on it were CDL, Portico, and Stanford University. There were two problems with the project. The big one was insufficient funding; the money ran out before JHOVE2 could boast a set of modules comparable to JHOVE. A secondary problem was usability. It’s complex and difficult to work with. I think if I’d been working on the project, I could have helped to mitigate this. I did, after all, add a GUI to JHOVE when Stephen wasn’t looking.

JHOVE has some problems that needed fixing. It quits its analysis on the first error. It’s unforgiving on identification; a TIFF file with a validation error simply isn’t a TIFF file, as far as it’s concerned. Its architecture doesn’t readily accommodate multi-file documents. It deals with embedded formats only on a special-case basis (e.g., Exif metadata in non-TIFF files). Its profile identification is an afterthought. JHOVE2 provided better ways to deal with these issues. The developers wrote it from scratch, and it didn’t aim for any kind of compatibility with JHOVE.
Continue reading

File identification tools, part 7: Apache Tika

Apache Tika is a Java-based open source toolkit for identifying files and extracting metadata and text content. I don’t have much personal experience with it, apart from having used it with FITS. Apache Software Foundation is actively maintaining it, and version 1.9 just came out on June 23, 2015. It can identify a wide range of formats and report metadata from a smaller but still impressive set. You can use Tika as a command line utility, a GUI application, or a Java library. You can find its source code on GitHub, or you can get its many components from the Maven Repository.

Tika isn’t designed to validate files. If it encounters a broken file, it won’t tell you much about how it violates the format’s expectations.

Originally it was a subproject of Lucene; it became a standalone project in 2010. It builds on existing parser libraries for various formats where possible. For some formats it uses its own libraries because nothing suitable was available. In most cases it relies on signatures or “magic numbers” to identify formats. While it identifies lots of formats, it doesn’t distinguish variants in as much detail as some other tools, such as DROID. Andy Jackson has written a document that sheds light on the comparative strengths of Tika and DROID. Developers can add their own plugins for unsupported formats. Solr and Lucene have built-in Tika integration.

Prior to version 1.9, Tika didn’t have support for batch processing. Version 1.9 has a tika-batch module, which is described in the change notes as “experimental.”

The book Tika in Action is available as an e-book (apparently DRM free, though it doesn’t specifically say so) or in a print edition. Anyone interested in using its API or building it should look at the detailed tutorial on tutorialspoint.com. The Tika facade serves basic uses of the API; more adventurous programmers can use the lower-level classes.

Next: NLNZ Metadata Extraction Tool. To read this series from the beginning, start here.

File identification tools, part 6: FITS

FITS is the File Information Tool Set, a “Swiss army knife” aggregating results from several file identification tools. The Harvard University Libraries created it, and though it was technically open-source from the beginning, it wasn’t very convenient for anyone outside Harvard at first. Other institutions showed interest, its code base moved from Google Code to GitHub, and now it’s used by a number of digital repositories to identify and validate ingested documents. Don’t confuse it with the FITS (Flexible Image Transport System) data format.

It’s a Java-based application requiring Java 7 or higher. Documentation is found on Harvard’s website. It wraps Apache Tika, DROID, ExifTool, FFIdent, JHOVE, the National Library of New Zealand Metadata Extractor, and four Harvard native tools. Work is currently under way to add the MediaInfo tool to enhance video file support. It’s released as open source software under the GNU LGPL license. The release dates show there’s been a burst of activity lately, so make sure you have the latest and best version.

FITS is tailored for ingesting files into a repository. In its normal mode of operation, it processes whole directories, including all nested subdirectories, and produces a single XML output file, which can be in either the FITS schema or other standard schemas such as MIX. You can run it as a standalone application or as a library. It’s possible to add your own tools to FITS.
Continue reading

File identification tools, part 5: JHOVE

In 2004, the Harvard University Libraries engaged me as a contractor to write the code for JHOVE under Stephen Abrams’ direction. I stayed around as an employee for eight more years. I mention this because I might be biased about JHOVE: I know about its bugs, how hard it is to install, what design decisions could have been better, and how spotty my support for it has been. Still, people keep downloading it, using it, and saying good things about it, so I must have done something right. Do any programmers trust the code they wrote ten years ago?

The current home of JHOVE is on GitHub under the Open Preservation Foundation, which has taken over maintenance of it from me. Documentation is on the OPF website. I urge people not to download it from SourceForge; it’s out of date there, and there have been reports of questionable practices by SourceForge’s current management. The latest version as of this writing is 1.11.

JHOVE stands for “JSTOR/Harvard Object Validation Environment,” though neither JSTOR nor Harvard is directly involved with it any longer. It identifies and validates files in a small set of formats, so it’s not a general-purpose identification tool, but does a fairly thorough job. The formats it validates are AIFF, GIF, HTML, JPEG, JPEG2000, PDF, TIFF, WAV, XML, ASCII, and UTF-8. If it doesn’t recognize a file as any of those formats, it will call it a “Bytestream.” You can use JHOVE as a GUI or command line application, or as a Java library. If you’re going to use the library or otherwise do complicated things, I recommend downloading my payment-optional e-book, JHOVE Tips for Developers. Installation and configuration are tricky, so follow the instructions carefully and take your time.

JHOVE shouldn’t be confused with JHOVE2, which has similar aims to JHOVE but has a completely different code base, API, and user interface. It didn’t get as much funding as its creators hoped, so it doesn’t cover all the formats that JHOVE does.

Key concepts in JHOVE are “well-formed” and “valid.” When allowed to run all modules, it will always report a file is a valid instance of something; it’s a valid bytestream if it’s not anything else. This has confused some people; a valid bytestream is nothing more than a sequence of zero or more bytes. Everything is a valid bytestream.

The concept of well-formed and valid files comes from XML. A well-formed XML file obeys the syntactic rules; a valid one conforms to a schema or DTD. JHOVE applies this concept to other formats, but it’s generally not as good a fit. Roughly, a file which is “well-formed but not valid” has errors, but not ones that should prevent rendering.

JHOVE doesn’t examine all aspects of a file. It doesn’t examine data streams within files or deal with encryption. It focuses on the semantics of a file rather than its content. However, it’s very aggressive in what it does examine, so that sometimes it will declare a file not valid when nearly all rendering software will process it correctly. If there’s a conflict between the format specification and generally accepted practice, it usually goes by the specification.

It checks for profiles within a format, such as PDF/A and TIFF/IT. It only reports full conformance to a profile, so if a file is intended to be TIFF/A but fails any tests for the profile, JHOVE will simply not list PDF/A as a profile. It won’t tell you why it fell short.

The PDF module has been the biggest adventure; PDF is really complicated, and its complexity has increased with each release. Bugs continue to turn up, and it covers PDF only through version 1.6. It needs to be updated for 1.7, which is equivalent to ISO 32000.

Sorry, I warned you that I’m JHOVE’s toughest critic. But I wouldn’t mind a chance to improve it a bit, through the funding mechanism I mentioned earlier in the blog.

Next: FITS. To read this series from the beginning, start here.