Canvas fingerprinting, the technical stuff

The ability of websites to bypass privacy settings with “canvas fingerprinting” has caused quite a bit of concern, and it’s become a hot topic on the Code4lib mailing list. Let’s take a quick look at it from a technical standpoint. It is genuinely disturbing, but it’s not the unstoppable form of scrutiny some people are hyping it as.

The best article to learn about it from is “Pixel Perfect: Fingerprinting Canvas in HTML5,” by Keaton Mowery and Hovav Shacham at UCSD. It describes the basic technique and some implementation details.

Canvas fingerprinting is based on the <canvas> HTML element. It’s been around for a decade but was standardized for HTML5. In itself, <canvas> does nothing but define a blank drawing area with a specified width and height. It isn’t even like the <div> element, which you can put interesting stuff inside; if all you use is unscripted HTML, all you get is some blank space. To draw anything on it, you have to use JavaScript. There are two APIs available for this: the 2D DOM Canvas API and the 3D WebGL API. The DOM API is part of the HTML5 specification; WebGL relies on hardware acceleration and is less widely supported.

Either API lets you draw objects, not just pixels, to a browser. These include geometric shapes, color gradients, and text. The details of drawing are left to the client, so they will be drawn slightly differently depending on the browser, operating system, and hardware. This wouldn’t be too exciting, except that the API can read the pixels back. The getImageData method of the 2D context returns an ImageData object, which is a pixel map. This can be serialized (e.g., as a PNG image) and sent back to the server from which the page originated. For a given set of drawing commands and hardware and software configuration, the pixels are consistent.

Drawing text is one way to use a canvas fingerprint. Modern browsers use a programmatic description of a font rather than a bitmap, so that characters will scale nicely. The fine details of how edges are smoothed and pixels interpolated will vary, perhaps not enough for any user to notice, but enough so that reading back the pixels will show a difference.

However, the technique isn’t as frightening as the worst hype suggests. First, it doesn’t uniquely identify a computer. Two machines that have the same model and come from the same shipment, if their preinstalled software hasn’t been modified, should have the same fingerprint. It has to be used together with other identifying markers to narrow down to one machine. There are several ways for software to stop it, including blocking JavaScript from offending domains and disabling part or all of the Canvas API. What gets people upset is that neither blocking cookies nor using a proxy will stop it.

Was including getImageData in the spec a mistake? This can be argued both ways. Its obvious use is to draw a complex canvas once and then rubber-stamp it if you want it to appear multiple times; this can be faster than repeatedly drawing from scratch. It’s unlikely, though, that the designers of the spec thought about its privacy implications.

Cookbooks vs. learning books

A contract lead got me to try learning more about a technology which the client needs. Working from an e-book which I already had, I soon was thinking that it’s a really confused, ad hoc library. But then I remembered having this feeling before, when the problem was really the book. I looked for websites on the software and found one that explained it much better. The e-book had a lot of errors, using JavaScript terminology incorrectly and its own terminology inconsistently.

A feeling came over me, the same horrified realization the translator of To Serve Man had: “It’s a cookbook!” It wasn’t designed to let you learn how the software works, but to get you turning out code as quickly as possible. There are too many of these books, designed for developers who think that understanding the concepts is a waste of time. Or maybe the fault belongs less to the developers than to managers who want results immediately.

A book that introduces a programming language or API needs to start with the lay of the land. What are its basic concepts? How is it different from other approaches? It has to get the terminology straight. If it has functions, objects, classes, properties, and attributes, make it clear what each one is. There should be examples from the start, so you aren’t teaching arid theory, but you need to follow up with an explanation.

If you’re writing an introduction to Java, your “Hello world” example probably has a class, a main() function, and some code to write to System.out. You should at least introduce the concepts of classes, functions, and importing. That’s not the place to give all the details; the best way to teach a new idea is to give a simple version at first, then come back in more depth later. But if all you say is “Compile and run this code, and look, you’ve got output!” then you aren’t doing your job. You need to present the basic ideas simply and clearly, promise more information later, and keep the promise.

Don’t jump into complicated boilerplate before you’ve covered the elements it’s made of. The point of the examples should be to teach the reader how to use the technology, not to provide recipes for specific problems. The problem the developer has to solve is rarely going to be the one in the book. They can tinker with the examples until they fit their own problem, not really understanding them, but that usually results in complicated, inefficient, unmaintainable code.

Expert developers “steal” code too, but we know how it works, so we can take it apart and put it back together in a way that really suits the problem. The books we can learn from are the ones that put the “how it works” first. Cookbooks are useful too, but we need them after we’ve learned the tech, not when we’re trying to figure it out.

HTML and fuzzy validity

Andy Jackson wrote an interesting post on the question of HTML validity. Only registered Typepad users can comment, so it’s easier for me to add something to the discussion here.

When I worked on JHOVE, I had to address the question of valid HTML. A few issues are straightforward; the angle brackets for tags have to be closed, and so do quoted strings. Beyond that, everything seems optional. There are no required elements in HTML, not even html, head, or body; a blank file or a plain text file with no tags can be a valid HTML document. The rules of HTML are designed to be forgiving, which just makes it harder to tell if a document is valid or not. I’ve recommended that JHOVE users not use the HTML module; it’s time-consuming and doesn’t give you much useful information.

There are things in XHTML which aren’t legal in HTML. The “self-closing” tag (<tag/>) is good XHTML, but not always legal HTML. In HTML5, <input ... /> is legal, but <span ... /> isn’t, because input doesn’t require a closing tag but span does. (In other words, it’s legal only when it’s superfluous.) However, any recent browser will accept both of them.

The set of HTML documents which are de facto acceptable and unambiguous is much bigger than the set which is de jure correct. Unfortunately, the former is a fuzzy set. How far can you push the rules before you’ve got unsafe, ambiguous HTML? It depends on which browsers and versions you’re looking at, and how strenuous your test cases are.

The problem goes beyond HTML proper. Most browsers deal with improper tag nesting, but JavaScript and CSS can raise bigger issues. These are very apt to have vendor-specific features, and they may have major rendering problems in browsers for which they weren’t tested. A document with broken JavaScript can be perfectly valid, as far as the HTML spec is concerned.

It’s common for JavaScript to be included by an external reference, often on a completely different website. These scripts may themselves have external dependencies. Following the dependency chain is a pain, but without them all the page may not work properly. I don’t have data, but my feeling is that far more web pages are broken because of bad scripts and external references than because of bad HTML syntax.

So what do you do when validating web pages? Thinking of it as “validating HTML” pulls you into a messy area without addressing some major issues. If you insist on documents that are fully compliant with the specs, you’ll probably throw out more than you accept, without any good reason. But at the same time, unless you validate the JavaScript and archive all external dependencies, you’ll accept some documents that have significant preservation issues.

It’s a mess, and I don’t think anyone has a good solution.

Update: Andy Jackson has written a post responding to this, Web Archiving in the JavaScript Age.

Posted in commentary. Comments Off

OOXML: The good and the bad

An article by Markus Feilner presents a very critical view of Microsoft’s Open Office XML as it currently stands. There are three versions of OOXML — ECMA, Transitional, and Strict. All of them use the same extensions, and there’s no easy way for the casual user to tell which variant a document is. If a Word document is created on one computer in the Strict format, then edited on another machine with an older version of Word, it may be silently downgraded to Transitional, with resulting loss of metadata or other features.

On the positive side, Microsoft has released the Open XML SDK as open source on Github. This is at least a partial answer to Feilner’s complaint that “there are no free and open source solutions that fully support OOXML.”

Incidentally, I continue to hate Microsoft’s use of the deliberately confusing term “Open XML” for OOXML.

Thanks to @willpdp for tweeting the links referenced here.

The uses and abuses of PDF

PDF is a versatile format, but that doesn’t mean it should be used for everything. It’s a visual presentation format above all else. It lets you define a document with a specific appearance, with capabilities such as form filling and text searching. It’s not very good if you want a document that adapts to different device capabilities. If you need an editable format or a way to deliver structured data, there are much better alternatives.

When the Malaysian government released satellite data from the communications of Flight 370, which had disappeared, it delivered a PDF file. It looks very nice, but anyone who wants to extract and analyze the data has to do a lot of extra work. A spreadsheet or structured text (e.g., CSV) document is the right thing.

PDF can be used for e-books, but it’s not ideal. If you create normal sized pages, then on a phone they’ll either look tiny or require a lot of scrolling. Formats such as EPUB work better with a range of screen sizes.

Delivering text documents in PDF loses a lot of its value when the document is a scanned image rather than a text-based document. It can’t be searched, and people with visual disabilities can’t use text-to-speech. My condo association delivers its newsletters in scanned-image PDF. When I pointed out these problems at an owners’ meeting, I was told that the owners weren’t sophisticated enough to take advantage of those benefits. Our complex is a big one, and I’d be surprised if at least a few residents don’t use text-to-speech when they can. It’s not particularly hard to generate PDF files; scanning a finished document into a PDF seems like the hard way.

To maximize the usefulness of assistive technologies, you should create PDF/A if possible. It produces a slightly larger file, but it’s organized in a way that makes extraction of content easier and eliminates dependencies you might not have thought of.

Redacting PDFs is another tricky issue. If you simply black out an area, that’s the equivalent of gluing a piece of paper over it, and no harder to defeat. For advice on properly redacting documents, who better to turn to than the NSA? They may be a gang of criminals within the government, but they certainly know how to redact. It’s from 2006, though, so some of its advice could be dated.

There are lots of things you can do with PDF, but use it intelligently and where it’s appropriate.

Posted in commentary. Tags: . 2 Comments »

Ninjas, samurai, and artists

Lately I haven’t been posting as much on this blog. My professional responsibilities have shifted, and much as I still love the issues of file formats, I don’t have as much time to give attention to them. There are still general programming issues that are worth blogging about, though, and I’ll occasionally address these issues here, hopefully along with occasional file format posts.

cover thumbnail, Secrets of the JavaScript NinjaThis weekend I borrowed a book from the company library called Secrets of the JavaScript Ninja. It’s a better book than I expected from the title.

In an inspired error, the cover shows not a ninja but a samurai, with colorful armor, a banner, and a long sword. A ninja is a hit-and-run assassin; he shows up out of nowhere, attacks, and vanishes. A samurai is a dedicated soldier, the Japanese equivalent of a knight, and he follows a code of honor. The software development world has too many ninjas. The samurai is a better, if not ideal, model.

From the title I was expecting a cookbook, one of the many books that provide formulas to follow but no deep understanding. Instead, Secrets of the JavaScript Ninja is about understanding the language. Such books are even rarer for JavaScript than for most programming languages. I’d always tended to think of JavaScript as a half-baked derivative of Java, one where features such as classes, packages, and inheritance were left out to cut it down to a scripting language. This book, though, shows that it’s really quite a different language, a very powerful one in its own right. I still think the language has serious problems, the biggest being lack of standardization, but reading through the book, I’m learning how to think in JavaScript’s terms and to make use of features which other languages don’t have.

It’s not the language I want to talk about here, though; it’s the approach to any language or software technology. Too many programmers don’t have any deep understanding of their craft; they just have a bag of tools that they expect to solve problems for them. The worst can’t do much more than run a Web search for the code they need or beg on Stack Overflow for a solution to their problem. Once I actually had to fix some PHP that consultants from a big-name company had pasted from a website and didn’t know how to adapt to the problem at hand — and I don’t even know PHP! Those are your ninjas.

Even the samurai isn’t a great metaphor. They were a part of Japan’s entrenched feudal culture, and their opposition to capitalism promoted the warlike mindset that culminated in Japan’s role in World War II. Our metaphors should be based on creativity, not war and violence. We should think of ourselves as architects, sculptors, artists. We learn a craft and master the tools that go into it. There’s a real psychological affinity; I know a lot more software developers who are skilled musicians than are skilled fighters.

When I see a book called Secrets of the JavaScript Artist, then I’ll be pleased.

Update: I just noticed that the book itself says: “Ninjas were chosen for their martial arts skills rather than for their social standing or education. Dressed in black and with their faces covered, they were sent on missions alone or in small groups to attack the enemy with subterfuge and stealth, using any tactics to assure success; their only code was one of secrecy.” Just what you want in a programmer, right?

Posted in commentary. Tags: , , . Comments Off

In MS Word, the bullet bites back

There’s nothing new about Microsoft’s ignoring standards and ruining compatibility, but knowing the details is useful. One case I just learned about, from Mark Mandel, is the way it does bullet lists. This applies to the old Word DOC format on Mac OS X.

A 2008 OpenOffice Forum discussion explains the problem. If you create a bullet list in Word and import it into OpenOffice, the bullets are turned into something odd-looking. The file doesn’t use Unicode bullets, but instead uses the Microsoft Symbol font, which has its own nonstandard encoding. This applies only to bullets generated by list styles, not to ones you type in. On Windows, OpenOffice will display the files correctly, since it has access to the needed fonts and mapping.

Apparently the issue can also be manifested when creating a DOC file with OpenOffice and importing into Word, though I’m not clear on how that happens.

The problem is that Word 97/2000/2002 isn’t fully Unicode-compatible, mapping Unicode characters to the 8-bit encodings that its fonts need. This has presumably been fixed in the more recent versions that use DOCX (Office Open XML), but DOC is still widely used as an interchange format, so it’s an important issue. It’s also an illustration of the risks of using undocumented interchange formats.

Tools come and go, effort must be ongoing

In a comment on a JHOVE bug, I said offhandedly that it’s approaching the end of its life. This caused a certain amount of concern in Twitter discussions. Andy said that software tools are one of the best ways to “preserve specific, reproducible knowledge about processes.” I don’t think dropping support of a rather dated tool is a big concern, though, as long as the code doesn’t vanish.

A software application is good for a certain number of years before it needs to be either left as legacy code or completely rewritten. Throwing out code and starting over takes a lot of effort, but it can result in much better code. I started on JHOVE in 2003 as a contractor to the Harvard University Libraries. After a few years it became clear that some of the design decisions weren’t ideal. Its all-or-nothing approach and its tendency to give up after the first error have long been obvious problems. The PDF module is a kludge built on a crock, and that’s without even talking about its profiles. The TIFF module, on the other hand, has a fair amount of elegance.

JHOVE2 was supposed to be the successor to JHOVE. Its creators learned from JHOVE and produced a better design. What they didn’t have was enough time and money to cover all the formats that JHOVE covered. I’ve continued to work on JHOVE because I know it inside and out. Someone else could pick up the work, but it might make more sense for a newcomer to the code to join the JHOVE2 effort instead. However, Maurice noted on Twitter that there hasn’t been much activity lately on JHOVE2 issues.

Both JHOVE and JHOVE2 were funded under grants. When the grant money ended, progress slowed down. The one-time grant model is the wrong way to fund preservation software. It’s an ongoing effort; new formats arise and old ones change, and there are always bugs to fix. What I’d like to see happen is for major libraries in the US to create an ongoing consortium for preservation work, similar to the Planets project in Europe. Or better yet, a consortium bringing together libraries all over the world. It wouldn’t take a lot from any individual institution. Its job would be to maintain information, preservation tools, test suites, and so on, on an ongoing basis. Instead of rushing to create a tool and then leaving it to freelancers like (formerly) me to maintain, it would support maintenance of tools for as long as it made sense and creation of new ones when it’s appropriate.

My voice isn’t enough to call anything like this into existence, but I can hope.

Posted in commentary. Tags: , , . Comments Off

Rescuing Macintosh Files

On Wednesday, September 4, 2013, I talked with a small gathering of the Mac Tech Group at MIT on “Rescuing Macintosh Files.” There was a good discussion, with several people contributing valuable points.

The computer presentation which I used is available as a Powerpoint or OpenOffice document. The PowerPoint one had some problems at MIT with displaying all the images, so if you have a choice the PowerPoint one may work better.

Posted in commentary. Tags: , , , . Comments Off

TIFF/EP vs. Exif

I just discovered today that there are two different TIFF tags called “FocalPlaneResolutionUnit.” Tag 41488 goes by this name and is part of the Exif tag set. Accepted values for it are:

  • 1 = No absolute unit of measurement
  • 2 = Inch
  • 3 = Centimeter

Tag 37392 is a TIFF/EP (Electronic Photography) tag (working draft, final version not available online), also used in other raw formats, including DNG. Its accepted values are:

  • 1 = Inch
  • 2 = Metre
  • 3 = Centimetre
  • 4 = Millimetre
  • 5 = Micrometre

Recently I was sent a TIFF file, as a JHOVE issue, that had a tag 41488 with a value of 4. JHOVE correctly, but perhaps confusingly, reported that the fFocalPlaneResolutionUnit tag had an invalid value.

There are other tags in TIFF/EP that are equivalent, or nearly, to Exif tags. In some cases their values are identically specified, sometimes not. The Exif SubjectLocation tag is numbered 41492 and always has two shorts for its value, giving an X and Y value. The TIFF/EP counterpart is tag 37396, which can also have three shorts (specifying a circle) or four (specifying a rectangle).

I don’t know how this came about, but it’s something to watch out for in software that deals with both Exif and TIFF/EP tags. Some software may accept the EP extensions for Exif tags, but there’s no guarantee this will work.

Posted in commentary. Tags: , . Comments Off
Follow

Get every new post delivered to your Inbox.

Join 55 other followers