Forensic stylometry

Fascinating post on Language Log about the analysis of Robert Galbraith’s The Cuckoo’s Calling, and how the analyst reached the conclusion that JK Rowling was a possible author.

“For the past ten years or so, I’ve been working on a software project to assess stylistic similarity automatically, and at the same time, test different stylistic features to see how well they distinguish authors. De Morgan’s idea of average word lengths, for example, works — sort of. If you actually get a group of documents together and compare how different they are in average word length, you quickly learn two things. First, most people are average in word length, just as most people are average in height. Very few people actually write using loads of very long words, and few write with very small words, either. Second, you learn that average word length isn’t necessarily stable for a given author. Writing a letter to your cousin will have a different vocabulary than a professional article to be published in Nature. So it works, but not necessarily well. A better approach is not to use average word length, but to look at the overall distribution of word lengths. Still better is to use other measures, such as the frequency of specific words or word stems (e.g., how often did Madison use ‘by’?), and better yet is to use a combination of features and analyses, essentially analyzing the same data with different methods and seeing what the most consistent findings are. That’s the approach I took.”

The post is interesting not just for its insight into a field that rarely comes into the public eye, but also for what’s written between the lines about how authors write. It suggests that, unless we really make an effort to disguise it, most writers have a linguistic fingerprint of sorts: a set of choices that we tend to make in roughly similar ways, often enough for a machine to notice when taken in aggregate. A writer’s voice goes beyond conscious stylistic choices, genre and vocabulary, and comes down to the basic mechanics of the language they use.
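
To make the features mentioned in the quoted passage a little more concrete, here is a minimal Python sketch of three of them: average word length, the distribution of word lengths, and how often a specific word turns up. This is not the analyst's software, just my own rough illustration. The function names, the crude tokenizer and the choice of distance measure are all assumptions on my part.

```python
import re
from collections import Counter


def words(text):
    """Lowercase word tokens. A deliberately crude tokenizer (an assumption,
    not anything from the post): ASCII letters, optionally one inner apostrophe."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def mean_word_length(text):
    """Average word length in characters. Assumes the text has at least one word."""
    ws = words(text)
    return sum(len(w) for w in ws) / len(ws)


def length_distribution(text, max_len=12):
    """Share of words at each length 1..max_len.
    Words longer than max_len are pooled into the last bucket."""
    ws = words(text)
    counts = Counter(min(len(w), max_len) for w in ws)
    return [counts[n] / len(ws) for n in range(1, max_len + 1)]


def word_rate(text, word):
    """Occurrences of `word` per 1,000 words of text."""
    ws = words(text)
    return 1000 * ws.count(word) / len(ws)


def distribution_distance(text_a, text_b):
    """Total variation distance between two word-length distributions:
    0 means identical, 1 means they share no word lengths at all."""
    dist_a = length_distribution(text_a)
    dist_b = length_distribution(text_b)
    return 0.5 * sum(abs(x - y) for x, y in zip(dist_a, dist_b))


if __name__ == "__main__":
    sample_a = "the cat and the dog"
    sample_b = "I write a story"
    print(mean_word_length(sample_a), mean_word_length(sample_b))
    print(distribution_distance(sample_a, sample_b))
    print(word_rate("by the way, by and by", "by"))
```

The two toy samples have exactly the same average word length, yet their length distributions do not overlap at all, which is the point the passage makes about the average hiding most of the signal. A real comparison would need far longer texts and, as the passage says, a combination of several features rather than any single one.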