The New York Times package mapper

From Nieman Lab, an interesting look at how the NYT maps traffic between stories, and analyses why and how individual pieces drive onward traffic or cause people to click away from the site.

One example has been in our coverage of big news events, which we tend to blanket with all of the tools at our disposal: articles (both newsy and analytical) as well as a flurry of liveblogs, slideshows, interactive features, and video. But we can’t assume that readers will actually consume everything we produce. In fact, when we looked at how many readers actually visited more than a single page of related content during breaking news the numbers were much lower than we’d anticipated. Most visitors read only one thing.
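The kind of story-to-story mapping described above can be sketched very simply: count transitions between pages and then ask what share of a story’s readers went on to anything else. This is a minimal illustration, not the NYT’s actual tool; the `(referrer, page)` log format and the function names are assumptions.

```python
from collections import Counter

def build_story_graph(pageviews):
    """Count story-to-story transitions from (referrer, page) pairs.

    `pageviews` is assumed to be an iterable of (referrer_url, page_url)
    tuples pulled from an analytics log -- a simplification of whatever
    the real tool ingests.
    """
    edges = Counter()
    for referrer, page in pageviews:
        if referrer and referrer != page:
            edges[(referrer, page)] += 1
    return edges

def onward_rate(edges, story, total_views):
    """Share of a story's views that clicked through to another page."""
    onward = sum(n for (src, _), n in edges.items() if src == story)
    return onward / total_views if total_views else 0.0
```

With a toy log of three pageviews where a liveblog sends readers on to an analysis piece and a gallery, `onward_rate` for the liveblog comes out at two-thirds; the Times’s finding was that, in reality, this number is usually far lower.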

This tool’s been used to make decisions and change individual stories to improve performance in real time. That’s the acid test of tools like this – do they actually get used?

But the team that uses it is the data team, not the editorial team – yet. Getting editors to use it regularly is, it seems, about changing these data-heavy visualisations into something editors are already used to seeing as part of their workflow:

we’re thinking about better ways to automatically communicate these insights and recommendations in contexts that editors are already familiar with, such as email alerts, instant messenger chat bots, or perhaps something built directly into our CMS.

It’s not just about finding the data. It’s also about finding ways to use it and getting it to the people best placed to do so in forms that they actually find useful.

Forensic stylometry

Fascinating post on Language Log about the analysis of Robert Galbraith’s The Cuckoo’s Calling, and how the analyst reached the conclusion that JK Rowling was a possible author.

For the past ten years or so, I’ve been working on a software project to assess stylistic similarity automatically, and at the same time, test different stylistic features to see how well they distinguish authors. De Morgan’s idea of average word lengths, for example, works — sort of. If you actually get a group of documents together and compare how different they are in average word length, you quickly learn two things. First, most people are average in word length, just as most people are average in height. Very few people actually write using loads of very long words, and few write with very small words, either. Second, you learn that average word length isn’t necessarily stable for a given author. Writing a letter to your cousin will have a different vocabulary than a professional article to be published in Nature. So it works, but not necessarily well. A better approach is not to use average word length, but to look at the overall distribution of word lengths. Still better is to use other measures, such as the frequency of specific words or word stems (e.g., how often did Madison use “by”?), and better yet is to use a combination of features and analyses, essentially analyzing the same data with different methods and seeing what the most consistent findings are. That’s the approach I took.
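The combination the author describes – a word-length distribution plus frequencies of common function words, compared across documents – can be sketched in a few lines. This is a toy illustration of the general technique, not the analyst’s software; the feature set, the word list and the distance measure are all assumptions.

```python
import re
from collections import Counter

# An illustrative subset of English function words (real systems use many more).
FUNCTION_WORDS = ["the", "of", "and", "to", "by", "in", "a", "that"]

def features(text, max_len=12):
    """Combine a word-length distribution with function-word frequencies."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1
    lengths = Counter(min(len(w), max_len) for w in words)
    # Proportion of words at each length 1..max_len, then each function word's rate.
    vec = [lengths[i] / total for i in range(1, max_len + 1)]
    vec += [words.count(w) / total for w in FUNCTION_WORDS]
    return vec

def distance(a, b):
    """Euclidean distance between feature vectors: smaller = more alike."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
```

A real stylometric system, as the post notes, would run several such feature sets through several analyses and look for consistent agreement, rather than trusting any single distance score.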

It’s interesting not just for its insight into a field that rarely comes into the public eye, but also for what’s written between the lines about how authors write. It suggests that, unless we really make an effort to disguise it, most writers have a linguistic fingerprint of sorts: a set of choices that we tend to make in roughly similar ways, often enough for a machine to notice when taken in aggregate. A writer’s voice goes beyond stylistic choices, genre and word choice, and comes down to the basic mechanics of the language they use.

Why liveblogs almost certainly don’t outperform articles by 300%

In response to this study, which has been widely linked to.

  1. The sample size is 28 pieces of content across 7 news stories – and that content includes liveblogs, articles and picture galleries. That’s a startlingly small number for a sample that is meant to be representative.
  2. The study does not look at how these stories were promoted, or whether they were running stories (suited to live coverage), reaction blogs, or other things.
  3. The traffic sample is limited to news stories, and does not include sports, entertainment or other areas where liveblogs may be used, and that may have different traffic profiles.
  4. The study compares liveblogs, which often take a significant amount of time and editorial resource, with individual articles and picture galleries, some of which may take much less time and resource. If a writer can create four articles in the time it takes to create a liveblog, then the better comparison is between a liveblog and the equivalent amount of individual, stand-alone pieces.
  5. The study is limited to the Guardian. There’s no way to compare the numbers with other publications that might treat their live coverage differently, so no way to draw conclusions on how much of the traffic is due to the way the Guardian specifically handles liveblogs.
  6. The 300% figure refers to pageviews. Leaving aside the fact that this is not necessarily the best metric for editorial success, the Guardian’s liveblogs autorefresh, inflating the pageview figure for liveblogs.
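Point 4 above is essentially an argument for normalising by effort rather than comparing raw pageviews. A back-of-the-envelope sketch, with entirely hypothetical numbers, shows how a big raw advantage can invert once you account for editorial time:

```python
def views_per_hour(pageviews, hours_of_effort):
    """Normalise raw pageviews by the editorial time a piece consumed."""
    return pageviews / hours_of_effort

# Hypothetical figures, purely to illustrate the point: a liveblog that
# took 8 hours and drew 30,000 views, versus four stand-alone articles
# at 2 hours and 10,000 views each.
liveblog_rate = views_per_hour(30000, 8)       # 3750.0 views per editorial hour
articles_rate = views_per_hour(4 * 10000, 8)   # 5000.0 views per editorial hour
```

On raw pageviews the liveblog "wins" 30,000 to 10,000 against any single article; per hour of effort, the articles come out ahead. Which comparison is fair depends entirely on what you hold constant.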

All that shouldn’t diminish the study’s other findings, and of course it doesn’t mean that the headline figure is necessarily wrong. But I would take it with a hefty pinch of salt.

Junk data: why we still have no idea what the DfT’s most popular websites are

A couple of stories in the Telegraph and Daily Mail this week have hailed data released by the Department for Transport about the websites visited most often by workers at their department.

But if you look a little more closely at the raw data, it quickly becomes clear that these figures are being badly misrepresented by the newspapers involved. There’s a very important note on the last page of the data PDF (fascinatingly, missing from the Mail’s repost). It says:

Note: “number of hits” includes multiple components (e.g. text, images, videos), each of which are counted.

The difference between page views, visits and hits in web analytics matters. Page views count the individual pages on a site that have been viewed; visits count separate browsing sessions; and hits count every individual file requested by the browser.

An individual page view can include dozens, or even hundreds, of hits. A single page view of the Telegraph front page, for instance, includes at least 18 hits in the header of the page alone. That’s before we get to any images or ads; roughly another 40 image files are called on top of that. It’s fair to suggest you could rack up the hits very quickly on most news websites – whereas very simple, single-purpose sites might register 10 or fewer per pageview.
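The three metrics can be made concrete with a tiny log tally. This is a deliberately crude sketch – the `(session_id, url)` log format is invented, and treating anything ending in `.html` or `/` as a "page" is a stand-in for the real classification an analytics package would do:

```python
# Crude heuristic: treat these URL endings as page documents.
PAGE_ENDINGS = (".html", "/")

def summarise(log):
    """Tally hits, page views and visits from (session_id, url) records.

    Every request is a hit; only requests for page documents count as
    page views; each distinct session counts as one visit.
    """
    hits = len(log)
    pageviews = sum(1 for _, url in log if url.endswith(PAGE_ENDINGS))
    visits = len({session for session, _ in log})
    return hits, pageviews, visits

# One person loading one page pulls down many supporting files:
log = [("s1", "/index.html"),
       ("s1", "/styles.css"), ("s1", "/app.js"),
       ("s1", "/logo.png"), ("s1", "/ad-tracker.gif")]
# → 5 hits, but only 1 page view and 1 visit
```

Five hits for one page view here; on a news front page, per the Telegraph example above, the ratio would be far higher – which is exactly why ranking sites by hits is misleading.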

It’s also important to note that if a website serves files from other domains – such as advertisements or tracking codes – those domains will register hits despite never actually being seen by the person doing the browsing.

That explains why the second “most popular” site on the list is a domain that is impossible to visit, but which serves incredibly popular tracking code on millions of other websites. It’s probably safe to conjecture that the same effect explains other abnormalities: several more domains in the top 10 that are impossible to actually visit, and two IP addresses in the top 11 “most popular” sites.

As David Higgerson points out (in comments), there are some interesting patterns in the data. But unless you know the number of hits per page at the time the pages were viewed, as well as which ads were served from which other sites at the time, any straight comparison of the figures is meaningless. And the data itself is so noisy that any conclusions are dubious at best.

We can say that the BBC website is certainly popular, that the Bears Faction Lorien Trust LARP site probably got more visits than you might expect, and that civil servants do seem to like their news. Beyond that, the Mail’s claims of “cyberslacking”, of gambling sites (common advertisers) being popular and of there being six separate BBC sites in the top 10 are at best unsupported and at worst downright misleading.