Requesting politely to stay in the dark will not serve journalism

At Salon, Richard constantly analyzed revenue per thousand page views vs. cost per thousand page views, unit by unit, story by story, author by author, and section by section. People didn’t want to look at this data because they were afraid that unprofitable pieces would be cut. It was the same pushback years before with basic traffic data. People in the newsroom didn’t want to consult it because they assumed you’d end up writing entirely for SEO. But this argument assumes that when we get data, we dispense with our wisdom. It doesn’t work that way. You can continue producing the important but unprofitable pieces, but as a business, you need to know what’s happening out there. Requesting politely to stay in the dark will not serve journalism.

– from Matt Stempeck’s liveblog of Richard Gingras’s Nieman Foundation speech

Junk data: why we still have no idea what the DfT’s most popular websites are

A couple of stories in the Telegraph and Daily Mail this week have highlighted data released by the Department for Transport about the websites its staff visit most often.

But if you look a little more closely at the raw data, it quickly becomes clear that these figures are being badly misrepresented by the newspapers involved. There’s a very important note on the last page of the data PDF (fascinatingly, missing from the Mail’s repost). It says:

Note : “number of hits” includes multiple components (e.g. text, images, videos), each of which are counted.

The difference between page views, visits and hits in web analytics matters here. Page views are the number of individual pages on a site that have been viewed; visits are the number of separate browsing sessions that have occurred; and hits are the number of individual files requested by the browser.
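To make the three definitions concrete, here is a minimal Python sketch that counts each one from the same toy request log. Everything in it is invented for illustration: the log rows, the user name, the is_page flag and the 30-minute inactivity cut-off used to split visits (a common convention, but an assumption here, not something taken from the DfT data).

```python
from datetime import datetime, timedelta

# Hypothetical request log: (timestamp, user, path, is_page).
# is_page marks the HTML document itself; the other rows are its assets.
LOG = [
    (datetime(2012, 1, 9, 9, 0, 0), "alice", "/index.html", True),
    (datetime(2012, 1, 9, 9, 0, 1), "alice", "/style.css", False),
    (datetime(2012, 1, 9, 9, 0, 1), "alice", "/logo.png", False),
    (datetime(2012, 1, 9, 9, 5, 0), "alice", "/story.html", True),
    (datetime(2012, 1, 9, 9, 5, 1), "alice", "/photo.jpg", False),
    (datetime(2012, 1, 9, 14, 0, 0), "alice", "/index.html", True),
    (datetime(2012, 1, 9, 14, 0, 1), "alice", "/style.css", False),
]

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity cut-off for a visit


def summarise(log):
    """Count hits, page views and visits from the same set of requests."""
    hits = len(log)  # every requested file is a hit
    page_views = sum(1 for _, _, _, is_page in log if is_page)

    visits = 0
    last_seen = {}  # user -> timestamp of that user's previous request
    for ts, user, _, _ in sorted(log, key=lambda row: row[0]):
        previous = last_seen.get(user)
        # A new visit starts on first sight or after a long enough gap.
        if previous is None or ts - previous > SESSION_GAP:
            visits += 1
        last_seen[user] = ts

    return {"hits": hits, "page_views": page_views, "visits": visits}


summary = summarise(LOG)
print(summary)
```

The same seven requests come out as seven hits, three page views and two visits, which is why the three figures cannot be swapped for one another.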

An individual page view can include dozens, or even hundreds, of hits. A single page view of the Telegraph front page, for instance, includes at least 18 hits in the header alone. That’s before we get to any images or ads: the page calls about another 40 image files. It’s fair to suggest you could rack up the hits very quickly on most news websites, whereas very simple, single-purpose sites might register 10 or fewer per page view.
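As a rough worked example of how much this skews a ranking, the sketch below multiplies the same number of page views by the two hits-per-page figures mentioned above. The 18 and 40 come from the Telegraph count in this post; the 100 page views and the choice of 10 hits for the simple site (the upper end of "10 or fewer") are illustrative assumptions.

```python
HEADER_HITS = 18  # "at least 18" hits in the header
IMAGE_HITS = 40   # "about another 40" image files
NEWS_HITS_PER_PAGE = HEADER_HITS + IMAGE_HITS  # a floor: ads not counted
SIMPLE_HITS_PER_PAGE = 10  # assumed, upper end of "10 or fewer"


def hits_for(page_views, hits_per_page):
    """Hits logged for a given number of page views."""
    return page_views * hits_per_page


page_views = 100  # illustrative: identical real usage of both sites

news_hits = hits_for(page_views, NEWS_HITS_PER_PAGE)
simple_hits = hits_for(page_views, SIMPLE_HITS_PER_PAGE)

print("news site hits:  ", news_hits)
print("simple site hits:", simple_hits)
print("ratio:           ", news_hits / simple_hits)
```

With identical usage, the news site logs 5,800 hits against 1,000 for the simple one, so it looks nearly six times as "popular" on a hits table.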

It’s also important to note that if a website serves files from other sites, such as advertisements or tracking codes, those sites will register a hit despite never actually being seen by the person doing the browsing.

That explains why the second “most popular” site on the list is www.google-analytics.com – a domain that is impossible to visit, but which serves incredibly popular tracking code on millions of other websites. It’s probably safe to conjecture that it also explains the presence of other abnormalities – for instance, stats.bbc.co.uk, static.bbc.co.uk, news.bbcimg.co.uk, and cdnedge.bbc.co.uk, all in the top 10 and all impossible to actually visit. There are two IP addresses in the top 11 “most popular” sites, too.
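A small Python sketch shows how that happens. It tallies hits by hostname for the requests a browser might make for one page view. The URLs are made up (the .example hosts are placeholders, not real sites), and the mix of requests is an assumption about a typical news page rather than anything measured.

```python
from collections import Counter
from urllib.parse import urlsplit

# Made-up requests triggered by ONE page view of a hypothetical news site.
REQUESTS = [
    "http://www.news.example/",                 # the page the reader asked for
    "http://static.news.example/css/site.css",  # assets on a separate host
    "http://static.news.example/js/site.js",
    "http://img.news.example/lead-photo.jpg",
    "http://img.news.example/thumb-1.jpg",
    "http://tracker.example/collect.gif",       # third-party tracking code
    "http://ads.example/banner.js",             # third-party advert
]


def hits_by_host(urls):
    """Count requests per hostname, the way a hits-per-site table would."""
    return Counter(urlsplit(url).hostname for url in urls)


counts = hits_by_host(REQUESTS)
for host, n in counts.most_common():
    print(host, n)
```

The reader chose to visit one site, yet five hostnames pick up hits, and the asset hosts each score higher than the address that was actually typed in.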

As David Higgerson points out (in comments), there are some interesting patterns in the data. But unless you know the number of hits per page, at the time the pages were viewed, as well as which ads were served from which other sites at the time, any straight comparison of the figures is meaningless. And the data itself is so noisy that any conclusions are dubious at best.

We can say that the BBC website is certainly popular, that the Bears Faction Lorien Trust LARP site probably got more visits than you might expect, and that civil servants do seem to like their news. Beyond that, the Mail’s claims of “cyberslacking”, of gambling sites (common advertisers) being popular and of there being six separate BBC sites in the top 10 are at best unsupported and at worst downright misleading.