Junk data: why we still have no idea what the DfT’s most popular websites are

A couple of stories in the Telegraph and Daily Mail this week have hailed data released by the Department for Transport about the websites visited most often by workers at their department.

But if you look a little more closely at the raw data, it quickly becomes clear that these figures are being badly misrepresented by the newspapers involved. There’s a very important note on the last page of the data PDF (fascinatingly, missing from the Mail’s repost). It says:

Note : “number of hits” includes multiple components (e.g. text, images, videos), each of which are counted.

The difference between page views, visits and hits in web analytics is fairly important. Page views is the number of individual pages on a site that have been viewed; visits is the number of separate browsing sessions that have occurred. And hits is the number of individual files that are requested by the browser.
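
To make the distinction concrete, here is a minimal sketch in Python. The request log, session IDs and file names are all invented for illustration and are not taken from the DfT data. Treating every request for an .html file as a page view is also a simplifying assumption; real analytics tools use more careful rules.

```python
# Toy request log: (session_id, requested_file). All values are invented.
requests = [
    ("s1", "/index.html"),
    ("s1", "/style.css"),
    ("s1", "/logo.png"),
    ("s1", "/story.html"),
    ("s1", "/photo.jpg"),
    ("s2", "/index.html"),
    ("s2", "/style.css"),
    ("s2", "/logo.png"),
]


def summarise(log):
    """Count hits, page views and visits in a list of (session, path) pairs."""
    # A hit is any file the browser requested.
    hits = len(log)
    # Assumption: only requests for .html documents count as page views.
    page_views = sum(1 for _, path in log if path.endswith(".html"))
    # A visit is one distinct browsing session.
    visits = len({session for session, _ in log})
    return {"hits": hits, "page_views": page_views, "visits": visits}


print(summarise(requests))
```

Even in this tiny example the three numbers differ: the same log can be described as eight hits, three page views or two visits, depending on which measure you pick.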

An individual page view can include dozens, or even hundreds, of hits. A single page view of the Telegraph front page, for instance, includes at least 18 hits in the header alone. That’s before we get to any images or ads: about another 40 image files are called. It’s fair to suggest you could rack up the hits very quickly on most news websites, whereas very simple, single-purpose sites might register 10 or fewer per page view.

Also important to note: if a website serves files from other sites, such as advertisements or tracking code, those sites will register a hit despite never actually being seen by the person doing the browsing.

That explains why the second “most popular” site on the list is www.google-analytics.com – a domain that is impossible to visit, but which serves incredibly popular tracking code on millions of other websites. It’s probably safe to conjecture that it also explains the presence of other abnormalities – for instance, stats.bbc.co.uk, static.bbc.co.uk, news.bbcimg.co.uk, and cdnedge.bbc.co.uk, all in the top 10 and all impossible to actually visit. There are two IP addresses in the top 11 “most popular” sites, too.
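
A rough sketch of how this happens, again in Python. The URLs below are hypothetical (they use the reserved .example domain) and stand in for one page view of an imaginary news site; nothing here is drawn from the DfT figures.

```python
from collections import Counter
from urllib.parse import urlsplit

# One hypothetical page view: every URL the browser fetches to draw a single page.
page_view = [
    "http://www.news.example/index.html",
    "http://static.news.example/css/site.css",
    "http://static.news.example/js/site.js",
    "http://img.news.example/lead.jpg",
    "http://img.news.example/thumb1.jpg",
    "http://img.news.example/thumb2.jpg",
    "http://tracker.example/collect.gif",
    "http://ads.example/banner.js",
]


def hits_per_host(urls):
    """Tally hits by hostname, the way a hit-based 'most popular sites' list would."""
    return Counter(urlsplit(url).hostname for url in urls)


# Ten views of the same page, ranked by number of hits per hostname.
ranking = hits_per_host(page_view * 10).most_common()
for host, hits in ranking:
    print(host, hits)
```

In this toy ranking the image and static-file servers come out ahead of the site the reader actually typed in, and the tracker and ad hosts draw level with it, even though nobody chose to visit any of them.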

As David Higgerson points out (in comments), there are some interesting patterns in the data. But unless you know the number of hits per page, at the time the pages were viewed, as well as which ads were served from which other sites at the time, any straight comparison of the figures is meaningless. And the data itself is so noisy that any conclusions are dubious at best.

We can say that the BBC website is certainly popular, that the Bears Faction Lorien Trust LARP site probably got more visits than you might expect, and that civil servants do seem to like their news. Beyond that, the Mail’s claims of “cyberslacking”, of gambling sites (common advertisers) being popular and of there being six separate BBC sites in the top 10 are at best unsupported and at worst downright misleading.

Published by

Mary Hamilton

I'm a journalist-type tech-ish geek person, working in that interesting ambiguous place where reporting the news meets all sorts of peripheral skills. In my spare time I herd zombies, design games and write stuff.

7 thoughts on “Junk data: why we still have no idea what the DfT’s most popular websites are”

  1. I checked out the two highest ranked IP addresses. They both belonged to Akamai, which run a content distribution network, so could represent an amalgamation of visits to any number of sites using their services that happened to be hosted on that IP’s cache at the time. It actually begs the question which is worse – the way the press have done the analysis, or the way the data was released.

  2. @currybet Good point. I’d love to see how the FOI request was worded – the DfT doesn’t seem to include the questions in their disclosure log, and it wasn’t made through They Work For You as far as I can see. I suspect this is a double whammy of a bad data release and press misinterpretation – possibly in part because there’s an assumption the data answers a question that it doesn’t.

  3. @newsmary At least the Telegraph has the vague excuse that http://www.telegraph.co.uk is the first of their domains to be listed. The Mail should surely have twigged the data was a bit suspect when they spotted that their top entry was i.dailymail.co.uk which is their image server – but in the write-up that accompanies the data (http://www.dailymail.co.uk/news/article-2020001/Cyberslackers-Civil-Service-Officials-make-tens-thousands-web-visits-taxpayers-time-money.html) they explicitly say that makes them the 31st most popular site to visit.

  4. @currybet David Higgerson has managed to find the accompanying letter – http://assets.dft.gov.uk/foi/dft-f0007532/f0007532-letter.pdf – which asks: “Q1. A breakdown of websites visited by staff in descending order from most to least visited (if possible, please include the number of hits each website received).”

    To me that looks like a requester who doesn’t know the difference between pages viewed and hits, and I’m not totally sure what the DfT should have done in this situation. They’re certainly very clear in the letter that hits and pageviews aren’t the same thing – so it looks a lot more like press misinterpretation to me.

  5. I’m so pleased you wrote this. While I was reading the reports I kept thinking “but hits tells you next to nothing”.

    I am left wondering why hits continues to be used as a synonym for page views by many when it has never, in fact, meant the same thing.

  6. @likeaword Glad to be of service! I suspect it boils down to a very basic lack of understanding of how browsers work, and the fact that “hits” used to be the measure of choice back when the internet was young. Some folks never understood what it meant, just knew it measured usage in some way, and have never decided (or in some cases needed) to update or broaden that knowledge.
