In data we generally can’t observe the things we want to measure

Big data and data journalism have grown rapidly. Free data and efficient tools for data analysis are becoming more and more widely available. Besides (multinational) businesses and governments, news organisations are increasingly aware of these data sets and the possibilities they offer. According to Aitamurto, Sirkkunen and Lehtonen (2011), journalists search through large collections of data and use statistical methods, visualizations and interactive tools to find and create news. But what is ‘news’?

According to Harcup and O’Neill (2001), a story has to satisfy one or more news values before it becomes news. They list ten values that can make an item newsworthy. In short, stories that involve influential or famous people, entertainment, an element of surprise or great relevance to society are the most likely to become news. In addition, a news organisation always has its own agenda, which may include stories that satisfy a particular need or demand. For data journalism, news is mostly about numbers and the hidden stories they might contain.

Although numbers don’t lie, data analysis comes with certain risks. Data can be manipulated, misinterpreted or even misused. If journalists misinterpret data, the resulting news may mislead the reader and paint a distorted picture of reality. In this article we’ll look at the risks of using data sets to find news in numbers, and at what one should keep in mind when using data to tell a story.

When women don’t get married, they’re screwed.

This seems a bit radical, yet journalists have drawn this conclusion. Paul Bradshaw, an online journalist at Birmingham City University, argues that data journalism can start in two ways: ‘there is a question that needs data, or there is a dataset that needs questioning’. But despite the growth of big data from public services, businesses and governmental organisations, not all data is of journalistic value or contains a newsworthy story. It is very important to analyse the available data thoroughly and to determine whether it contains news or can support a story.

An article published by The Washington Post last year shows how easily open data can be misinterpreted or misused. The article was about violence against women, and in particular about which women are more likely to become victims of assault or abuse. According to the journalists, data analysis showed that married women are safer than unmarried women, and that girls raised by their own (married) father are less likely to be abused or assaulted than girls raised without their own father. These claims were based on a graph published in 2012 by the United States Department of Justice. According to Shannon Catalano, a statistician at the Bureau of Justice Statistics and the author of the study used in the article, her data was presented without sufficient context.

[Image: graph from The Washington Post article]

Many more factors are associated with violence against women and should have been mentioned. The Washington Post used data on a single variable, household composition, drew its conclusions from that, and told its readers a story based on a misinterpretation of the original data. One could say the available data was misused in order to create a news article. And this is where my statement ‘In data we generally can’t observe the things we want to measure’ comes in. It is very rare that a specific question can be answered directly from the observations in a data set. When we search for newsworthy stories in data, we usually have to look for information that is related to the question we want to answer. This can mean searching multiple sources and data sets before we eventually find a newsworthy story.

OK, let’s combine multiple data sets then…

Another way to search for hidden stories in data is to combine multiple data sets. Simon Rogers, former editor at The Guardian, has combined multiple data sets and visualized them together in the past. In 2012 The Guardian combined data on the homicide-by-firearm rate per 100,000 people and the percentage of homicides committed with a firearm, and produced a map showing the average number of firearms per 100 people around the world.

1. Homicide by firearm rate per 100,000 2. Percentage of homicides by firearm 3. Average firearms per 100 people

Although mashing up different data sets is a powerful tool for processing data and news, it carries some risks as well. In data journalism it is often easy to visualize and show what is going on, but much harder to find out why it is happening. If we look at the homicide/firearms maps from The Guardian, we may be inclined to think that a larger number of firearms causes more homicides, but this is not necessarily true. To draw such a conclusion, one would need statistical methods to test the relationships behind the patterns found in the data.
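
To make the gap between "what" and "why" concrete, here is a minimal sketch in Python. It uses purely synthetic numbers, not The Guardian's data. It assumes a hypothetical hidden factor `z` that drives two measured quantities `x` and `y`, which never influence each other. All names (`pearson`, `residuals`, `simulate_confounder`) are illustrative choices of mine.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def residuals(ys, zs):
    """The part of ys that a straight-line fit on zs cannot explain."""
    n = len(ys)
    mean_y = sum(ys) / n
    mean_z = sum(zs) / n
    slope = (sum((z - mean_z) * (y - mean_y) for z, y in zip(zs, ys))
             / sum((z - mean_z) ** 2 for z in zs))
    return [(y - mean_y) - slope * (z - mean_z) for y, z in zip(ys, zs)]


def simulate_confounder(n=2000, seed=7):
    """Synthetic data: z drives both x and y; x and y never interact."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]      # hypothetical hidden factor
    x = [zi + rng.gauss(0, 1) for zi in z]       # first measured quantity
    y = [zi + rng.gauss(0, 1) for zi in z]       # second measured quantity
    raw = pearson(x, y)                          # what a map or chart would show
    # Partial correlation: correlate what is left after removing z's influence.
    controlled = pearson(residuals(x, z), residuals(y, z))
    return raw, controlled


if __name__ == "__main__":
    raw, controlled = simulate_confounder()
    print(f"correlation of x and y:          {raw:.2f}")
    print(f"after controlling for hidden z:  {controlled:.2f}")
```

In this setup the raw correlation comes out clearly positive, while the correlation after controlling for `z` should sit close to zero. A visualization alone would only ever show the first number.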

True relationship or merely a coincidental artifact?

This leads us to the hardest part of data journalism: finding a true relationship. When analyzing data sets, the main difficulty is to find patterns that actually represent a true relationship. According to Harford (2014), finding causation in big data is much more difficult than finding correlations, and sometimes even impossible. Searching for patterns can also lead to multiple-comparison problems: journalists who look at as many patterns as possible and calculate every correlation they can will find some that look significant purely by chance. But the main problem is that if you do not know what is behind a correlation, you have no idea what caused it in the first place. In the homicide/firearm example from The Guardian, it is easy to guess from the data that an increased number of firearms causes more homicides. But if you merely guess at the cause of a correlation, you risk getting a distorted view of the actual relationships in the data.
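
The multiple-comparison problem is easy to reproduce. The sketch below is a rough illustration in Python with made-up random numbers, so none of the series has anything to do with any other. It compares one random "target" series against 1,000 unrelated random series of 20 points each and counts how many pass the usual 5% significance cutoff, which for 20 points corresponds to a correlation of roughly 0.444 in absolute value. The function names and the default settings are my own assumptions.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(n_series=1000, n_points=20, threshold=0.444, seed=1):
    """Correlate one noise series with many unrelated noise series.

    Returns how many correlations exceed the threshold in absolute value,
    and the strongest absolute correlation found.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    hits = 0
    strongest = 0.0
    for _ in range(n_series):
        # Each candidate is pure noise, generated independently of the target.
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        r = pearson(target, candidate)
        strongest = max(strongest, abs(r))
        if abs(r) > threshold:
            hits += 1
    return hits, strongest


if __name__ == "__main__":
    hits, strongest = count_spurious()
    print(f"{hits} of 1000 unrelated series pass the 5% cutoff")
    print(f"strongest chance correlation: {strongest:.2f}")
```

Around 5% of the series, so about 50 of the 1,000, should clear the cutoff even though every relationship here is pure chance. Anyone who tests enough pairs of variables and reports only the "hits" will always have a story, whether it is true or not.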

So when reviewing my statement ‘In data we generally can’t observe the things we want to measure’, I do not argue that we can’t observe anything in data sets. I’d like to point out that one needs to be careful when analyzing data, and has to keep several pitfalls in mind before drawing hard conclusions from numbers. It is important to be critical of one’s own analysis and to search for valid statistical evidence, to ensure that true stories are told.



6 thoughts on “In data we generally can’t observe the things we want to measure”

  1. Really interesting article. You point out the risk of false correlation, which Simon Rogers also mentions in his video. Just because two events occur at the same time, one doesn’t have to cause the other. I have another example for you that makes it all a little easier to understand. In the summer people eat more ice cream and in the summer more people drown, so the more ice cream people eat, the more people drown? No, of course not! Often there is a confounding factor, like good weather. I found a nice and funny news article which suits your blog perfectly:

    I think you already have the feeling that I agree with you, and I really do. I think people have to use their intuition and not just believe any correlation. Sometimes it’s obvious that the correlation is wrong, just like in my example or the example from the Daily Mail. Again, really interesting article!


  2. I really like the example you have given about the abuse of women and the relationship the journalists tried to reveal with certain types of households. It is a very good example of data journalism gone wrong.
    However, your statement is that in data we generally cannot observe the things we want to measure. I partially agree with you: not all data has the potential to measure things like relationships and causes. But… isn’t it one of the main goals of journalism to inform society about events as they occur? I think it is not always necessary to find a causal link or a relationship in order to write a news article about something that is happening. Most data visualizations that combine data try to make a topic more visual, so that readers can form a better picture of it in their heads. Isn’t that enough news on its own?
    Of course, I also think that encouraging wrong interpretations or misusing statistics is harmful.


  3. I agree with the statement that one needs to be careful when analyzing data, and has to keep several pitfalls in mind before drawing hard conclusions from numbers. I think that a lot of journalists nowadays can be very selective in which ‘facts’ they decide to extract from a study. For example, this article:,9171,2017200,00.html. It refers to a study in which non-drinkers had a higher mortality rate than heavy drinkers. A journalist may write an article called ‘Heavy drinkers outlive non-drinkers’ without critically analyzing the whole data set. So I think that the danger lies in this selective data analysis by journalists.

