The phenomena of big data and data journalism have grown rapidly. Free data sets and efficient tools for data analysis are becoming more and more available. Besides (multinational) businesses and governments, news organisations are increasingly aware of these data sets and the possibilities they offer. According to Aitamurto, Sirkkunen and Lehtonen (2011), journalists search through large collections of data and use statistical methods, visualizations, and interactive tools to find and create news. But what is ‘news’?
According to Harcup and O’Neill (2001), a story requires one or more specific news values in order to become news. They identified ten values that can make an item newsworthy. In short, stories involving influential or famous people, entertainment, an element of surprise, or great relevance to society are the most likely to become news. In addition, every news organisation has its own agenda and may run stories to satisfy a particular need or demand. For data journalism, news is mostly about numbers and the hidden stories they might contain.
Although numbers don’t lie, data analysis comes with certain risks. Data can be manipulated, misinterpreted or even misused. If journalists misinterpret data, the resulting news might mislead readers and paint a distorted picture of reality. In this article we’ll look at the risks of using data sets to create news from numbers, and at what one should bear in mind when using data to tell a story.
When women don’t get married, they’re screwed.
This seems rather radical, yet it is a conclusion journalists have drawn. Paul Bradshaw, an online journalist at Birmingham City University, argues that data journalism can start in two ways: ‘there is a question that needs data, or there is a dataset that needs questioning’. But despite the growth of big data from public services, businesses and governmental organisations, not all data is of journalistic value or contains a newsworthy story. It is of great importance to analyse the available data thoroughly and to determine whether the information contains news or might support a story.
An article The Washington Post published last year shows how easily open data can be misinterpreted or misused. The Washington Post published an article about violence against women, and in particular about which women are more likely to become victims of assault or abuse. According to the journalists, the data showed that married women are safer than unmarried women, and that girls raised by their own (married) father are less likely to be abused or assaulted than girls raised without him. The claims were based on a graph published in 2012 by the United States Department of Justice. According to Shannon Catalano, a statistician at the Bureau of Justice Statistics and the author of the study the article relied on, her data was presented without sufficient context.
Many more factors are associated with violence against women. The Washington Post used data on a single variable, household composition, and drew its conclusions from that, telling readers a story that misinterpreted the original data. One could say the available data was misused in order to create a news article. This is where my statement ‘in data we generally can’t observe the things we want to measure’ comes in. It is very rare that specific questions can be answered directly from the observations in a data set. When we search for newsworthy stories in data, we usually look for information that can be related to the question we want to answer. This may mean searching multiple sources and data sets before we eventually find a newsworthy story.
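A minimal sketch of why a single-variable comparison can mislead: the counts below are entirely invented for illustration (they are not from the BJS study), but they show how an aggregate gap between two groups can be driven purely by a confounding variable, while the groups are identical once that confounder is held fixed.

```python
# Hypothetical counts: (group, stratum) -> (victims, population).
# All numbers are invented to illustrate confounding, nothing more.
counts = {
    ("two_parent", "low_income"):  (10, 100),
    ("two_parent", "high_income"): (18, 900),
    ("other",      "low_income"):  (90, 900),
    ("other",      "high_income"): (2,  100),
}

def rate(victims, population):
    return victims / population

# Single-variable view: aggregate victimization rate per household type.
for household in ("two_parent", "other"):
    v = sum(counts[(household, s)][0] for s in ("low_income", "high_income"))
    p = sum(counts[(household, s)][1] for s in ("low_income", "high_income"))
    print(household, rate(v, p))   # 0.028 vs 0.092: "other" looks 3x riskier

# Stratified view: within each income stratum the two household types
# have identical rates, so the aggregate gap is caused entirely by the
# confounder (income), not by household composition.
for stratum in ("low_income", "high_income"):
    print(stratum,
          rate(*counts[("two_parent", stratum)]),
          rate(*counts[("other", stratum)]))
```

The aggregate rates differ by a factor of three, yet within each stratum the rates are equal, which is exactly the kind of trap a one-variable analysis cannot detect.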
OK, let’s combine multiple data sets then…
Another way to search for hidden stories is to combine multiple data sets into one news story. Combining multiple data sets and visualizing them together has been done before, for instance by Simon Rogers, former editor at The Guardian. In 2012 The Guardian combined data on firearm homicide rates per 100,000 people and the percentage of homicides committed with a firearm, and produced a map showing the average number of firearms per 100 people around the world.
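In practice such a mashup usually means joining country-level tables on a shared key. A hedged sketch with pandas, using small made-up excerpts (the figures and country selections are illustrative, not The Guardian's actual data):

```python
import pandas as pd

# Two hypothetical country-level data sets sharing a "country" key.
ownership = pd.DataFrame({
    "country": ["US", "UK", "CH"],
    "firearms_per_100": [88.8, 6.2, 45.7],
})
homicides = pd.DataFrame({
    "country": ["US", "UK", "BR"],
    "firearm_homicides_per_100k": [3.0, 0.1, 18.1],
})

# An inner join keeps only countries present in BOTH sets; rows that fail
# to match are silently dropped, a common source of unnoticed data loss
# when mashing up sources that cover different country lists.
combined = ownership.merge(homicides, on="country", how="inner")
print(combined)
```

Here CH and BR vanish from the result because each appears in only one table; an outer join with `indicator=True` is one way to audit which rows failed to match before publishing.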
Although mashing up different data sets is a powerful tool for data and news processing, it carries risks as well. In data journalism it is often easy to visualize and show what is going on in a specific event, but it is much harder to find out why something is happening. Looking at The Guardian’s homicide/firearms map, we may be inclined to think that a larger number of firearms causes more homicides, but this is not necessarily true. To draw such conclusions, one would need statistical methods to test the relationships behind the patterns found in the data.
True relationship or merely a coincidental artifact?
This leads us to the hardest part of data journalism: finding a true relationship. When analyzing data sets, the main difficulty is finding patterns that actually represent a real relationship. According to Harford (2014), establishing causation in big data is much more difficult than finding correlations, and sometimes even impossible. It can also lead to multiple-comparison problems, where journalists examine as many patterns as possible and compute every correlation they can find. The main problem is that if you do not know what lies behind a correlation, you have no idea what caused it in the first place. As in The Guardian’s homicide/firearms example, it is easy to guess from the data that more firearms cause more homicides. But if you merely guess at what caused a certain correlation, you risk a distorted view of the relationships that actually exist in the data.
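The multiple-comparison problem is easy to demonstrate: among many pairs of completely independent random series, some correlations will look "strong" by chance alone. A small simulation (the variable counts and the 0.5 threshold are arbitrary choices for illustration):

```python
import numpy as np

# 40 unrelated random variables, 20 observations each. By construction
# there is NO real relationship between any pair of them.
rng = np.random.default_rng(42)
n_vars, n_obs = 40, 20
data = rng.standard_normal((n_vars, n_obs))

corr = np.corrcoef(data)                    # 40x40 pairwise correlation matrix
upper = corr[np.triu_indices(n_vars, k=1)]  # the 780 unique pairs
spurious = int(np.sum(np.abs(upper) > 0.5))

print(f"{len(upper)} pairs examined, {spurious} with |r| > 0.5 by chance alone")
```

With hundreds of pairs examined, some "impressive" correlations are statistically guaranteed to surface even though every variable is pure noise, which is why a pattern found by scanning everything is not evidence by itself.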
So, reviewing my statement ‘in data we generally can’t observe the things we want to measure’, I do not argue that we can’t observe anything in data sets. I’d rather point out that one needs to be careful when analyzing data, and has to keep several pitfalls in mind before drawing hard conclusions from numbers. It is important to be critical of one’s own analysis, and to seek valid statistical evidence to ensure that the stories told are true.
- Aitamurto, T., Sirkkunen, E., & Lehtonen, P. (2011). Trends in data journalism. Espoo: VTT.
- Fivethirtyeight. (2014). The Washington Post Misused the Data on Violence Against Women.
- Harcup, T., & O’Neill, D. (2001). What is news? Galtung and Ruge revisited. Journalism Studies, 2(2), 261-280.
- Harford, T. (2014). Big data: A big mistake?. Significance, 11(5), 14-19.
- Online journalism blog. (2011). The inverted pyramid of data journalism.
- The Guardian. (2012). Gun ownership homicides map.
- The Washington Post. (2014). The best way to end violence against women? Stop taking lovers and get married.