5 Types of Bias in Data & Analytics

5 January 2017

Article written by Remco Weijers and Sebastiaan de Vries

“The average is nobody”

There has never been so much data available for making decisions. According to EMC's ‘Digital Universe’ report from 2014, the quantity of data will increase tenfold between 2013 and 2020. As a result, decisions are increasingly expected to be based on insights, supported by adequate data, analytics, and research methods and techniques. This is a positive development, but it also brings a number of pitfalls. Decisions are not guaranteed to be successful just because they are based on data. Results from data and analysis can be deliberately or unwittingly misinterpreted, so that the outcome of an analysis is mistakenly treated as truth. Decisions made on the basis of such a ‘truth’ can subsequently turn out to be incorrect.

The chief cause of wrong decisions is what we call ‘bias’. Bias means that the outcomes of research are distorted by predetermined ideas, prejudice or influence in a certain direction. Data can be biased, but so can the people who analyse the data. When data is biased, we mean that the sample is not representative of the entire population; for example, drawing conclusions about the entire population of the Netherlands based on research into 10 students (the sample). When the people who analyse data are biased, they want, in advance, the outcomes of their analysis to point in a certain direction.

We have set out the 5 most common types of bias:

1. Confirmation bias
This occurs when the person performing the data analysis wants to prove a predetermined assumption. They then keep looking in the data until this assumption can be proven, e.g. by intentionally excluding particular variables from the analysis. This often occurs when data analysts are briefed in advance to support a particular conclusion.
It is therefore advisable to not doggedly set out to prove a predefined conclusion, but rather to test presumed hypotheses in a targeted way.

2. Selection bias
This occurs when data is selected subjectively. As a result, the sample used is not a good reflection of the population. This error is often made in surveys. Selection bias is also frequent in customer panels: the customers that you (easily) find willing to participate in a customer panel are far from being “average customers”.
This too can be done deliberately or unwittingly. Just look at opinion polls in elections: can it really be true that so many voters completely change their mind on the last day, or is it more likely that the sample on which the poll is based is not a good reflection of all the voters?
So you should always ask what sort of sample has been used for research. Avoid false extrapolation and make sure the results are applicable to the entire population.
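The effect of a non-representative sample can be illustrated with a small simulation. The sketch below is only an illustration under made-up assumptions: the population, the age range and the sample sizes are all hypothetical. It compares a sample of 10 "students" with a random sample drawn from everyone.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the illustration is repeatable

# Hypothetical population: 10,000 people with ages spread evenly from 18 to 90.
population = [random.randint(18, 90) for _ in range(10_000)]

# Biased sample: 10 people drawn only from the 18-25 age group ("students").
students = [age for age in population if age <= 25]
biased_sample = random.sample(students, 10)

# Representative sample: 500 people drawn at random from the whole population.
random_sample = random.sample(population, 500)

population_mean = mean(population)
biased_mean = mean(biased_sample)
random_mean = mean(random_sample)

print(f"population mean age:    {population_mean:.1f}")
print(f"student sample mean:    {biased_mean:.1f}")
print(f"random sample mean age: {random_mean:.1f}")
```

The student sample can never produce an average age above 25, whatever the real population looks like, while the random sample lands close to the population value.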

3. Outliers
An outlier is an extreme data value, e.g. a customer with an age of 110 years or a consumer with €10 million in their savings account. You can spot outliers by inspecting the data closely, particularly the distribution of values: look for values that are much higher or much lower than the range in which almost all the other values fall. Outliers can make it a dangerous business to base a decision on the “average”. Just think: a single customer with extreme spending habits can have a huge effect on the average profit per customer. If someone presents you with average values, you should check whether they have been corrected for outliers, for example by basing the conclusions on the median (the middle value).
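A minimal sketch of how one outlier shifts the average but barely moves the median. The profit figures are invented for the illustration and do not come from real customer data.

```python
from statistics import mean, median

# Hypothetical profit per customer in euros; the last value is an extreme outlier.
profits = [40, 45, 50, 55, 60, 5000]

def summarize(values):
    """Return the mean and the median of a list of numbers."""
    return mean(values), median(values)

avg, mid = summarize(profits)                        # with the outlier
avg_without, mid_without = summarize(profits[:-1])   # outlier left out

print(f"with outlier:    mean = {avg}, median = {mid}")
print(f"without outlier: mean = {avg_without}, median = {mid_without}")
```

With the outlier included the mean is 875 while the median is 52.5; without it both are 50. The median describes the typical customer in both cases, the mean only in the second.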

4. Overfitting and underfitting
Underfitting means that a model gives an oversimplified picture of reality. Overfitting is the opposite: the model is overcomplicated. An overfitted model risks treating a pattern that only exists by chance in the analysed data as the truth, whereas in practice it does not hold. Always ask the data analyst what he or she has done to validate the model, and in particular whether separate training and test samples were used. If the analyst looks at you with a rather glazed expression, or the answer is no, there is a good chance that the outcomes of the analysis have not been validated and therefore might not apply to the whole customer base.
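What validating on a test sample looks like can be sketched as follows. This is an illustration under stated assumptions, not a recommended modelling approach: the data is simulated, and the three "models" (`fit_mean`, `fit_line`, `fit_memorize`) are hypothetical stand-ins for an underfitted, a reasonable and an overfitted model.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the illustration is repeatable

# Hypothetical data: y depends linearly on x, plus random noise.
xs = [random.uniform(0, 10) for _ in range(60)]
data = [(x, 2 * x + random.gauss(0, 1)) for x in xs]

# Hold out part of the data: the models never see the test sample while fitting.
train, test = data[:40], data[40:]

def fit_mean(rows):
    """Underfit: ignore x and always predict the average y."""
    avg = mean(y for _, y in rows)
    return lambda x: avg

def fit_line(rows):
    """Reasonable fit: least-squares straight line."""
    mx = mean(x for x, _ in rows)
    my = mean(y for _, y in rows)
    slope = (sum((x - mx) * (y - my) for x, y in rows)
             / sum((x - mx) ** 2 for x, _ in rows))
    return lambda x: my + slope * (x - mx)

def fit_memorize(rows):
    """Overfit: repeat the y of the closest training point, noise included."""
    return lambda x: min(rows, key=lambda row: abs(row[0] - x))[1]

def mse(model, rows):
    """Mean squared prediction error of a model on a set of (x, y) rows."""
    return mean((model(x) - y) ** 2 for x, y in rows)

results = {}
for name, fit in [("underfit", fit_mean), ("line", fit_line), ("overfit", fit_memorize)]:
    model = fit(train)
    results[name] = (mse(model, train), mse(model, test))
    print(f"{name:9s} train error = {results[name][0]:.2f}, "
          f"test error = {results[name][1]:.2f}")
```

The memorizing model has zero error on the training sample, which looks perfect until it is checked against the test sample. The gap between training error and test error is the signal to look for when asking how a model was validated.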

5. Confounding variables
If the research results show that when more ice creams are sold more people drown, ask whether they have checked for what are known as confounding variables. In this case, the confounding variable will be the temperature. If the weather is hotter, people will eat more ice cream and more people will go swimming. This is likely to result in more drownings than on a cold day.
A confounding variable is therefore a variable that is outside the scope of the existing analytical model but that does influence both the explanatory variable (in this case, ice cream sales) and the dependent variable (the number of drownings). Failing to allow for confounding variables can result in assuming there is a cause-effect relationship between two variables when there is in fact another variable behind the phenomenon. Bear in mind that correlation is not the same thing as causation. If a relationship between attributes is identified, this can be very helpful when you want to select the right customers for a particular campaign. But to prove that one leads to the other, it is advisable to test it in controlled A/B tests.
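The ice cream example can be reproduced in a small simulation. The sketch below rests on invented assumptions: the temperatures, sales figures and drowning counts are simulated, and temperature is the only thing driving both. It compares the raw correlation with the correlation that remains once the effect of temperature has been removed from both variables.

```python
import random
from statistics import mean

random.seed(7)  # fixed seed so the illustration is repeatable

def pearson(a, b):
    """Pearson correlation coefficient between two equally long lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def residuals(y, x):
    """The part of y that a straight-line fit on x does not explain."""
    mx, my = mean(x), mean(y)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return [yi - (my + slope * (xi - mx)) for xi, yi in zip(x, y)]

# Hypothetical year of daily data: temperature drives both other variables,
# and ice cream sales have no direct effect on drownings.
temperature = [random.uniform(0, 30) for _ in range(365)]
ice_cream = [10 * t + random.gauss(0, 20) for t in temperature]
drownings = [0.2 * t + random.gauss(0, 1) for t in temperature]

raw_corr = pearson(ice_cream, drownings)
adjusted_corr = pearson(residuals(ice_cream, temperature),
                        residuals(drownings, temperature))

print(f"ice cream vs drownings:           {raw_corr:.2f}")
print(f"same, after removing temperature: {adjusted_corr:.2f}")
```

The raw correlation is strong even though neither variable influences the other. Once temperature is accounted for, little of the relationship is left.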

 


 

It is crucial to confirm that the conclusion drawn from the outcomes of research and analysis is not influenced by bias. This is not solely the responsibility of the analyst in question. It is the shared responsibility of everyone directly involved (including the marketeer and the analyst) to reach a valid verdict on the basis of the correct data. In a world of marketing where data and analysis are playing an increasingly large part, you need to be able to rely on the correct facts. A fact is not a fact until it has been adequately proven. Or, as we hear often (or too often): “There are three kinds of lies: lies, damned lies, and statistics.”

