Eliminating data bias: why it matters and how to avoid it

Introduction

In data science and statistics, results are only as good as the data you use. This matters even more in machine learning and artificial intelligence: with these ‘black box’ algorithms, it is often difficult or impossible to trace how a prediction or result was produced. In short, if the data used to train a model is biased, any inferences drawn from that model will be biased too. In business terms, this can lead to costly and, in some cases, dangerous decisions. In this article, I will discuss why bias matters, give examples of different types of bias, and share tips for avoiding bias in your data.

What is data bias?

In statistics, bias is a systematic tendency for results to differ from the true value. This is best explained through examples. A prime real-world example is predictive policing, in which algorithms analyse arrest data to predict and prevent future crimes. Based on historical data, potential crime hotspots are identified and additional officer patrols are sent to these areas. More patrols lead to more arrests in those areas, which reinforces the conclusion of the algorithm that these areas require additional policing. The result is a feedback loop that leaves some areas with a very high police presence and others with none at all. Conclusions based on biased data can be misleading, which can affect business decisions, profits and progress. However, simple checks and a good understanding of your data can reduce bias and improve confidence in decisions.
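
To make the feedback loop concrete, here is a minimal Python sketch of a toy model. Everything in it is an assumption made for illustration, not a description of any real policing system: the function name simulate_patrols, the two areas, the 80% share of patrols sent to the current hotspot, and the idea that recorded arrests scale with the number of patrols. Both areas are given the same underlying crime rate, so any gap in recorded arrests comes purely from where the patrols are sent.

```python
def simulate_patrols(initial_arrests, true_crime_rate, rounds,
                     total_patrols=100, hotspot_share=0.8):
    """Toy model: the area with the most recorded arrests gets most patrols.

    All parameters are hypothetical and chosen only for illustration.
    Returns the cumulative recorded arrests per area after each round.
    """
    n = len(initial_arrests)
    if n < 2:
        raise ValueError("the toy model needs at least two areas")

    arrests = list(initial_arrests)
    history = []
    for _ in range(rounds):
        # The "algorithm": pick the area with the most recorded arrests so far.
        hotspot = max(range(n), key=lambda i: arrests[i])

        # The hotspot gets hotspot_share of patrols; the rest is split evenly.
        other_patrols = total_patrols * (1 - hotspot_share) / (n - 1)
        patrols = [
            total_patrols * hotspot_share if i == hotspot else other_patrols
            for i in range(n)
        ]

        # Recorded arrests depend on how many patrols are present,
        # not only on the underlying crime rate.
        for i in range(n):
            arrests[i] += patrols[i] * true_crime_rate[i]
        history.append(list(arrests))
    return history


# Two areas with identical true crime rates and a tiny initial difference.
history = simulate_patrols([11, 10], [0.1, 0.1], rounds=5)
for round_number, totals in enumerate(history, start=1):
    print(round_number, [round(t, 1) for t in totals])
```

In this sketch the first area starts with a single extra recorded arrest, is labelled the hotspot in every round, and ends up with far more recorded arrests than the second area, even though the two are identical by construction.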

Three types of data bias

This article focuses on three important types of bias to look out for in your analyses.

1. Selection bias - an error in choosing who or what is included in a study.
Example: Below are two subsets taken from the same population. The image on the left shows a good representation of each fruit type; the one on the right completely misses a category.


This may happen unconsciously or by accident. The best way to avoid it is to ensure that the sample size is appropriate for what you are doing, and to select cases using randomisation (e.g. using a random number generator to allocate participants to each condition).
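
As a rough illustration of randomised selection, the Python sketch below compares a convenience sample with a simple random sample and a stratified sample. The fruit population, the sample sizes and the helper names (simple_random_sample, stratified_sample, assign_conditions, missing_categories) are all hypothetical and chosen only for this example; stratified sampling is one option that goes beyond the plain randomisation described above.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Draw k items without replacement, each equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, k)


def stratified_sample(population, per_group, seed=None):
    """Draw up to per_group items at random from every category."""
    rng = random.Random(seed)
    groups = {}
    for item in population:
        groups.setdefault(item, []).append(item)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


def assign_conditions(participants, conditions, seed=None):
    """Shuffle participants, then deal them out to the conditions in turn."""
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    return {p: conditions[i % len(conditions)] for i, p in enumerate(shuffled)}


def missing_categories(population, sample):
    """Return the categories present in the population but not the sample."""
    return set(population) - set(sample)


# Hypothetical population: each item is simply its category label.
population = ["apple"] * 50 + ["banana"] * 30 + ["cherry"] * 5

convenience = population[:40]  # "first 40 rows": apples only
random_draw = simple_random_sample(population, 40, seed=1)
stratified = stratified_sample(population, per_group=5, seed=1)

print("convenience misses:", missing_categories(population, convenience))
print("random draw misses:", missing_categories(population, random_draw))
print("stratified misses: ", missing_categories(population, stratified))

# Random allocation of ten hypothetical participants to two conditions.
print(assign_conditions(range(10), ["control", "treatment"], seed=0))
```

Note that a simple random sample can still miss a rare category by chance, which is why sample size matters. Sampling within each category guarantees that every category appears, at the cost of needing to know the categories in advance.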

2. Observer bias - the tendency to see what we expect, or want, to see. This is often more common with qualitative data, and can be reduced by blinding, where the person recording or assessing the results does not know which group or condition each case belongs to.

3. Outlier bias - caused by outliers, i.e. observations that differ greatly from the rest of the data. Consider the graph below, in which the observation highlighted in red has a much larger value than the rest.

The best fit line is very different with and without this point. If the point turns out to be an error, it may be best to remove it; otherwise, consider using summary statistics such as the median instead of the mean, as the median is less affected by outliers.
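
The effect of a single outlier can be checked numerically. The Python sketch below uses made-up data (six points, the last of which is an assumed outlier) and a hypothetical helper, fit_line, that computes an ordinary least-squares line by hand. It is only an illustration of the idea, not a recommendation for how to fit models in practice.

```python
from statistics import mean, median


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Made-up data: the first five points lie exactly on y = 2x,
# the last point is the outlier.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 60]

slope_all, intercept_all = fit_line(xs, ys)
slope_clean, intercept_clean = fit_line(xs[:-1], ys[:-1])

print("slope with outlier:   ", round(slope_all, 2))
print("slope without outlier:", round(slope_clean, 2))

# The mean moves a lot when the outlier is included; the median barely moves.
print("mean with / without:  ", mean(ys), mean(ys[:-1]))
print("median with / without:", median(ys), median(ys[:-1]))
```

With these numbers, one point is enough to more than quadruple the fitted slope and to pull the mean from 6 to 15, while the median only shifts from 6 to 7.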

General tips for avoiding data bias

  1. Big data isn't always best - you may be tempted to use all the data available, but in most cases a dataset that has been quality controlled and thoroughly vetted is the best way to avoid bias, particularly from outliers.
  2. Information is gold - the more you know about the data you are using, the better equipped you are to identify where biases may be introduced and how to remove them. For example, was historical data converted from manual to digital recording at some point? Simple mistakes, such as an age typed as 119 instead of 19, could massively skew results and affect conclusions.
  3. Population appropriate - ensure that the population you are drawing conclusions about is appropriate. It is not always possible to get all the data you want, but that is OK as long as you adjust your hypotheses and conclusions accordingly. Say you are a dog food brand testing your new treats at the local guide dog charity. It is great that the hard-working dogs (usually large breeds such as labradors or golden retrievers) are getting a well-earned treat. But if 9/10 guide dogs would recommend the treat, it doesn’t mean that it would go down just as well with a toy poodle.
  4. Don’t fall into the mean trap - don’t just look at averages, as these can be skewed by outliers. Look at the spread of the data and understand it fully: what is the median, is the distribution of values symmetrical, and what is the range of values? An in-depth understanding of your data can help you to greatly reduce bias.
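
Several of these checks take only a few lines of code. The Python sketch below is one possible way to do them, using a made-up list of ages that includes the mistyped value of 119. The helper names (summarise, out_of_range, iqr_outliers), the plausible age range of 0 to 110 and the 1.5 x IQR rule are assumptions for illustration; sensible limits depend on your own data.

```python
from statistics import mean, median, quantiles


def summarise(values):
    """Basic summary: range, mean, median and quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "q1": q1,
        "q3": q3,
    }


def out_of_range(values, low, high):
    """Return (position, value) pairs outside an assumed plausible range."""
    return [(i, v) for i, v in enumerate(values) if not low <= v <= high]


def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]


# Made-up ages; the last one is the kind of typo described above.
ages = [19, 21, 22, 23, 24, 25, 26, 119]

summary = summarise(ages)
print(summary)

# A mean well above the median hints at a skewed distribution.
print("mean - median:", summary["mean"] - summary["median"])

# Domain knowledge: ages above 110 are assumed to be recording errors.
print("implausible:", out_of_range(ages, low=0, high=110))
print("IQR outliers:", iqr_outliers(ages))
```

Here the mean is 34.875 while the median is 23.5, and both the range check and the IQR rule point at the value 119. Whether a flagged value is corrected, removed or kept is a judgement that depends on what you know about how the data was recorded.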

Summary

As we have seen, bias in data can affect results and the conclusions drawn from them, which in turn affects decision making. Understanding the types of bias and how they arise can help to eliminate, or at least minimise, their impact. This article considered three major sources: selection bias, observer bias and outlier bias. The simple tips given here are a quick and easy first step towards addressing bias.