As a statistician by trade, I have worked in many jobs advising people on which methods to use, for what data, and how. This has always been one of my favourite things to do, but over the last ten years a few themes have kept recurring. In this article, I will summarise what are, in my opinion, some of the most frequently made mistakes in data analysis, as a bit of food for thought. In the interest of not rambling on, I have limited these to the big five, outlined below.
- Defining variables - The first step in any regression analysis is identifying the independent and dependent variables. The independent variable is the presumed cause, and its value does not depend on the other variables considered. The dependent variable is the effect, whose value depends on changes in the independent variable.
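As a minimal sketch of this setup, the hypothetical example below treats temperature as the independent variable and heating cost as the dependent variable, and fits a simple linear regression with NumPy (the variable names and numbers are invented for illustration):

```python
import numpy as np

# Hypothetical data: temperature (independent) drives heating cost (dependent).
rng = np.random.default_rng(0)
temperature = rng.uniform(-5, 25, size=100)  # independent variable (the cause)
heating_cost = 50 - 1.5 * temperature + rng.normal(0, 2, size=100)  # dependent (the effect)

# Fit a simple linear regression: heating_cost = intercept + slope * temperature
slope, intercept = np.polyfit(temperature, heating_cost, deg=1)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

Getting the two roles the wrong way round here would give a fitted line that answers the wrong question, which is exactly why this step comes first.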
- Correlation vs. causation - Just because two things are correlated does not mean that one causes the other. Here, common sense can be your best friend. In the U.K., cold weather is a great example of this. Typically, people spend more money when it is cold and less when it is hot. However, Christmas falls in winter, bringing presents to buy, Black Friday and the new year sales, which seems a far more plausible explanation for the increased spending in winter.
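The winter-spending story can be simulated in a few lines. In this made-up example, a confounder (whether the month is in winter) drives both cold weather and spending, producing a strong correlation between temperature and spending even though neither causes the other; conditioning on the season makes the correlation vanish:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical monthly data: 'winter' is the confounder driving both variables.
winter = rng.integers(0, 2, size=500)                        # 1 = winter month
temperature = 15 - 10 * winter + rng.normal(0, 2, size=500)  # colder in winter
spending = 100 + 40 * winter + rng.normal(0, 5, size=500)    # Christmas/sales boost spending

# Temperature and spending look strongly (negatively) correlated...
r_overall = np.corrcoef(temperature, spending)[0, 1]
# ...but within winter months alone the correlation disappears,
# because temperature never caused the spending in the first place.
r_winter = np.corrcoef(temperature[winter == 1], spending[winter == 1])[0, 1]
```

Here `r_overall` comes out strongly negative while `r_winter` sits near zero, which is the signature of a lurking third variable rather than a causal link.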
- No findings is a finding - Sometimes there are simply no clear findings or conclusions to be drawn from data, particularly real-world data, which is often messy, with missing values and plenty of errors. It is completely OK to say that you have not found anything; this is still a finding in itself, and reporting it may well save someone else from spending lots of time on the same analyses. Sometimes, after hours of increasingly complex statistical methods, and perhaps removing lots of data or variables, there is just nothing to find.
- Overuse of p-values - A p-value is often reported as the result of a hypothesis test, and is defined in statistical terms as the probability, under a given statistical model in which the null hypothesis is true, that the statistical summary would be equal to or more extreme than the actual observed result. The null hypothesis is typically what you are attempting to disprove with data, and is often that one variable does not impact another. Typically, the smaller the p-value, the greater the evidence for rejecting the null hypothesis. A p-value takes values in the range 0-1, and a 'significant' p-value is conventionally taken to be less than 0.05. Note that this does not mean there is a 95% chance that the null hypothesis is false; it means that, if the null hypothesis were true, a result this extreme would occur less than 5% of the time. This is still a simplification, however the focus here is the overuse. If 20 independent tests are conducted at the 0.05 significance level and all null hypotheses are true, there is a 64.2% chance of obtaining at least one false positive, and the expected number of false positives is 1 (i.e. 0.05 × 20).
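The multiple-testing arithmetic above is easy to verify, both analytically and by simulation (under a true null hypothesis, p-values are uniformly distributed, so we can draw them directly):

```python
import numpy as np

alpha, n_tests = 0.05, 20

# Analytical: chance of at least one false positive across 20 independent tests.
p_at_least_one = 1 - (1 - alpha) ** n_tests   # ≈ 0.642
expected_false_positives = alpha * n_tests    # 1.0

# Simulation: under the null, p-values are Uniform(0, 1).
rng = np.random.default_rng(2)
p_values = rng.uniform(size=(100_000, n_tests))
simulated = (p_values < alpha).any(axis=1).mean()
```

The simulated frequency lands close to the 64.2% figure, which is why running many unadjusted tests and reporting whichever is "significant" is such a reliable way to fool yourself.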
- Quality control - Quality controlling data is arguably the most important step of any data analysis, particularly when working with real-world data. Mistakes are often made, by both humans and machines, so thoroughly checking data for outliers and errors is essential, as these errors can have drastic impacts on conclusions.
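A simple plausibility check often catches the worst of these errors before they distort anything. The toy example below (invented ages with two obvious data-entry mistakes) shows how much a couple of bad values can shift a summary statistic:

```python
import numpy as np

# Hypothetical column of ages containing two data-entry errors.
ages = np.array([23, 35, 41, 29, 350, 38, -2, 44, 31, 27])

# Range check: flag values outside a plausible human age range.
plausible = (ages >= 0) & (ages <= 120)
errors = ages[~plausible]   # the impossible values: 350 and -2
clean = ages[plausible]

# Two bad entries nearly double the mean age.
mean_raw = ages.mean()      # 61.6
mean_clean = clean.mean()   # 33.5
```

Even this crude check changes the mean from about 62 to about 34, a drastic difference from just two rows, which is exactly the kind of impact on conclusions described above.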