Everyone who has ever done a study knows about outliers. While many of the values that you collect cluster relatively close to the mean, you find a few values that look way off.
Say you’re measuring the performance of participants on a cognitive task. The mean score is 12 and the standard deviation is 3. But one person scored 37, much higher than anyone else. This value is probably an outlier.
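To put a number on "much higher", one option is to express the score as a distance from the mean in standard deviations. This is only a quick illustration using the figures from the example; the variable names are made up for the sketch.

```python
# Figures from the example above: mean 12, standard deviation 3, one score of 37.
mean_score = 12
sd = 3
score = 37

# Distance from the mean, measured in standard deviations.
distance_in_sd = (score - mean_score) / sd
print(round(distance_in_sd, 2))  # 8.33
```

A score more than eight standard deviations above the mean would be extraordinarily rare in a normally distributed variable, which is why it stands out.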
There really isn’t anything abnormal about finding outliers in your data set. Within any normal distribution, there are extreme values found at the tail ends. While it is unlikely you will sample those values, it isn’t impossible either.
Outliers can also indicate faulty data or methods. In PCR tests, for example, abnormal data points can indicate human or technical errors.
While there are dedicated tests for detecting outliers, we will keep this section short and sweet. Using simple mathematics, we can identify mild and extreme outliers. You can use these calculations to confirm that a particularly high or low value is an outlier before excluding it from further calculations. After all, outliers can skew data, especially when your sample size is small.
First, you must calculate quartiles. Order all of your data points from highest to lowest. Your second quartile (Q2) is equivalent to the median of your data set. Then you can take a look at the values above and below this second quartile.
The median of the lower half is the first quartile (Q1) and the median of the upper half is the third quartile (Q3). Your interquartile range (IQR) is the difference between the third and first quartiles: IQR = Q3 - Q1.
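The quartile steps above can be sketched in a few lines of Python. The function name `quartiles` and the sample scores are invented for illustration. One assumption goes beyond the text: when the number of data points is odd, the middle value is left out of both halves. Other conventions exist, and statistical packages may give slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)  # sort direction does not affect the medians
    n = len(data)
    if n < 2:
        raise ValueError("need at least two data points")
    half = n // 2
    lower = data[:half]
    # Assumption: with an odd count, the middle value belongs to neither half.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)

# Made-up scores for illustration only.
scores = [7, 9, 10, 11, 12, 12, 13, 14, 15, 37]
q1, q2, q3 = quartiles(scores)
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 10 12.0 14 4
```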
A value is a mild outlier if it is below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. For a normally distributed variable, approximately 1 in 150 observations will fall outside these limits.
A value is an extreme outlier if it is below Q1 - 3 * IQR or above Q3 + 3 * IQR. These are far less common, occurring roughly once in every 425,000 observations for a normally distributed variable.
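The two rules can be combined into one small check. This is a minimal sketch: the function name `classify` and the quartile values are hypothetical, chosen only to show the thresholds at work.

```python
def classify(value, q1, q3):
    """Label a value 'extreme', 'mild', or 'none' using the IQR limits."""
    iqr = q3 - q1
    # Check the wider (3 * IQR) limits first, since extreme outliers
    # also lie beyond the 1.5 * IQR limits.
    if value < q1 - 3 * iqr or value > q3 + 3 * iqr:
        return "extreme"
    if value < q1 - 1.5 * iqr or value > q3 + 1.5 * iqr:
        return "mild"
    return "none"

# Hypothetical quartiles: Q1 = 10, Q3 = 14, so IQR = 4.
# Mild limits are 4 and 20; extreme limits are -2 and 26.
q1, q3 = 10, 14
labels = {v: classify(v, q1, q3) for v in (12, 3, 22, 37)}
print(labels)  # {12: 'none', 3: 'mild', 22: 'mild', 37: 'extreme'}
```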
These outlier detection formulas can fail for traits that don’t fit the normal distribution. If you are finding a lot of extreme outliers, it is important to check whether the variable you are measuring is actually normally distributed. Plenty of distributions have fatter tails, and for those, simply using the interquartile range won’t be enough to detect outliers.
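To see how much the tails matter, the expected share of flagged values can be worked out directly for two distributions. The sketch below uses the standard normal and, as an illustrative fat-tailed case not mentioned in the text, the standard Cauchy distribution (whose quartiles are -1 and 1). The function names are made up for this example.

```python
from math import atan, pi
from statistics import NormalDist

def normal_rate(k):
    """Share of a normal variable below Q1 - k*IQR or above Q3 + k*IQR."""
    z = NormalDist()
    q3 = z.inv_cdf(0.75)          # about 0.674; Q1 is -q3 by symmetry
    limit = q3 + k * (2 * q3)     # IQR = Q3 - Q1 = 2 * q3
    return 2 * z.cdf(-limit)      # both tails

def cauchy_rate(k):
    """Same share for a standard Cauchy variable (Q1 = -1, Q3 = 1, IQR = 2)."""
    limit = 1 + k * 2
    return 1 - (2 / pi) * atan(limit)  # both tails

for k in (1.5, 3):
    print(k, normal_rate(k), cauchy_rate(k))
```

Under the normal distribution the two rates come out at about 0.7% and a few in a million, in line with the figures quoted earlier. Under the Cauchy distribution roughly 16% of values would be flagged as outliers and about 9% as extreme, even though nothing is wrong with the data.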
Checking the rate of outliers in your data is highly informative. While some outliers are expected in any random sample of a normally distributed variable, too many may indicate a technical or methodological issue. It could be as simple as discovering that you entered a value incorrectly when recording your data, or used the wrong formula.
If you’ve confirmed that there are no such errors and that your variable really is normally distributed, you can justify leaving the outlier out of your statistical analysis. Especially in small datasets, outliers can skew the results and obscure group differences, associations or correlations.
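A small, made-up sample shows how strongly one value can pull on the results. The scores below are invented for illustration and are not from any real study.

```python
from statistics import mean, median

# Ten invented scores; 37 is the suspected outlier.
with_outlier = [7, 9, 10, 11, 12, 12, 13, 14, 15, 37]
without_outlier = [s for s in with_outlier if s != 37]

print(mean(with_outlier))               # 14
print(round(mean(without_outlier), 2))  # 11.44
print(median(with_outlier), median(without_outlier))  # 12.0 12
```

A single value shifts the mean by more than two and a half points in a sample of ten, while the median does not move at all.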