In a dashboard I was creating, I found the need to surface “bad” performers. Perhaps I could have accomplished this with a top 5 list, but I wanted something a bit smarter than that – I did not want to show anything if it was Not unusual. So I looked for the mathy definition for outlier and came up with this:
An outlier is a value 1.5 times greater than the interquartile range. To understand what that means, let’s look at an example dataset.
The list of numbers above is broken out into quartiles where the second quartile begins after 3, the third begins after 6 and fourth begins after 9. The interquartile range is the third quartile ending point minus the first. i.e. 9 – 3 = 6. Then to calculate if there are any outliers we can use the following formulas:
For the end of the range: Third Quartile + Interquartile Range * 1.5
For the beginning of the range: First Quartile – Interquartile Range * 1.5
End of Range (anything above this number): 9 + (6 * 1.5) = 18
Beginning of Range (anything below this number): 3 – (6 * 1.5) = -6
In the case of this dataset, that would mean the 46 would be considered an outlier and sure enough it looks like one:
To get at this calculation in SQL, we can use one of the new window functions that was introduced in 2012, Percentile_Disc(). This takes a parameter between 0 and 1 which represents the point of the quartile. To get the first quartile, you would use 0.25.