Boxplots are very useful graphs. They show not only the range of a variable's data, but also its median and interquartile range.
Using a small dataset of mall customer information, I created the above boxplot. On the Y axis we have the Age variable, and along the X axis are the two categories, Male and Female.
Boxplots are great for comparing a numeric variable across categories (such as Gender).
The R code to create the boxplot above was simply:
boxplot(Mall_Customers$Age ~ Mall_Customers$Gender)
Note the tilde between the two variables. In R's formula notation, the variable on the left of the tilde is plotted separately for each group of the variable on the right. In this case, Age by Gender.
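The same formula can also be written with the data argument, which avoids repeating the data frame name on both sides of the tilde. The following is only a minimal sketch: the customers data frame and its values are invented for illustration and are not the mall dataset.

```r
# Toy stand-in for Mall_Customers (values are invented for illustration)
customers <- data.frame(
  Age    = c(19, 21, 35, 40, 52, 23, 31, 45, 29, 60),
  Gender = c("Male", "Male", "Male", "Male", "Male",
             "Female", "Female", "Female", "Female", "Female")
)

# Left of the tilde: the numeric variable to summarise.
# Right of the tilde: the grouping variable that defines the boxes.
bp <- boxplot(Age ~ Gender, data = customers,
              xlab = "Gender", ylab = "Age")

# boxplot() invisibly returns the statistics it drew;
# bp$names holds the group labels in plotting order
bp$names
```

Storing the result in bp is optional. It is done here only to show that the group labels on the X axis come from the variable on the right of the tilde.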
Explaining the Boxplot
While the diagram gives an obvious first impression of how male customers compare to female, there’s a lot more to take in. The dark line that resides within the rectangle is the median (the middle value), not the mean.
The median age of male customers is higher than that of female customers. The top and bottom edges of the rectangle mark the first and third quartiles, and the distance between them is the interquartile range. This is the range where the middle 50% of the data resides. The middle 50% of male customers range in age from roughly 27 to 50, whereas the middle 50% of female customers range from roughly 29 to 48. The other half of each group falls outside its box.
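To read the line and the box edges off exactly rather than estimating them from the picture, the same summaries can be computed per group. This is a rough sketch with invented ages, not the mall data; the variable names are placeholders.

```r
# Invented ages for illustration only (not the mall dataset)
age    <- c(19, 21, 35, 40, 52, 23, 31, 45, 29, 60)
gender <- rep(c("Male", "Female"), each = 5)

# Median per group: this is the dark line inside each box
medians <- tapply(age, gender, median)

# Mean per group, for comparison; boxplot() does not draw this by default
means <- tapply(age, gender, mean)

# fivenum() returns min, lower hinge, median, upper hinge, max;
# the two hinges are the box edges that boxplot() draws
hinges <- lapply(split(age, gender), fivenum)

print(medians)
print(means)
print(hinges$Male[c(2, 4)])  # lower and upper edge of the Male box
```

In this toy data the Male group has the higher median but the lower mean, which is why it matters which of the two the dark line represents.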
The so-called “whiskers” are the dashed lines above and below the rectangle. These lines end at short horizontal caps. The length from whisker end to whisker end (including the rectangle) is the total range of the data (excluding outliers).
If outliers exist, they appear as dots beyond the end of a whisker.
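By default, R's boxplot() extends each whisker to the most extreme data point that lies within 1.5 times the box length of the box, and draws anything farther out as an outlier. The sketch below uses boxplot.stats() to show those numbers directly; the values in x are invented for illustration.

```r
# Invented values for illustration; 95 is deliberately far from the rest
x <- c(22, 25, 27, 30, 31, 33, 36, 38, 41, 95)

# boxplot.stats() computes what boxplot() draws (default coef = 1.5)
s <- boxplot.stats(x)

# lower whisker end, lower box edge, median, upper box edge, upper whisker end
print(s$stats)

# points beyond the whiskers, which boxplot() draws as individual dots
print(s$out)
```

Here the upper whisker stops at 41, the largest value that is not an outlier, and 95 is reported separately in s$out.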
Going back to some other examples using the global suicide dataset, I made a boxplot of the number of suicides (from 1985 to 2015) by country.
The R code to generate this was simply:
boxplot(master$suicides_no ~ master$country)
text(x=40, y=12000, label = "Japan")
text(x=60, y=20000, label = "Russia")
text(x=90, y=15000, label = "USA")
mtext(text="Total Suicides by Country (1985-2015)")
mtext(text = "Suicides", side=2)
In this plot of suicide counts, we can quickly see that Russia has the highest counts, and quite a lot of outliers. Notice the tall wick, or whisker, on the top end of the Russian boxplot.
The USA is next in suicide count. But what’s interesting is that, although there are a lot of outliers skewing the Russian data, the USA has a higher median suicide count than Russia.
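A group with extreme outliers can still have a lower middle line than a group without them, because outliers pull the mean far more than the median. The sketch below illustrates this with invented counts for two hypothetical groups, "A" and "B"; it does not use the suicide dataset.

```r
# Invented counts for two hypothetical groups, "A" and "B"
counts <- c(10, 12, 15, 18, 500, 900,   # A: mostly small, two extreme values
            40, 45, 50, 55, 60, 65)     # B: consistently moderate
group  <- rep(c("A", "B"), each = 6)

# The median is barely affected by the two extreme values in A
group_medians <- tapply(counts, group, median)

# The mean of A is pulled upward by those same two values
group_means <- tapply(counts, group, mean)

print(group_medians)
print(group_means)
```

In this toy data, A has the larger mean while B has the larger median, so a boxplot would draw B's dark line higher even though A reaches much larger values.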