I’ve taught thousands of students about data, and if there’s one concept I hope they’ve retained, it’s that summary statistics such as averages and medians compress information. Summary statistics take a set of numbers and try to represent them all with a single number.
The question is, as an analyst, are you comfortable with that compression? Do you feel like the statistic accurately represents the underlying data?
Let’s look at two sets of numbers, Scenario 1 and Scenario 2, showing how many times per day each user uses a feature:
Both scenarios have an average of 3.
So should we say that, in both scenarios, users are using the feature 3 times per day on average?
For the first scenario, 3 feels like a fair compression of the data because the data is fairly normally distributed. As an analyst, though, it’s still good to know the maximum and minimum values in case you get asked more in-depth questions about the users’ behavior.
For the second scenario, 3 feels like a completely inaccurate compression because the distribution of the data is highly skewed, and you may consider 11 to be an outlier that should be excluded. If we excluded 11, 1 would be a very appropriate summary stat for Scenario 2, since all the remaining numbers are 1. If we don’t exclude 11, we need to provide more context whenever we present 3 as the summary stat, because we really have one user (user 5) using the feature a lot and the rest using it far less.
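To make the second scenario concrete, here is a minimal Python sketch. The values 1, 1, 1, 1, 11 are inferred from the description above (four users at 1, user 5 at 11, and a mean of 3), and the variable names are just illustrative.

```python
from statistics import mean, median

# Scenario 2 as described in the text: users 1-4 use the feature once a day,
# user 5 uses it 11 times. Values are inferred from the prose.
scenario_2 = [1, 1, 1, 1, 11]

avg_all = mean(scenario_2)      # equals 3, pulled up by user 5
med_all = median(scenario_2)    # equals 1, the typical user

# Treat 11 as an outlier and drop it, as the text suggests one might.
without_outlier = [x for x in scenario_2 if x != 11]
avg_trimmed = mean(without_outlier)  # equals 1

print(f"mean={avg_all}, median={med_all}, "
      f"min={min(scenario_2)}, max={max(scenario_2)}")
print(f"mean without user 5={avg_trimmed}")
```

Reporting the median, minimum, and maximum next to the mean is one simple way to give the extra context that a bare "3" would hide.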
With small data sets such as this example, we can look at the data itself to judge the fairness of a summary statistic. But when the amount of data you are trying to compress gets big, it is best to look at distributions to determine whether a stat fairly represents the data.
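As a rough sketch of what "looking at the distribution" can mean in practice, the snippet below counts how often each value occurs and draws a text histogram. The helper name `text_histogram` is hypothetical, and the input reuses the values inferred from the Scenario 2 description; a real data set would be much larger and would usually be plotted with a charting tool.

```python
from collections import Counter

def text_histogram(values):
    """Return one line per distinct value, with a bar showing how often it occurs."""
    counts = Counter(values)
    return [f"{value:>3} | {'#' * counts[value]}" for value in sorted(counts)]

# Values inferred from the Scenario 2 description; illustrative only.
usage_per_day = [1, 1, 1, 1, 11]

histogram_lines = text_histogram(usage_per_day)
print("\n".join(histogram_lines))
```

A long bar at 1 and a lone mark far out at 11 make the skew visible at a glance, which a single average cannot do.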
Every time you report a statistic, please look at the underlying data or the distribution to judge whether your compression of the data makes sense or not.