Dispersion is the spread of the data: the extent to which a distribution is stretched or squeezed. Measures of central tendency alone are not enough to fully understand a dataset, which is why we also need measures of dispersion. Together, central tendencies and dispersion measures paint a good picture of the dataset.
Following are the most common dispersion measures:
Range is simply the difference between the maximum and minimum values. It tells us within what range the whole dataset lies. The range is very easy and straightforward to calculate, but at the same time it is very sensitive to outliers.
R = max(dataset) - min(dataset)
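The formula above can be sketched in a few lines of Python (the dataset here is made up for illustration):

```python
# A small made-up dataset to illustrate the range.
dataset = [4, 8, 15, 16, 23, 42]

# Range: difference between the largest and smallest values.
data_range = max(dataset) - min(dataset)
print(data_range)  # 42 - 4 = 38
```

Note how a single extreme value would change the result completely, which is exactly the outlier sensitivity described above.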
Variance is a measure of the spread of data: it is the expectation of the squared deviation of a random variable from its mean.
Now, let us try to understand by intuition, without going too deep into the mathematics, why we divide by “n - 1” instead of “n”.
The ideal way to calculate the standard deviation from a sample dataset would be

σ = √( Σ(xᵢ - μ)² / n )

But we do not know ‘μ’, hence we use ‘x̄’ instead. Since ‘x̄’ is calculated using only the sample dataset, in the real world the squared deviations from ‘x̄’ are, on average, smaller than the squared deviations from ‘μ’:

Σ(xᵢ - x̄)² ≤ Σ(xᵢ - μ)²

so dividing by n would systematically underestimate the spread. Hence we use “n - 1” in the denominator to calculate the unbiased standard deviation from a sample:

s = √( Σ(xᵢ - x̄)² / (n - 1) )
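The biased (divide by n) and unbiased (divide by n - 1) versions can be compared side by side; the sample values below are made up for illustration:

```python
# A made-up sample to compare the two variance estimators.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(sample)
x_bar = sum(sample) / n  # sample mean, used in place of the unknown mu

# Sum of squared deviations from the sample mean.
squared_devs = sum((x - x_bar) ** 2 for x in sample)

biased_var = squared_devs / n          # divides by n: underestimates on average
unbiased_var = squared_devs / (n - 1)  # Bessel's correction: divides by n - 1

print(biased_var, unbiased_var)  # 4.0 vs ~4.571
```

The unbiased estimate is always slightly larger, compensating for the fact that deviations are measured from x̄ rather than the true mean μ.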
Because the unit of variance is the square of the unit of the data (as is obvious from the formula above), the value of variance is not very intuitive; hence we take its square root, which is defined as the standard deviation.
- If the standard deviation is small, the data has little spread (i.e., the majority of points fall very near the mean).
- If standard deviation = 0, there is no spread. This only happens when all data items are the same value.
- The standard deviation is significantly affected by outliers and skewed distributions.
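The three properties above can be checked directly with Python's standard library (the datasets here are invented for illustration):

```python
import statistics

# A tightly clustered made-up dataset, then the same data with one outlier.
data = [10, 12, 11, 13, 12, 11]
with_outlier = data + [100]

print(statistics.pstdev(data))          # small: points fall near the mean
print(statistics.pstdev([5, 5, 5, 5]))  # 0.0: no spread, all values equal
print(statistics.pstdev(with_outlier))  # much larger: one outlier inflates it
```

`statistics.pstdev` computes the population standard deviation (dividing by n); `statistics.stdev` is the sample version with the n - 1 correction discussed earlier.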
Thanks for reading the article! Wanna connect with me?
Here is a link to my LinkedIn Profile