Descriptive Statistics: Measures of Central Tendency and Measures of Dispersion

Suppose you have a data-set containing 10,000 data points, where each data point is, for example, a household income in a given area. What do you do with this data? The most fundamental tools for making sense of huge data-sets are “measures of central tendency” and “measures of dispersion”. Together they are known as “descriptive statistics”, because they help describe the raw data at hand.

Measures of Central Tendency

A measure of central tendency gives you a single value that is representative of a given data-set. There are 3 measures of central tendency:

  1. Mean μ
  2. Median
  3. Mode
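To make the three measures concrete, here is a minimal sketch using Python's standard `statistics` module. The income values are made up purely for illustration and are not taken from any real data-set.

```python
import statistics

# Hypothetical household incomes, invented for illustration only.
incomes = [1000, 2500, 2500, 4000, 10000]

mean_income = statistics.mean(incomes)      # sum of the values divided by their count
median_income = statistics.median(incomes)  # middle value of the sorted data
mode_income = statistics.mode(incomes)      # most frequently occurring value

print(mean_income, median_income, mode_income)
```

Notice that the single large value (10000) pulls the mean up to 4000, while the median and mode both stay at 2500.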

Below is a link that gives a very good view of measures of central tendency that can be applied.

http://www.abs.gov.au/websitedbs/a3121120.nsf/home/statistical+language+-+measures+of+central+tendency

Well, now you know what a measure of central tendency is. But is central tendency by itself enough to describe a data-set? I don’t think so, and neither does Hans Rosling, who says in his book Factfulness:

“When we compare two averages, we risk misleading ourselves even more by focusing on the gap between two single numbers, and missing the overlapping spreads, the overlapping ranges of numbers, that make each average.”

Averages (or measures of central tendency) alone can be misleading and might give you a distorted view of the world. You need to couple the measure of central tendency with a measure of the spread (a measure of dispersion).

Measures of Dispersion

Range

Range is the simplest measure of dispersion. It is the difference between the maximum and minimum values in a given data-set. In the case of household incomes, it is the difference between the highest income and the lowest income. But how effective is the range? You need to be careful in applying it when the data is not evenly spread out. As we know, household income is very unevenly distributed, so the range you get might be inflated by outliers at the higher and lower ends.

Example: Consider a data-set of household incomes where highest and lowest incomes are as below:

Highest income: 1,00,00,000 (1 crore, or 10 million)

Second highest income: 10,000 (ten thousand)

Second lowest income: 1000 (one thousand)

Lowest income: 0

The range for the above data is 1 crore, or 10 million (highest minus lowest). But is this an accurate representation of the data? No, because most of the data lies between 1,000 and 10,000, which makes the effective range 9,000.
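The effect of the two extreme values can be sketched in a few lines of Python. The four boundary incomes come from the example above; the three values in the middle are hypothetical filler, added only so that the list looks like a data-set.

```python
# 0, 1000, 10_000 and 10_000_000 are from the example above;
# 4000, 5500 and 7000 are hypothetical filler values.
incomes = [0, 1000, 4000, 5500, 7000, 10_000, 10_000_000]

# Range = maximum value minus minimum value.
full_range = max(incomes) - min(incomes)

# Drop the single lowest and single highest value to see how much
# those two points alone drive the range.
trimmed = sorted(incomes)[1:-1]
trimmed_range = max(trimmed) - min(trimmed)

print(full_range)     # 10000000
print(trimmed_range)  # 9000
```

Just two data points out of seven change the answer by three orders of magnitude.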

An adaptation of the range is the inter-quartile range, where the range is computed only for the middle 50% of the data-set, that is, the difference between the third quartile and the first quartile.
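As a rough sketch, the inter-quartile range can be computed with `statistics.quantiles` from Python's standard library. The data-set is an assumption: it reuses the four boundary incomes from the range example and adds three hypothetical middle values. Note that there are several conventions for computing quartiles, so other tools may give slightly different numbers on the same data.

```python
import statistics

# Boundary incomes from the range example plus three hypothetical middle values.
incomes = [0, 1000, 4000, 5500, 7000, 10_000, 10_000_000]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (the median) and Q3.
# This uses the module's default "exclusive" method.
q1, q2, q3 = statistics.quantiles(incomes, n=4)

iqr = q3 - q1  # spread of the middle 50% of the data

print(q1, q3, iqr)
```

For this particular list the inter-quartile range comes out at 9,000, far below the full range of 10 million.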

However, even the inter-quartile range can be susceptible to the same issues regarding the distribution of the data. A better approach is one that takes into consideration the distance of each point from the center. My first thought was to compute the distance of each data point from the mean and then calculate the average of all those distances. But the signed distances of all the data points from the center (mean) always sum to zero, because the negative distances cancel out the positive ones. Therefore, I should use the squares of the distances.

Example: Take data points whose values are 100, 200, 300 and 400. The mean is 1000/4, which is 250. The mean of the signed distances of each data point from 250 is zero:

{(100-250)+(200-250)+(300-250)+(400-250)}/4 = 0

So we take the mean of the squares of the distances instead:

{(100-250)² + (200-250)² + (300-250)² + (400-250)²}/4 = 50,000/4 = 12,500
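The arithmetic for this example can be checked with a short Python sketch. The variable names are my own choice for illustration.

```python
values = [100, 200, 300, 400]
mean = sum(values) / len(values)  # 1000 / 4 = 250.0

# Signed distance of each point from the mean: -150, -50, 50, 150.
deviations = [v - mean for v in values]

# Positive and negative distances cancel, so their mean is zero.
mean_deviation = sum(deviations) / len(values)

# Squaring makes every term non-negative, so nothing cancels.
mean_squared_deviation = sum(d ** 2 for d in deviations) / len(values)

print(mean_deviation)          # 0.0
print(mean_squared_deviation)  # 12500.0
```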

Variance

The quantity computed in the example above is the variance. The variance of a data-set of discrete values can be calculated using the formula below. As you can see, the value of the variance depends collectively on the distance of each of the data points from the mean of that data-set.

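$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

Here N is the number of data points, x_i is the i-th data point and μ is the mean of the data-set.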
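Here is a minimal sketch of the variance as described above, the mean of the squared distances from the mean, next to the standard library's `statistics.pvariance`. The helper name `population_variance` is hypothetical. One thing to watch: many tools default to the sample variance, which divides by N − 1 instead of N (in Python that is `statistics.variance`), so the two numbers differ on the same data.

```python
import statistics

def population_variance(data):
    """Mean of the squared distances of each point from the mean (divides by N)."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** 2 for x in data) / len(data)

values = [100, 200, 300, 400]

manual_var = population_variance(values)    # 12500.0
library_var = statistics.pvariance(values)  # same quantity from the standard library
sample_var = statistics.variance(values)    # divides by N - 1, so it is larger

print(manual_var, library_var, sample_var)
```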

Standard Deviation

Standard deviation is derived from variance: it is the square root of the variance. Though the two can be used interchangeably as measures of spread, I believe standard deviation is preferred over variance because it is in the same units as the original data-set.

For a data-set of discrete real values, the standard deviation can be computed using the formula below.

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

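Example: The mean of 1, 1, 2, 3, 4, 4 is given by the expression (1+1+2+3+4+4)/6

Mean = μ = 15/6 = 2.5

Using the mean computed above, we compute the standard deviation. The squared distances from the mean are 2.25, 2.25, 0.25, 0.25, 2.25 and 2.25, which add up to 9.5.

σ = √(9.5/6) ≈ 1.258306

Note that this is the population standard deviation σ, which divides by N = 6. The sample standard deviation s divides by N − 1 = 5 and gives √(9.5/5) ≈ 1.378405.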
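The value 1.258306 can be reproduced with Python's standard `statistics` module, as a quick sketch. `pstdev` divides by N (the population standard deviation) and matches that value; `stdev` divides by N − 1 (the sample standard deviation) and gives a slightly larger number.

```python
import statistics

data = [1, 1, 2, 3, 4, 4]

mu = statistics.mean(data)       # 15 / 6 = 2.5
sigma = statistics.pstdev(data)  # population standard deviation: sqrt(9.5 / 6)
s = statistics.stdev(data)       # sample standard deviation: sqrt(9.5 / 5)

print(mu)
print(round(sigma, 6))  # 1.258306
print(round(s, 6))
```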

Do the above concepts still seem abstract to you? Please read “How to apply Mean and Standard Deviation?” for a better understanding.

Hope this helps your understanding. Thank you!
