DylsexicChciken Posted November 12, 2014 Posted November 12, 2014 (edited) My book says on average the sampling variance [math]\dfrac{\sum (x- \bar{x})}{n} [/math] is biased because it usually gives estimates smaller than the actual variance of a population. We fix this by dividing the sum by n-1 instead of n: [math]\dfrac{\sum (x- \bar{x})}{n-1} [/math] Is there a more intuitive or formal explanation of this? Edited November 12, 2014 by DylsexicChciken 1
MonDie Posted November 12, 2014 Posted November 12, 2014 (edited) It didn't make sense to me either. The sample could be considered a population in its own right, so why treat it differently? My guess is that even the best sampling methods tend to reduce variability slightly, but then why not divide by [math]n-(n/?)[/math] instead? Edited November 12, 2014 by MonDie
John Posted November 12, 2014 Posted November 12, 2014 This is called Bessel's correction. The associated Wikipedia article has a section explaining the source of the bias when dividing by n, as well as three proofs, the third of which includes a subsection dealing with the intuition behind the proof. 1
Bignose Posted November 12, 2014 Posted November 12, 2014 I've always intuited it thusly: since you are only taking a sample, you need to be more conservative in the possible variance in the whole population. Whereas if you know the entire population, you don't need to be conservative because you know the variance by definition. To me, it is simply a way of being a little more sure that your sample variance range has captured the true population variance. 1
MonDie Posted November 12, 2014 Posted November 12, 2014 (edited) John, your link doesn't go to a Wiki page. Edited November 12, 2014 by MonDie
studiot Posted November 12, 2014 Posted November 12, 2014 (edited) In addition to the Wiki article: I was hoping to post a table of Bessel's correction, but I have had to ask how to post a table (http://www.scienceforums.net/topic/86509-posting-a-table/). Bessel's correction and the (n-1) is also associated with statistical 'degrees of freedom'. The ultimate for this is Gosset's 'Student's t distribution'. Edit: I now have the table (thanks Acme) and it is interesting how quickly the correction approaches 1 as the number of samples increases.

Number in sample, n | Bessel's correction, n/(n-1)
2 | 2.0
5 | 1.25
10 | 1.11111
100 | 1.01010
1000 | 1.00100

Edited November 12, 2014 by studiot
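(For anyone who wants to reproduce the table, here is a minimal Python sketch, purely illustrative and not part of the original post.)

```python
# Bessel's correction factor n/(n-1) for increasing sample sizes.
# Reproduces the table above: the factor shrinks toward 1 as n grows.
for n in [2, 5, 10, 100, 1000]:
    factor = n / (n - 1)
    print(f"n = {n:4d}  ->  n/(n-1) = {factor:.5f}")
```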
mathematic Posted November 12, 2014 Posted November 12, 2014 My book says on average the sampling variance [math]\dfrac{\sum (x- \bar{x})}{n} [/math] is biased because it usually gives estimates smaller than the actual variance of a population. We fix this by dividing the sum by n-1 instead of n: [math]\dfrac{\sum (x- \bar{x})}{n-1} [/math] Is there a more intuitive or formal explanation of this? [math]E(\dfrac{\sum (x- \bar{x})^2}{n-1}) [/math] = variance, when [math]\bar{x}[/math] is the sample average. Note your expressions left out the squaring.
MonDie Posted November 12, 2014 Posted November 12, 2014

Number in sample, n | Bessel's correction, n/(n-1)
2 | 2.0
5 | 1.25
10 | 1.11111
100 | 1.01010
1000 | 1.00100

That's exactly my point. It's going to exaggerate the standard deviation less if your sample is larger. How is sample size related to the limiting effects of selection bias (if that's what this is correcting for)?
studiot Posted November 13, 2014 Posted November 13, 2014 (edited) Chicken: Is there a more intuitive or formal explanation of this? MonDie: How is sample size related to the limiting effects of selection bias (if that's what this is correcting for)? Both the full population and the sample have a mean and a variance (or standard deviation). There is no reason for these two parameters to be the same in both the population and the sample, or between two samples, unless the sample size equals the whole population. If we take the variance of the sample to be [math]{T^2} = \sum {\frac{{{{\left( {{X_i} - \overline X } \right)}^2}}}{n}} [/math] we would wish it to be equal to the variance of the population, [math]{\sigma ^2}[/math]. However, several lines of algebra show it to be actually equal to [math]{\sigma ^2} - \frac{{{\sigma ^2}}}{n} = \frac{{n - 1}}{n}{\sigma ^2}[/math] So if we 'correct' this deficiency by multiplying by [math]\frac{n}{{n - 1}}[/math] we obtain the wanted equality. You can see that Bessel's correction is equivalent to using (n-1) instead of n in the calculation of sample variance. Do you really want the algebra proof? Edited November 13, 2014 by studiot
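(A numerical check of the algebra above, not from the thread: a small Monte Carlo sketch in Python, assuming a standard normal population so that the true variance is [math]\sigma^2 = 1[/math].)

```python
import random

random.seed(0)
n = 5              # sample size
trials = 200_000   # number of samples to average over
# Population: standard normal, so the true variance sigma^2 is 1.0.

total = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    xbar = sum(xs) / n
    # Uncorrected sample variance T^2: divide by n, not n-1.
    total += sum((x - xbar) ** 2 for x in xs) / n

avg_uncorrected = total / trials
print(avg_uncorrected)                # close to (n-1)/n = 0.8, not 1.0
print(avg_uncorrected * n / (n - 1))  # after Bessel's correction: close to 1.0
```

The average of the uncorrected estimates comes out near (n-1)/n times the true variance, exactly as the algebra predicts, and multiplying by n/(n-1) pulls it back to the true value.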
MonDie Posted November 13, 2014 Posted November 13, 2014 (edited) Do you really want the algebra proof? I don't want a proof of the solution so much as I want to understand the problem itself. I don't see what the correction fixes. Edited November 13, 2014 by MonDie
studiot Posted November 13, 2014 Posted November 13, 2014 (edited) OK, so what do we actually want to measure when we sample? In other words, why do we sample? Well, we don't want the actual value for one item. We want a single number that will best represent the whole population. So we want the population average or mean, [math]\mu [/math]. This is given by the formula [math]\mu  = \sum {\frac{{{X_i}}}{N}} [/math] That is, we add all the individual values, Xi, up and divide by the number of values in the population. But we also (often) want an idea of the spread of the data. We obtain this as the variance (often reported as the standard deviation, [math]\sigma [/math], or square root of the variance), given by the formula [math]{\sigma ^2} = \sum {\frac{{{{\left( {{X_i} - \mu } \right)}^2}}}{N}} [/math] That is, we add up all the deviations, square them, and divide the result by the number of values in the population. But what about the sample? Using upper case letters to denote values from the population, and lower case for values from the sample: if we did the same for only some of the values, would we be fairly representing the population mean and variance? Well, it turns out that if we took every possible sample of size n < N, we find that the average of all the sample means of size n is the same as the population average, [math]\mu [/math], although the sample mean for any particular sample may not be the same as that of the population. But if we take the average variance of all possible samples of size n < N, we find it is smaller than the population variance, [math]{\sigma ^2}[/math].
Remembering that we are really interested in the parameters for the population, not the individual sample, we find that we can take the sample average as a fair representation of the population average. But, and this is what we want to 'fix', we cannot take the variance of the sample as calculated by the formula [math]{\sigma _s}^2 = \sum {\frac{{{{\left( {{x_i} - \overline x } \right)}^2}}}{n}} [/math] as a fair representation of the population variance. Instead of algebra to prove this for all cases, the attachment shows a worked example for a very simple case of the population being three numbers {10,20,30} and the sample size being two numbers. So N = 3 and n = 2. It can be seen that the mean of all the sample means is the same as the population mean, but the average variance of all the samples is only half that of the population variance. It can also be seen that Bessel's correction for this is exactly 2: [math]\frac{n}{{\left( {n - 1} \right)}} = \frac{2}{{\left( {2 - 1} \right)}} = 2[/math] Please also note I have tried to bring out when to use N or n and when to use (n-1) - we don't use (N-1). Edited November 13, 2014 by studiot 2
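(For readers who cannot see the attachment, the worked example can be reproduced exhaustively in a few lines of Python. This is an illustrative sketch, not the original attachment; it assumes the samples are ordered and drawn with replacement, giving nine possible samples of size 2.)

```python
from itertools import product

pop = [10, 20, 30]
N = len(pop)
mu = sum(pop) / N                              # population mean: 20.0
pop_var = sum((x - mu) ** 2 for x in pop) / N  # population variance: 200/3

n = 2
samples = list(product(pop, repeat=n))         # all 9 ordered samples of size 2

def sample_var_div_n(s):
    """Sample variance dividing by n, i.e. without Bessel's correction."""
    m = sum(s) / n
    return sum((x - m) ** 2 for x in s) / n

mean_of_means = sum(sum(s) / n for s in samples) / len(samples)
avg_var = sum(sample_var_div_n(s) for s in samples) / len(samples)

print(mean_of_means)          # 20.0      -- equals the population mean
print(avg_var)                # 33.33...  -- only half of pop_var (66.66...)
print(avg_var * n / (n - 1))  # 66.66...  -- Bessel's correction recovers pop_var
```

The average of the sample means lands exactly on the population mean, while the average uncorrected variance is exactly half the population variance, so multiplying by n/(n-1) = 2 restores it.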
John Posted November 13, 2014 Posted November 13, 2014 (edited) John, your link doesn't go to a Wiki page. Thanks for letting me know. The forum software is a bit weird with links containing apostrophes. Since I can no longer edit my post, here is the corrected link: Bessel's correction. Edited November 13, 2014 by John 1
MonDie Posted November 13, 2014 Posted November 13, 2014 Studiot, the attachment is probably exactly what I wanted, but I'll have time later. btw, your formula doesn't square the deviations: [math]^{2}[/math]
studiot Posted November 13, 2014 Posted November 13, 2014 btw, your formula doesn't square the deviations. ^{2} Glad you were awake enough to spot the deliberate mistake, now corrected. Hope the rest is helpful, read the attachment in conjunction with the text in the post.
MonDie Posted November 14, 2014 Posted November 14, 2014 (edited) Studiot, I understand your attachment. Calculating the sample variances (s) with n-1 leads to an average s that equals the population variance ([math]\mu[/math]). However, giving an average that is closer does not mean it's a better estimate. Using those same numbers... If you calculate the average absolute deviation of s from [math]\mu[/math], i.e. the value of [math]\frac{\sum|s_{i} - \mu|}{n}[/math], or even if you find the square root of the average of the error squared, [math](\frac{\sum(s_{i} - \mu)^{2}}{n})^{0.5}[/math], you find that n-1 produces more error (59.44, 74.5) than n-0 (48.3, 50.22). Edited November 14, 2014 by MonDie
mathematic Posted November 14, 2014 Posted November 14, 2014 I don't want a proof of the solution so much as I want to understand the problem itself. I don't see what the correction fixes. The correction corrects for the fact that the sample average is not the true mean. The definition of variance is based on differences from the true mean. However, when we don't know the true mean, we estimate it using the sample average. Using n-1 makes the estimate of the variance fair; that is, the average of the sample variance equals the true variance.
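(mathematic's point, that deviations measured from the sample average are systematically smaller than deviations from the true mean, can be illustrated with a short simulation. A sketch, not from the thread, again assuming a standard normal population with known true mean 0 and variance 1.)

```python
import random

random.seed(1)
n = 5
trials = 100_000
mu = 0.0                       # true mean of the standard normal population

ss_about_xbar = 0.0            # sum of squared deviations about the sample average
ss_about_mu = 0.0              # sum of squared deviations about the true mean
for _ in range(trials):
    xs = [random.gauss(mu, 1.0) for _ in range(n)]
    xbar = sum(xs) / n
    ss_about_xbar += sum((x - xbar) ** 2 for x in xs)
    ss_about_mu += sum((x - mu) ** 2 for x in xs)

print(ss_about_xbar / trials)  # close to (n-1)*sigma^2 = 4
print(ss_about_mu / trials)    # close to n*sigma^2 = 5
```

On average the sum of squares about the sample mean comes out near (n-1), not n, which is exactly the deficit that dividing by (n-1) instead of n compensates for.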
MonDie Posted November 14, 2014 Posted November 14, 2014 (edited) Yes, I see that now (although you seem to define "variance" differently). That last post was an edit rollercoaster because I was confusing the population n with the sample n. From studiot's numbers, however, it looks like n-0 is preferable if I want to estimate the population mean from only one sample. Edited November 14, 2014 by MonDie
studiot Posted November 14, 2014 Posted November 14, 2014 From studiot's numbers, however, it looks like n-0 is preferable if I want to estimate the population mean from only one sample. You do not correct the sample mean, only the sample variance. So you always use n-0 to calculate the sample mean. Please also note I have tried to bring out when to use the N or n and when to use (n-1) - we don't use (N-1). The issue is, as mathematic pointed out, that the mean of a single sampling will probably not match the mean of the whole population exactly. In my example, although the population mean is the most common value amongst the sample means, an individual sample mean equals the population mean in only 1/3 of the possible samples. With only one sample you cannot estimate the variance or standard deviation, unless N = n = 1.
MonDie Posted November 14, 2014 Posted November 14, 2014 (edited) I made a mistake. I was using [math]\mu[/math] where I should have used [math]\sigma^{2}[/math], but I was still talking about variance. If you take the average value of [math]|s_{i}-\sigma|[/math] or [math]|s_{i}^{2}-\sigma^{2}|[/math], you find that Bessel's correction results in a higher average error (with the numbers given). Edited November 14, 2014 by MonDie
Bignose Posted November 14, 2014 Posted November 14, 2014 MonDie, the problem is that you can't just average sample variances like this and expect the result to be meaningful. Take a look at http://www.emathzone.com/tutorials/basic-statistics/combined-variance.html You are expecting [math]\int_{POP} (x-\mu_{POP})^2 f(x) dx = \int_{SAMP} y \int_{\Omega_y} (x-\mu_{y})^2 f(x) dx dy [/math] where the LHS is the population variance and the RHS is the average of sample variances (the inner integral represents that variance calculation over a single sample). And there is no real reason why the two should be equal in the general case.
MonDie Posted November 14, 2014 Posted November 14, 2014 BigNose, in the link, are those brackets for absolute value or a floor/ceiling function?
Bignose Posted November 14, 2014 Posted November 14, 2014 BigNose, in the link, are those brackets for absolute value or a floor/ceiling function? Neither. Just square brackets used so that they look different than regular parentheses.
MonDie Posted November 14, 2014 Posted November 14, 2014 MonDie, the problem is that you can't just average sample variances like this and expect the result to be meaningful. Take a look at http://www.emathzone.com/tutorials/basic-statistics/combined-variance.html You are expecting [math]\int_{POP} (x-\mu_{POP})^2 f(x) dx = \int_{SAMP} y \int_{\Omega_y} (x-\mu_{y})^2 f(x) dx dy [/math] where the LHS is the population variance and the RHS is the average of sample variances (the inner integral represents that variance calculation over a single sample). And there is no real reason why the two should be equal in the general case. Regarding the link, I don't understand why they're summing the sample variance [math]S_{i}^{2}[/math] with the difference between means squared [math](\bar{x}_{i} - \bar{X})^{2}[/math]
Bignose Posted November 15, 2014 Posted November 15, 2014 Regarding the link, I don't understand why they're summing the sample variance [math]S_{i}^{2}[/math] with the difference between means squared [math](\bar{x}_{i} - \bar{X})^{2}[/math] You dropped the subscript c in on the X term. That is important. See the definition of X_c
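(The formula in the link is an instance of decomposing a pooled variance into a within-group part and a between-group part, which is why the sample variance and the squared difference of means are summed. A small Python sketch, illustrative only, using equal-sized groups and divide-by-n variances throughout so the decomposition is exact.)

```python
def var_n(xs):
    """Variance with division by n (no Bessel correction)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

groups = [[10, 20], [10, 30], [20, 30]]   # three equal-sized samples
pooled = [x for g in groups for x in g]   # all six values together

grand_mean = sum(pooled) / len(pooled)    # 20.0
within = sum(var_n(g) for g in groups) / len(groups)    # average S_i^2
between = sum((sum(g) / len(g) - grand_mean) ** 2
              for g in groups) / len(groups)            # avg (mean_i - grand)^2

print(var_n(pooled))       # 66.66...
print(within + between)    # 66.66... -- pooled variance = within + between
```

So neither the average of the sample variances nor the spread of the sample means alone reproduces the pooled variance; you need both terms, which is Bignose's point about why sample variances cannot simply be averaged.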
MonDie Posted November 16, 2014 Posted November 16, 2014 (edited) I didn't represent it correctly, but I had the correct meaning in my head. BigNose, I didn't know any calculus until your integrals spurred some independent learning. Now I understand that the integral of a function over an interval is the mean y-value multiplied by the width of the interval (I avoided the words "area" and "volume" because my understanding is that the lower quadrants are scored negatively). I also read the first six lessons of Capn's derivatives tutorial. Although your equation may not be true, I do want to know how you were using integrals to represent variance. I'm not familiar with integral notation yet. Edited November 16, 2014 by MonDie