DylsexicChciken Posted November 12, 2014 Posted November 12, 2014 (edited) My book says on average the sampling variance [math]\dfrac{\sum (x- \bar{x})}{n} [/math] is biased because it usually gives estimates smaller than the actual variance of a population. We fix this by dividing the sum by n-1 instead of n: [math]\dfrac{\sum (x- \bar{x})}{n-1} [/math] Is there a more intuitive or formal explanation of this? Edited November 12, 2014 by DylsexicChciken 1
MonDie Posted November 12, 2014 Posted November 12, 2014 (edited) It didn't make sense to me either. The sample could be considered a population in its own right, so why treat it differently? My guess is that even the best sampling methods tend to reduce variability slightly, but then why not divide by [math]n-(n/?)[/math] instead? Edited November 12, 2014 by MonDie
John Posted November 12, 2014 Posted November 12, 2014 This is called Bessel's correction. The associated Wikipedia article has a section explaining the source of the bias when dividing by n, as well as three proofs, the third of which includes a subsection dealing with the intuition behind the proof. 1
Bignose Posted November 12, 2014 Posted November 12, 2014 I've always intuited it thusly: since you are only taking a sample, you need to be more conservative in the possible variance in the whole population. Whereas if you know the entire population, you don't need to be conservative because you know the variance by definition. To me, it is simply a way of being a little more sure that your sample variance range has captured the true population variance. 1
MonDie Posted November 12, 2014 Posted November 12, 2014 (edited) John, your link doesn't go to a Wiki page. Edited November 12, 2014 by MonDie
studiot Posted November 12, 2014 Posted November 12, 2014 (edited) In addition to the Wiki article: I was hoping to post a table of Bessel's correction, but I have had to ask how to post a table (http://www.scienceforums.net/topic/86509-posting-a-table/). Bessel's correction and the (n-1) is also associated with statistical 'degrees of freedom'. The ultimate for this is Gosset's 'Student's t distribution'. Edit: I now have the table (thanks Acme) and it is interesting how quickly the correction approaches 1 as the number of samples increases.

Number in sample, n | Bessel's correction, n/(n-1)
2 | 2.0
5 | 1.25
10 | 1.11111
100 | 1.01010
1000 | 1.00100

Edited November 12, 2014 by studiot
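(For anyone who wants to reproduce the table, here is a minimal Python sketch, purely illustrative and not part of the original post.)

```python
# Bessel's correction factor n/(n-1) for increasing sample sizes.
# Reproduces the table above: the factor shrinks toward 1 as n grows.
for n in [2, 5, 10, 100, 1000]:
    factor = n / (n - 1)
    print(f"n = {n:4d}  ->  n/(n-1) = {factor:.5f}")
```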
mathematic Posted November 12, 2014 Posted November 12, 2014 My book says on average the sampling variance [math]\dfrac{\sum (x- \bar{x})}{n} [/math] is biased because it usually gives estimates smaller than the actual variance of a population. We fix this by dividing the sum by n-1 instead of n: [math]\dfrac{\sum (x- \bar{x})}{n-1} [/math] Is there a more intuitive or formal explanation of this? [math]E(\dfrac{\sum (x- \bar{x})^2}{n-1}) [/math] = variance, when [math]\bar{x}[/math] is the sample average. Note your expressions left out the squaring.
MonDie Posted November 12, 2014 Posted November 12, 2014

Number in sample, n | Bessel's correction, n/(n-1)
2 | 2.0
5 | 1.25
10 | 1.11111
100 | 1.01010
1000 | 1.00100

That's exactly my point. It's going to exaggerate the standard deviation less if your sample is larger. How is sample size related to the limiting effects of selection bias (if that's what this is correcting for)?
studiot Posted November 13, 2014 Posted November 13, 2014 (edited) Chicken: Is there a more intuitive or formal explanation of this? MonDie: How is sample size related to the limiting effects of selection bias (if that's what this is correcting for)? Both the full population and the sample have a mean and a variance (or standard deviation). There is no reason for these two parameters to be the same in both the population and the sample, or between two samples, unless the sample size equals the whole population. If we take the variance of the sample to be [math]{T^2} = \sum {\frac{{{{\left( {{X_i} - \overline X } \right)}^2}}}{n}} [/math] we would wish it to be equal to the variance of the population, [math]{\sigma ^2}[/math]. However, several lines of algebra show it to be actually equal to [math]{\sigma ^2} - \frac{{{\sigma ^2}}}{n} = \frac{{n - 1}}{n}{\sigma ^2}[/math] So if we 'correct' this deficiency by multiplying by [math]\frac{n}{{n - 1}}[/math] we obtain the wanted equality. You can see that Bessel's correction is equivalent to using (n-1) instead of n in the calculation of sample variance. Do you really want the algebra proof? Edited November 13, 2014 by studiot
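(A numerical check of the algebra above, not from the thread: a small Monte Carlo sketch in Python, assuming a standard normal population so that the true variance is [math]\sigma^2 = 1[/math].)

```python
import random

random.seed(0)
n = 5              # sample size
trials = 200_000   # number of samples to average over
# Population: standard normal, so the true variance sigma^2 is 1.0.

total = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    xbar = sum(xs) / n
    # Uncorrected sample variance T^2: divide by n, not n-1.
    total += sum((x - xbar) ** 2 for x in xs) / n

avg_uncorrected = total / trials
print(avg_uncorrected)                # close to (n-1)/n = 0.8, not 1.0
print(avg_uncorrected * n / (n - 1))  # after Bessel's correction: close to 1.0
```

The average of the uncorrected estimates comes out near (n-1)/n times the true variance, exactly as the algebra predicts, and multiplying by n/(n-1) pulls it back to the true value.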
MonDie Posted November 13, 2014 Posted November 13, 2014 (edited) Do you really want the algebra proof? I don't want a proof of the solution so much as I want to understand the problem itself. I don't see what the correction fixes. Edited November 13, 2014 by MonDie
studiot Posted November 13, 2014 Posted November 13, 2014 (edited) OK, so what do we actually want to measure when we sample? In other words, why do we sample? Well, we don't want the actual value for one item. We want a single number that will best represent the whole population. So we want the population average or mean, [math]\mu [/math]. This is given by the formula [math]\mu  = \sum {\frac{{{X_i}}}{N}} [/math] That is, we add all the individual values, Xi, up and divide by the number of values in the population. But we also (often) want an idea of the spread of the data. We obtain this as the variance (often reported as the standard deviation, [math]\sigma [/math], or square root of the variance), given by the formula [math]{\sigma ^2} = \sum {\frac{{{{\left( {{X_i} - \mu } \right)}^2}}}{N}} [/math] That is, we add up all the deviations, square them, and divide the result by the number of values in the population. But what about the sample? Using upper case letters to denote values from the population, and lower case for values from the sample: if we did the same for only some of the values, would we be fairly representing the population mean and variance? Well, it turns out that if we took every possible sample of size n < N, we find that the average of all the sample means of size n is the same as the population average, [math]\mu [/math], although the sample mean for any particular sample may not be the same as that of the population. But if we take the average variance of all possible samples of size n < N, we find it is smaller than the population variance, [math]{\sigma ^2}[/math].
Remembering that we are really interested in the parameters for the population, not the individual sample, we find that we can take the sample average as a fair representation of the population average. But, and this is what we want to 'fix', we cannot take the variance of the sample as calculated by the formula [math]{\sigma _s}^2 = \sum {\frac{{{{\left( {{x_i} - \overline x } \right)}^2}}}{n}} [/math] as a fair representation of the population variance. Instead of algebra to prove this for all cases, the attachment shows a worked example for a very simple case of the population being three numbers {10,20,30} and the sample size being two numbers. So N = 3 and n = 2. It can be seen that the mean of all the sample means is the same as the population mean, but the average variance of all the samples is only half that of the population variance. It can also be seen that Bessel's correction for this is exactly 2: [math]\frac{n}{{\left( {n - 1} \right)}} = \frac{2}{{\left( {2 - 1} \right)}} = 2[/math] Please also note I have tried to bring out when to use N or n and when to use (n-1) - we don't use (N-1). Edited November 13, 2014 by studiot 2
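(For readers who cannot see the attachment, the worked example can be reproduced exhaustively in a few lines of Python. This is an illustrative sketch, not the original attachment; it assumes the samples are ordered and drawn with replacement, giving nine possible samples of size 2.)

```python
from itertools import product

pop = [10, 20, 30]
N = len(pop)
mu = sum(pop) / N                              # population mean: 20.0
pop_var = sum((x - mu) ** 2 for x in pop) / N  # population variance: 200/3

n = 2
samples = list(product(pop, repeat=n))         # all 9 ordered samples of size 2

def sample_var_div_n(s):
    """Sample variance dividing by n, i.e. without Bessel's correction."""
    m = sum(s) / n
    return sum((x - m) ** 2 for x in s) / n

mean_of_means = sum(sum(s) / n for s in samples) / len(samples)
avg_var = sum(sample_var_div_n(s) for s in samples) / len(samples)

print(mean_of_means)          # 20.0      -- equals the population mean
print(avg_var)                # 33.33...  -- only half of pop_var (66.66...)
print(avg_var * n / (n - 1))  # 66.66...  -- Bessel's correction recovers pop_var
```

The average of the sample means lands exactly on the population mean, while the average uncorrected variance is exactly half the population variance, so multiplying by n/(n-1) = 2 restores it.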
John Posted November 13, 2014 Posted November 13, 2014 (edited) John, your link doesn't go to a Wiki page. Thanks for letting me know. The forum software is a bit weird with links containing apostrophes. Since I can no longer edit my post, here is the corrected link: Bessel's correction. Edited November 13, 2014 by John 1
MonDie Posted November 13, 2014 Posted November 13, 2014 Studiot, the attachment is probably exactly what I wanted, but I'll have time later. btw, your formula doesn't square the deviations: [math]^{2}[/math]
studiot Posted November 13, 2014 Posted November 13, 2014 btw, your formula doesn't square the deviations. ^{2} Glad you were awake enough to spot the deliberate mistake, now corrected. Hope the rest is helpful, read the attachment in conjunction with the text in the post.
MonDie Posted November 14, 2014 Posted November 14, 2014 (edited) Studiot, I understand your attachment. Calculating the sample variances (s) with n-1 leads to an average s that equals the population variance ([math]\mu[/math]). However, giving an average that is closer does not mean it's a better estimate. Using those same numbers... If you calculate the average absolute deviation of s from [math]\mu[/math], i.e. the value of [math]\frac{\sum|s_{i} - \mu|}{n}[/math], or even if you find the square root of the average of the error squared, [math](\frac{\sum(s_{i} - \mu)^{2}}{n})^{0.5}[/math], you find that n-1 produces more error (59.44, 74.5) than n-0 (48.3, 50.22). Edited November 14, 2014 by MonDie
mathematic Posted November 14, 2014 Posted November 14, 2014 I don't want a proof of the solution so much as I want to understand the problem itself. I don't see what the correction fixes. The correction corrects for the fact that the sample average is not the true mean. The definition of variance is based on differences from the true mean. However, when we don't know the true mean, we estimate it using the sample average. Using n-1 makes the estimate of the variance fair; that is, the average of the sample variance equals the true variance.
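(mathematic's point, that deviations measured from the sample average are systematically smaller than deviations from the true mean, can be illustrated with a short simulation. A sketch, not from the thread, again assuming a standard normal population with known true mean 0 and variance 1.)

```python
import random

random.seed(1)
n = 5
trials = 100_000
mu = 0.0                       # true mean of the standard normal population

ss_about_xbar = 0.0            # sum of squared deviations about the sample average
ss_about_mu = 0.0              # sum of squared deviations about the true mean
for _ in range(trials):
    xs = [random.gauss(mu, 1.0) for _ in range(n)]
    xbar = sum(xs) / n
    ss_about_xbar += sum((x - xbar) ** 2 for x in xs)
    ss_about_mu += sum((x - mu) ** 2 for x in xs)

print(ss_about_xbar / trials)  # close to (n-1)*sigma^2 = 4
print(ss_about_mu / trials)    # close to n*sigma^2 = 5
```

On average the sum of squares about the sample mean comes out near (n-1), not n, which is exactly the deficit that dividing by (n-1) instead of n compensates for.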
MonDie Posted November 14, 2014 Posted November 14, 2014 (edited) Yes, I see that now (although you seem to define "variance" differently). That last post was an edit rollercoaster because I was confusing the population n with the sample n. From studiot's numbers, however, it looks like n-0 is preferable if I want to estimate the population mean from only one sample. Edited November 14, 2014 by MonDie
studiot Posted November 14, 2014 Posted November 14, 2014 From studiot's numbers, however, it looks like n-0 is preferable if I want to estimate the population mean from only one sample. You do not correct the sample mean, only the sample variance. So you always use n-0 to calculate the sample mean. Please also note I have tried to bring out when to use the N or n and when to use (n-1) - we don't use (N-1). The issue is, as mathematic pointed out, that the mean of a single sampling will probably not match the mean of the whole population exactly. In my example, although the population mean is the most common value amongst the sample means, an individual sample mean equals the population mean in only 1/3 of the possible samples. With only one sample you cannot estimate the variance or standard deviation, unless N = n = 1.
MonDie Posted November 14, 2014 Posted November 14, 2014 (edited) I made a mistake. I was using [math]\mu[/math] where I should have used [math]\sigma^{2}[/math], but I was still talking about variance. If you take the average value of [math]|s_{i}-\sigma|[/math] or [math]|s_{i}^{2}-\sigma^{2}|[/math], you find that Bessel's correction results in a higher average error (with the numbers given). Edited November 14, 2014 by MonDie
Bignose Posted November 14, 2014 Posted November 14, 2014 MonDie, the problem is that you can't just average sample variances like this and expect the result to be meaningful. Take a look at http://www.emathzone.com/tutorials/basic-statistics/combined-variance.html You are expecting [math]\int_{POP} (x-\mu_{POP})^2 f(x) dx = \int_{SAMP} y \int_{\Omega_y} (x-\mu_{y})^2 f(x) dx dy [/math] where the LHS is the population variance and the RHS is the average of sample variances (the inner integral represents that variance calculation over a single sample). And there is no real reason why the two should be equal in the general case.
MonDie Posted November 14, 2014 Posted November 14, 2014 BigNose, in the link, are those brackets for absolute value or a floor/ceiling function?
Bignose Posted November 14, 2014 Posted November 14, 2014 BigNose, in the link, are those brackets for absolute value or a floor/ceiling function? Neither. Just square brackets used so that they look different than regular parentheses.
MonDie Posted November 14, 2014 Posted November 14, 2014 MonDie, the problem is that you can't just average sample variances like this and expect the result to be meaningful. Take a look at http://www.emathzone.com/tutorials/basic-statistics/combined-variance.html You are expecting [math]\int_{POP} (x-\mu_{POP})^2 f(x) dx = \int_{SAMP} y \int_{\Omega_y} (x-\mu_{y})^2 f(x) dx dy [/math] where the LHS is the population variance and the RHS is the average of sample variances (the inner integral represents that variance calculation over a single sample). And there is no real reason why the two should be equal in the general case. Regarding the link, I don't understand why they're summing the sample variance [math]S_{i}^{2}[/math] with the difference between means squared [math](\bar{x}_{i} - \bar{X})^{2}[/math]
Bignose Posted November 15, 2014 Posted November 15, 2014 Regarding the link, I don't understand why they're summing the sample variance [math]S_{i}^{2}[/math] with the difference between means squared [math](\bar{x}_{i} - \bar{X})^{2}[/math] You dropped the subscript c in on the X term. That is important. See the definition of X_c
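(The formula in the link is an instance of decomposing a pooled variance into a within-group part and a between-group part, which is why the sample variance and the squared difference of means are summed. A small Python sketch, illustrative only, using equal-sized groups and divide-by-n variances throughout so the decomposition is exact.)

```python
def var_n(xs):
    """Variance with division by n (no Bessel correction)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

groups = [[10, 20], [10, 30], [20, 30]]   # three equal-sized samples
pooled = [x for g in groups for x in g]   # all six values together

grand_mean = sum(pooled) / len(pooled)    # 20.0
within = sum(var_n(g) for g in groups) / len(groups)    # average S_i^2
between = sum((sum(g) / len(g) - grand_mean) ** 2
              for g in groups) / len(groups)            # avg (mean_i - grand)^2

print(var_n(pooled))       # 66.66...
print(within + between)    # 66.66... -- pooled variance = within + between
```

So neither the average of the sample variances nor the spread of the sample means alone reproduces the pooled variance; you need both terms, which is Bignose's point about why sample variances cannot simply be averaged.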
MonDie Posted November 16, 2014 Posted November 16, 2014 (edited) I didn't represent it correctly, but I had the correct meaning in my head. BigNose, I didn't know any calculus until your integrals spurred some independent learning. Now I understand that the integral of a function over an interval is the mean y-value multiplied by the width of the interval (I avoided the words "area" and "volume" because my understanding is that the lower quadrants are scored negatively). I also read the first six lessons of Capn's derivatives tutorial. Although your equation may not be true, I do want to know how you were using integrals to represent variance. I'm not familiar with integral notation yet. Edited November 16, 2014 by MonDie