Cyrus Posted March 24, 2004 I'm having difficulty interpreting the two-tailed results from an independent t-test. The test statistic is -2.459 and the p-value is 0.020. Can you have a negative test statistic? But it still means there's a significant difference, right? Because the p-value (0.020/2 = 0.01) is smaller than 0.05. I'd appreciate your help, as I'm a total muppet when it comes to the statistics part of psychology.
Glider Posted March 24, 2004 A t value can be positive or negative. Basically, t is a continuum with zero in the centre, where zero indicates the means of the two samples are identical. The further t is from zero, in either direction, the greater the difference between the means of the two samples of data. The value for p (significance) comes from a different test, but SPSS does both and presents all the statistics you need to report: t, df and p. In Psychology, the cut-off point for significance is 0.05 (5%), so any p value less than 0.05 is considered significant. If you are doing an independent groups t test (e.g. males vs. females) and measured reaction times, you would, for example, code males as 1 and females as 2. Say this gives a result of t = -2.459. The negative value for t comes simply from the way you coded your groups. If you coded females as 1 and males as 2 and re-ran the same test on the same data, you would get the result t = 2.459 (same magnitude, just at the other end of the continuum). Essentially, t = -2.459, p = 0.02 indicates a significant (but not very large) difference between your samples.
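To illustrate the point about group coding, here's a minimal sketch in Python (with made-up reaction times, not Cyrus's SPSS data) showing that swapping which group is entered first flips the sign of t but leaves its magnitude and the two-tailed p-value unchanged:

```python
# A minimal sketch with made-up reaction times (not Cyrus's SPSS data):
# swapping which group is entered first flips the sign of t, but the
# magnitude and the two-tailed p-value stay the same.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
males = rng.normal(loc=450, scale=40, size=30)    # hypothetical reaction times (ms)
females = rng.normal(loc=480, scale=40, size=30)  # hypothetical reaction times (ms)

t_mf, p_mf = stats.ttest_ind(males, females)  # "males coded as group 1"
t_fm, p_fm = stats.ttest_ind(females, males)  # "females coded as group 1"

print(t_mf, p_mf)  # t is negative here: the first group's mean is lower
print(t_fm, p_fm)  # same magnitude and p, sign flipped
```

The sign only tells you which group's mean was higher, exactly as Glider says; the p-value is unaffected.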
wolfson Posted March 24, 2004 Yes, the t-statistic can take negative values. It doesn't affect the test itself, which is based on the absolute value of the statistic. I'd avoid using p-values, myself. Textbooks often quote something along the lines of: "A p-value < 0.05 indicates a significant result." I object to this for two reasons: a) The threshold value 0.05 is completely arbitrary, and has no particular significance. b) There are usually much better tests available. The only reason p-values are used is that those better tests are often more complicated to implement; p-values themselves have few desirable statistical properties. They're usually abused to attempt to give significance to very small samples. The use of p-values has even been blamed for a spate of apparent 'breakthroughs' in pharmaceutical trials a few years back. What happened was that a large number of initial trials reported supposedly 'significant' results based on the 'p < 0.05' criterion; however, larger and more extensive trials revealed that in most cases there was no actual benefit to patients. In summary, forget about p-values; they're complete and utter rubbish.
Glider Posted March 25, 2004 wolfson said in post # : Yes, the t-statistic can take negative values. It doesn't affect the test itself, which is based on the absolute value of the statistic. But the sign does indicate the direction of the difference (A greater than B, or A less than B), so it's still important. I'd avoid using p-values, myself. Textbooks often quote something along the lines of: "A p-value < 0.05 indicates a significant result." I object to this for two reasons: a) The threshold value 0.05 is completely arbitrary, We know that the cut-off value of 0.05 is arbitrary, and as such, we know it is not fixed and can be changed (reduced, never increased) according to the requirements of the particular experiment. That the alpha value is simply a convention is not grounds for rejection. There is a reason for that convention. ...and has no particular significance. I beg to differ. Research is based on probability. Hypotheses are formulated to be disproved, but they can never be proven, only supported. Due to the absence of certainty in research, it attempts to walk the fine line between type-I and type-II error. A type-I error is where the researcher rejects the null hypothesis when it is true, and the probability of this is denoted as alpha. A type-II error is where the researcher accepts the null hypothesis when it is false. The probability of this is denoted as beta. As you say, the alpha level of 0.05 is arbitrary, but alpha and beta probabilities are inversely related. If you reduce alpha too far, you increase beta to unacceptable levels. If you reduce beta too far, you increase alpha to unacceptable levels. This is why the maximum level for alpha has been (arbitrarily) fixed at 0.05. Power analysis is used to reduce beta as far as possible. It is for the researcher to decide which error is the most important, depending upon what the experiment is testing. However, most of the problems occur when people misinterpret the meaning of p. If you conduct an experiment with alpha = 0.05 and the results show p = 0.04, this does not mean your hypothesis is proven; it means only that the probability that the result happened by chance is 4%. In other words, you (the researcher) have a four percent chance of being wrong if you accept your alternative hypothesis (reject the null hypothesis). Moreover, alpha can be set at any lower level you wish, depending on which you consider to be the most important error, or the error with the most dire implications: failing to detect an effect that exists (type-II), or detecting an effect which does not exist (type-I). In areas that generate less noisy data than Psychology, alpha is often set by the researchers at 0.01, or even 0.001. b) There are usually much better tests available. The only reason p-values are used is that those better tests are often more complicated to implement; p-values themselves have few desirable statistical properties. They're usually abused to attempt to give significance to very small samples. Rubbish. Very small samples increase beta, the probability of a type-II error. In other words, the smaller the sample, the less representative of the population it is, and the greater the chance you will fail to detect an effect that does exist within that population (type-II error). Sample size, effect size and power are all interrelated. Sample size is a key factor in the power of an experiment to detect an effect (avoid a type-II error). Take, for example, Pearson's product moment correlation.
Any first year knows that using a small sample, one can generate large correlation coefficients (r) that fail to reach significance, whereas with a large sample, comparatively small values for r can reach significance (a quick numerical sketch follows this post). The use of p-values has even been blamed for a spate of apparent 'breakthroughs' in pharmaceutical trials a few years back. What happened was that a large number of initial trials reported supposedly 'significant' results based on the 'p < 0.05' criterion; however, larger and more extensive trials revealed that in most cases there was no actual benefit to patients. The same has been shown for a number of antidepressant drugs. Under strictly controlled and finely measured conditions, the drugs have a small but statistically significant effect. However, when used in the real world (outside of laboratory conditions) these drugs had no measurable (clinical) effects. In fact they would still have been having some effect, but that would have been buried under the 'noise' of real-world conditions (i.e. the original experiments lacked ecological validity). What you are talking about here is the difference between statistical and clinical significance, and also the failure of some people to understand the difference between statistical significance and effect size. Effect size and statistical significance are different things. Generating p less than 0.0005 (for example) does not mean there is a large effect. It simply means the chance that you are wrong in rejecting the null hypothesis is very small. As shown by the Pearson's example, small effects can achieve statistical significance if the experiment is of sufficient power. All a low value for p means is that there is a greater probability that the effect does exist, not that it is large. Effect size is calculated by dividing the difference between the sample means by the (pooled) standard deviation, and is completely different from p. A statistically significant but small effect can be genuine, but have no clinically significant effect. In other words, to be clinically significant, an effect needs to be of sufficient size to have a measurable effect in real-world application (i.e. an effect that is detectable through all the other 'noise' generated outside of controlled conditions). In summary, forget about p-values; they're complete and utter rubbish. 1) This is not the kind of advice you should be offering to second-year research methods students. Mainly because... 2) You are wrong. Errors concerning statistical significance occur through misuse and abuse by people who have failed to understand the principles and functions of research methods and statistics. Out-of-hand rejection of basic statistical principles tends to come from the same people.
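A quick numerical sketch of that Pearson point (my own illustration, not from the thread, using the standard t-transformation of r; the particular r and n values are just examples):

```python
# A rough illustration of the Pearson point (my own numbers, not from the thread),
# using the standard t-transformation of r: t = r * sqrt((n - 2) / (1 - r**2)).
import math
from scipy import stats

def p_for_r(r, n):
    """Two-tailed p-value for a Pearson correlation r observed in a sample of size n."""
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    return 2 * stats.t.sf(abs(t), df=n - 2)

print(p_for_r(0.60, 10))   # large r, small sample: p ~ 0.067, not significant at 0.05
print(p_for_r(0.20, 150))  # small r, large sample: p ~ 0.014, significant at 0.05
```

The same r is 'significant' or not depending almost entirely on sample size, which is why statistical significance and effect size have to be kept separate.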
wolfson Posted March 25, 2004 OK, so I may have been slightly too loose with my wording; however, we all know that does occur. The p-value is the probability that an effect as large as or larger than that observed would occur in a correctly planned, executed, and analysed study if in reality there were no difference between the groups, i.e., that the outcome was due entirely to chance variability between individuals or measurements alone. A p-value isn't the probability that a given result is right or wrong, the probability that the result occurred by chance, or a measure of the clinical importance of the results. A very small p-value cannot compensate for the presence of a large amount of systematic error (bias). If the opportunity for bias is large, the p-value is likely unfounded and irrelevant. Also, p-values may be unreliable, because they correspond to events that have not been explored by the model in the available control integrations. (References: Durk, M., Advanced Statistical Analysis, 2001; Olive, L., An Introduction to Research Methods and Statistical Error, 2003; Moore, D. & McCabe, G., Introduction to the Practice of Statistics, 2nd ed., 1993.)
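That definition of p can be made concrete with a small simulation. This is a hedged sketch (the data are invented and the choice of a permutation test is mine, not something from the thread): shuffle the group labels many times to mimic a world in which there is genuinely no difference, and count how often a difference at least as large as the observed one turns up.

```python
# Hedged sketch of wolfson's definition (invented data; permutation test is my
# choice of illustration): shuffle the group labels to mimic "no real difference"
# and count how often a mean difference at least as large as the observed one appears.
import numpy as np

rng = np.random.default_rng(1)
group_a = np.array([4.1, 5.3, 6.0, 5.5, 4.8, 6.2, 5.9, 5.1])
group_b = np.array([5.8, 6.4, 7.1, 6.0, 6.9, 5.7, 6.6, 7.0])
observed = abs(group_a.mean() - group_b.mean())

pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
n_perm = 10_000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)  # random relabelling: the "no difference between groups" world
    diff = abs(pooled[:n_a].mean() - pooled[n_a:].mean())
    if diff >= observed:
        count += 1

print(count / n_perm)  # simulated p: proportion of shuffles at least as extreme as the data
```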
Glider Posted March 26, 2004 wolfson said in post # : OK, so I may have been slightly too loose with my wording; however, we all know that does occur. Yes, that's true, and it is a problem, but it's more due to misuse by researchers than an inherent problem with the statistic. The p-value is the probability that an effect as large as or larger than that observed would occur in a correctly planned, executed, and analysed study if in reality there were no difference between the groups, i.e., that the outcome was due entirely to chance variability between individuals or measurements alone. A p-value isn't the probability that a given result is right or wrong, the probability that the result occurred by chance, or a measure of the clinical importance of the results. It is both, if you think about it. P stands for both proportion (under the normal distribution) and denotes an area of the distribution within which your effect probably falls (assuming, as you rightly point out, the research is sound and free of any confounds or bias, which is extremely rare). By the same token, it denotes the probability that you will be wrong if you reject the null hypothesis. It's essentially the same thing worded differently. A very small p-value cannot compensate for the presence of a large amount of systematic error (bias). If the opportunity for bias is large, the p-value is likely unfounded and irrelevant. Of course it can't. But then the same can be said of any test statistic. The presence of a confounding variable, i.e. an uncontrolled systematic error, renders all results worthless. Also, p-values may be unreliable, because they correspond to events that have not been explored by the model in the available control integrations. A p value will only be as reliable as the experiment that generated the data it is based upon. The principal problem is the failure by many researchers to understand this, i.e. that although stats software often presents precise 'exact p' values to three or four decimal places, this does not mean they are accurate. They are only as good as the data they are based upon. If sample selection, data collection or any number of other factors are flawed, then any results based upon those data will also be flawed. The main debate is that, given there will be noise in any experimental data, how much weight should be given to precise values and rejection levels? For example, Howell (1997) states in his book: "The particular view of hypothesis testing described here is the classical one that a null hypothesis is rejected if its probability is less than the predefined significance level, and not rejected if its probability is greater than the significance level. Currently a substantial body of opinion holds that such cut-and-dried rules are inappropriate and that more attention should be paid to the probability value itself. In other words, the classical approach (using a .05 rejection level) would declare p = .051 and p = .150 to be (equally) "nonsignificant" and p = .048 and p = .0003 to be (equally) "significant." The alternative view would think of p = .051 as "nearly significant" and p = .0003 as "very significant." While this view has much to recommend it, it will not be wholeheartedly adopted here.
Most computer programs do print out exact probability levels, and those values, when interpreted judiciously, can be useful. The difficulty comes in defining what is meant by "interpreted judiciously"." So it seems that whilst p values are useful, potential problems stem from inappropriate interpretation of them. Howell, D. C. (1997). Statistical Methods for Psychology (4th ed.). International: Wadsworth.
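On the 'p is an area under the distribution' point, here is a tiny sketch of where an exact p for a result like Cyrus's comes from. The degrees of freedom below are an assumption on my part, since the original post does not give the group sizes.

```python
# Small sketch of the "p is an area under the distribution" point: the exact
# two-tailed p is just the area beyond |t| in both tails of a t distribution.
# The degrees of freedom are an assumption; Cyrus's post doesn't give them.
from scipy import stats

t_value = -2.459
df = 30  # assumed df, chosen only for illustration
p = 2 * stats.t.sf(abs(t_value), df)
print(p)  # roughly 0.02, in line with the p-value Cyrus reported
```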
wolfson Posted March 26, 2004 As I mentioned, my most obvious objection is to the whole 'p < 0.05' criterion. There is absolutely no reason why p < 0.05 should indicate 'significance' and p > 0.05 should not; the threshold value is completely arbitrary. The statistician who first suggested the use of p-values just picked the number 0.05 for no reason other than that it's small. My second objection is that there's often no clearly defined way of calculating p-values accurately. A p-value is supposed to indicate the probability that the observations happened 'by chance'; the idea of the 'p < 0.05' criterion is that if the probability of the results happening 'by chance' is low, then you've likely observed something 'significant'. The trouble is, how do you calculate the probability that something happened 'by chance' when all you have is a set of data? You have to assume that the data were generated by some particular model, and estimate the probabilities based on that model. There's nothing immediately wrong with doing that (in fact it's only very recently that techniques have been devised that don't require you to do this), but to calculate accurate p-values you would need to know the precise model generating the data, rather than just restricting to a particular class of models, as other tests, such as the t-test, do (apologies if this sounds very vague, but things would get very long-winded if I were to try to be more precise). So to calculate (estimate, actually) the p-values you have to pick a model based on the data you have, then use the data again to estimate the p-values; you're using the data twice. That's not necessarily a bad thing in itself (I can think of several good techniques that do this), but it's certainly dubious, and needs theoretical justification, which is lacking in the case of p-values. But my biggest objection to p-values is their scope for abuse. P-values don't give you any estimate of how accurate or otherwise your findings are; they just say 'significant' or not, with no indication of how accurate this answer is (you can't define a confidence interval based on a p-value alone, but you can for the t-test and others). A basic principle that I adhere to is that unless a test can give you some idea of how accurate its conclusions are, you can't rely on anything it tells you. Statistical inference just isn't as straightforward as saying 'significant' or not.
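To make the confidence-interval contrast concrete, here is a hedged sketch (invented data, equal-variance assumption as in the classic independent t-test) that reports both an interval estimate for the mean difference and the t-test's p-value:

```python
# Hedged sketch (invented data, equal-variance assumption as in the classic
# independent t-test): a t-test supports an interval estimate of the mean
# difference, which a bare p-value on its own does not give you.
import numpy as np
from scipy import stats

a = np.array([12.1, 11.4, 13.0, 12.7, 11.9, 12.5, 13.3, 12.0])
b = np.array([10.8, 11.2, 10.5, 11.9, 10.9, 11.5, 10.7, 11.1])

diff = a.mean() - b.mean()
n_a, n_b = len(a), len(b)
df = n_a + n_b - 2
pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / df
se = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
t_crit = stats.t.ppf(0.975, df)  # two-tailed 95% critical value

print(diff - t_crit * se, diff + t_crit * se)  # 95% CI for the mean difference
print(stats.ttest_ind(a, b))                   # t and p, for comparison
```

The interval tells you both the direction and a plausible range for the size of the difference, which the p-value alone does not.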
Glider Posted March 27, 2004 wolfson said in post # : As I mentioned, my most obvious objection is to the whole 'p < 0.05' criterion. There is absolutely no reason why p < 0.05 should indicate 'significance' and p > 0.05 should not; the threshold value is completely arbitrary. The statistician who first suggested the use of p-values just picked the number 0.05 for no reason other than that it's small. That's right. Because it is small and results in a small chance of a type-I error. The precise value for alpha can be set at whatever value the researcher thinks is appropriate (although it would be extremely unwise to increase alpha). That alpha is simply a convention is not grounds for rejecting it. Moreover, it means that if you don't agree with it, you can change it. The responsibility lies with the researcher, not with the concept of significance. My second objection is that there's often no clearly defined way of calculating p-values accurately. A p-value is supposed to indicate the probability that the observations happened 'by chance'; the idea of the 'p < 0.05' criterion is that if the probability of the results happening 'by chance' is low, then you've likely observed something 'significant'. That's right. This comes down to the definition of 'significance'. As alpha is only a convention, 'significance' means only that, should a researcher consider a result significant, there is an acceptably low probability of falsely rejecting the null hypothesis. It does not mean that an effect showing p less than 0.05 is 'true' and p greater than 0.05 is 'false'. The trouble is, how do you calculate the probability that something happened 'by chance' when all you have is a set of data? You have to assume that the data were generated by some particular model, and estimate the probabilities based on that model. There's nothing immediately wrong with doing that (in fact it's only very recently that techniques have been devised that don't require you to do this), but to calculate accurate p-values you would need to know the precise model generating the data, rather than just restricting to a particular class of models, as other tests, such as the t-test, do (apologies if this sounds very vague, but things would get very long-winded if I were to try to be more precise). Most comparisons are made against known population parameters, i.e. normative values which are known within a given population. If your sample is representative, then these values (mean and SD) will be the same within your sample and the population. You apply your intervention and measure the degree to which these values have changed (e.g. a t-test). The t-test provides a statistic which indicates the magnitude of the difference (you assume is) caused by your experimental intervention, and is a test of whether the two samples of data come from a broad, single distribution (i.e. one population with high variance) or from two separate (although overlapping) distributions. Alpha is set to provide the best balance between the probabilities of type-I and type-II errors. If it is set at 0.05 and your t statistic is shown to be significant at alpha = 0.05, that simply means you have an acceptably low probability of being wrong by rejecting the null hypothesis (A = B) in favour of the alternative (A ≠ B). The p-value is simply the probability that an effect as large as or larger than that observed would occur by chance if there were no other differences between the groups.
By inversion, p = 0.05 denotes a 95% chance that the difference was due to the experimental intervention. But as I said, depending on what is being measured, it is for the researcher to decide whether a 5% chance of being wrong is acceptable. Software may provide apparently highly precise values for p, but that does not mean they are accurate (if you think about it, the term 'accurate measure of probability' is almost an oxymoron). They can only be as accurate as your sample is representative, and even then are subject to the effects of error and noise. Nor does it mean that they have to be accepted. The researcher must decide which error is the most important (type-I or II), and should also have an idea of how much of a difference their intervention must cause for it to be acceptable (i.e. they should have calculated an acceptable effect size). p values are simply a guide. Over-reliance on p values and computer output is no substitute for common sense and sound data, and I admit freely that there are many researchers who depend entirely upon their SPSS output as the be-all and end-all, without considering other factors. So to calculate (estimate, actually) the p-values you have to pick a model based on the data you have, then use the data again to estimate the p-values; you're using the data twice. That's not necessarily a bad thing in itself (I can think of several good techniques that do this), but it's certainly dubious, and needs theoretical justification, which is lacking in the case of p-values. All statistics are estimates. The whole point of them is to be able to achieve reasonable estimates of population values to avoid the need for measuring every individual within a population. Statistics are based on sample estimates of population parameters. Even the calculation for the sample SD (the square root of the sum of (x − mean)² divided by (n − 1)) uses n − 1 as an attempt to account for error, which makes the sample SD an estimate of the population SD (where the whole lot is divided by N). The validity of any statistic (including p) depends upon the validity of your data, and the degree to which your sample is truly representative of the population of interest. But my biggest objection to p-values is their scope for abuse. P-values don't give you any estimate of how accurate or otherwise your findings are; they just say 'significant' or not, with no indication of how accurate this answer is (you can't define a confidence interval based on a p-value alone, but you can for the t-test and others). A basic principle that I adhere to is that unless a test can give you some idea of how accurate its conclusions are, you can't rely on anything it tells you. Statistical inference just isn't as straightforward as saying 'significant' or not. p values do not say 'significant or not'. They provide an estimate of the probability that the difference you observed could happen by chance alone. It is the researcher who must decide whether or not that probability is low enough to reject the null hypothesis with a reasonable degree of safety, and whether it is safe to assume that the difference was due to their intervention and not some other factor(s). In order to make that decision, the researcher has to take into consideration the power of the experiment, the effect size, the sampling method, the type of data, the level of measurement, possible confounds, the degree of noise and all the other factors that may influence the outcome and validity of the experiment.
As you said yourself, alpha is arbitrary, so by definition it is the researcher's decision. I agree there is large scope for abuse, but then that's true of all statistics; there are lies, damned lies, and then there are statistics (check out the use politicians and advertisers make of stats). But it is people who abuse the statistics. To blame the statistics themselves is a bit pointless. In fact, the greater the understanding of the underlying principles of statistics, the less the scope for abuse (unless you're an unscrupulous git of course; cf. the politicians and advertisers thing).
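For completeness, a minimal sketch (mine, using the usual textbook definitions rather than anything from this thread) of the two quantities Glider keeps separating from p: the sample SD with its n − 1 correction, and a standardised effect size (Cohen's d):

```python
# Minimal sketch (usual textbook definitions) of two quantities distinct from p:
# the sample SD with its n - 1 correction, and a standardised effect size
# (Cohen's d: difference in means divided by the pooled SD).
import numpy as np

def sample_sd(x):
    x = np.asarray(x, dtype=float)
    return np.sqrt(np.sum((x - x.mean()) ** 2) / (len(x) - 1))  # divide by n - 1, not N

def cohens_d(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_sd = np.sqrt(((n_a - 1) * sample_sd(a) ** 2 +
                         (n_b - 1) * sample_sd(b) ** 2) / (n_a + n_b - 2))
    return (a.mean() - b.mean()) / pooled_sd

a = [450, 430, 470, 455, 445, 465]   # hypothetical scores
b = [480, 490, 470, 500, 485, 475]   # hypothetical scores
print(sample_sd(a))    # estimate of the population SD from the sample
print(cohens_d(a, b))  # effect size: unlike p, it does not shrink or grow with sample size
```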
wolfson Posted March 29, 2004 On this occasion we will have to agree to disagree.