The Trite Scienceisms, part 1.

“The amount of energy necessary to refute bullshit is an order of magnitude greater than to produce it.” Alberto Brandolini.

Richard Morey writes here that we are training psychology undergraduates in the art of bullshit, with particular reference to how we teach research methods. I’ve been thinking for some time (I started this post about 3 weeks before I saw that blog!) that the problem is much more pervasive than merely being isolated to psychology undergrads. It affects how our society interacts with research on a more fundamental level.

Ben Goldacre wrote that Gillian McKeith’s “academic” “work” had an ‘air of science’ about it – to the layperson, it seemed ‘sciencey’ and that conferred validity. It had posh words and superscript numbers pointing to footnotes and references. There are a fair few ‘scientific memes’ (to use the word in its original academic sense) that laypeople often repeat as if they were making a worthwhile scientific or methodological point. In reality they add nothing of substance, and they detract from the debate because of the effort it takes either to refute the usually meaningless claims or simply to move the debate on from them.

I call these the Trite Scienceisms: overused, unoriginal statements which have been divorced from their proper context and have gained cultural cachet due to that ‘air of science’ they have, but which are often simple truisms. Critiquing research is a balancing act – what effect will X have had on the conclusion? Can it still be trusted? Does Y counteract X? – while the Trite Scienceisms are often used merely to disregard research out of hand, and that’s why they are harmful.

I’m going to start with the one that annoys me the most, as a statistics nerd.

Correlation does not imply causation.

This statement has its origins in formal logic. Translated into everyday terms, it more correctly means “correlation [eg temporal correlation] does not mean causation” – the presence of causation cannot be directly inferred from the presence of correlation alone. A correlation can exist between two variables when, in reality, they are both caused by a third.
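The third-variable case is easy to see in a small simulation. The sketch below is only an illustration under made-up assumptions: a hidden variable `z` drives both `x` and `y`, each with its own independent noise, and `x` never influences `y` at all. The names, the noise levels and the sample size are all hypothetical choices, not data from any study.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_confound(n=5000, seed=1):
    """x and y are both caused by z (hypothetical setup); x never affects y."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]   # hidden common cause
    x = [zi + rng.gauss(0, 1) for zi in z]    # x = z + independent noise
    y = [zi + rng.gauss(0, 1) for zi in z]    # y = z + independent noise
    return pearson_r(x, y)


r_confounded = simulate_confound()
print(round(r_confounded, 2))
```

With these settings the expected correlation is 0.5, even though neither variable causes the other. Working out that `z` is responsible is a matter of logic about the design, not something the coefficient itself can tell you.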

In natural language, however, the meaning has been warped to be something more like “correlations are not evidence of causation”, which is used to dismiss all non-experimental research.

The main reason this annoys me is that it is, statistically, gibberish. Causality is inferred purely by logical criteria (eg cause precedes effect), not statistical ones. The correlation coefficient r and the standardised mean difference d are directly related, and you can convert one to the other quite easily: r = d / sqrt(d^2 + a), where a is a separately calculated correction factor for varying sample sizes. In fact, r is often used as the effect size measure for the t-test, which tests the difference between the means of two samples (as in an experimental study). The statistics used in observational and experimental studies are frequently interchangeable (albeit some are easier to understand in specific contexts), and thus neither alone can demonstrate causality.
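To make the interchangeability concrete, here is a minimal sketch of the conversions. It assumes the commonly used form of the correction factor, a = (n1 + n2)^2 / (n1 * n2), which equals 4 when the two groups are the same size, and the usual t-to-r formula r = t / sqrt(t^2 + df). The function names are mine, chosen for illustration.

```python
import math


def correction_factor(n1, n2):
    """Assumed form of 'a': (n1 + n2)^2 / (n1 * n2); equals 4 for equal groups."""
    return (n1 + n2) ** 2 / (n1 * n2)


def d_to_r(d, n1, n2):
    """Convert a standardised mean difference d to a correlation r."""
    a = correction_factor(n1, n2)
    return d / math.sqrt(d * d + a)


def r_to_d(r, n1, n2):
    """Inverse of d_to_r for the same group sizes."""
    a = correction_factor(n1, n2)
    return math.sqrt(a) * r / math.sqrt(1 - r * r)


def t_to_r(t, df):
    """Effect size r from a t statistic and its degrees of freedom (keeps the sign of t)."""
    return t / math.sqrt(t * t + df)


print(round(d_to_r(0.5, 10, 10), 3))   # d = 0.5 with two groups of 10
print(round(t_to_r(-1.930, 18), 3))    # a t-test result expressed as r
```

The same number can be reported as a 'difference between groups' or as a 'correlation', which is why the choice of statistic says nothing about whether the design supports a causal claim.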

More insidious, however, is that this argument is used to invalidate swathes of research that cannot be done experimentally – usually for ethical reasons, or because the topic is not one amenable to experimental manipulation. In its lay use, it is equally valid when applied to the relationship between socioeconomic deprivation and rates of depression as it is when applied to the relationship between the rate of assaults and the rate of ice cream sales. For one of these, a logical case can be made to demonstrate causality (because all arguments of causality must be based on logic), but both are invalidated by this, my most hated Trite Scienceism.

There are, thankfully, other things you can say in place of it! A lot of the time, people mean that they think an observed relationship is ‘spurious’ – that a causal relationship hasn’t been logically shown. Ask why the authors are saying X causes Y, instead of simply asserting that observing a relationship does not allow inference. Of course, most of the time people haven’t read the study and rely on media reporting (more later), when the study itself would have addressed the issue of causality.

The study had a small sample size.

This one also annoys me as a statistics nerd. It is another truism deployed against any piece of research that someone simply wants to dismiss for whatever reason. The criticism has its place in certain contexts – statistical power analysis, various qualitative methodologies – but it is a meaningless statement outside those contexts.

Consider two hypothetical populations: one has a mean of 100 and an SD of 10; the other has a mean of 105 and an SD of 10. In practice we can’t know what the population statistics are, so we sample them. In our first study, we sample 10 people from each population [using randomly generated data], giving us M=98.46 SD=13.42 vs M=106.87 SD=3.13. There is no significant difference observed in the samples, t(18) = -1.930, p =.07.

Let’s try this again, with a sample of 100 in each group. This gives us M=97.96 SD=9.91 vs M=105.26 SD=9.72. There is a statistically significant difference in this example, t(198) = -5.260, p <.001. This is what larger sample sizes do – they allow us to detect smaller differences and to be more certain of their existence. In the first example, the chance of observing a difference as large as or larger than we did, assuming both population means were equal, was 7%, while in the second example the probability was less than 0.1%. The larger sample passes the standard psychological science threshold of 5%, but the smaller sample doesn’t.

What if the second population mean was 150 instead of 105? This gives us M=98.51 SD = 11.27 vs M=151.52 SD = 6.96, with 10 samples from each population. In this instance, there is a significant difference, t(18) = -12.66, p <.001.
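For anyone who wants to check the arithmetic, the t statistics in the three examples so far can be recomputed from the reported means, SDs and group sizes. This is a sketch assuming a standard pooled-variance (equal variances assumed) independent-samples t-test, which is my guess at the test used; small discrepancies in the last decimal place are expected because the summary figures are rounded.

```python
import math


def pooled_t(m1, sd1, n1, m2, sd2, n2):
    """Independent-samples t statistic from summary data, pooled variance assumed."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df


# Summary statistics as reported in the three examples above.
examples = {
    "n=10, means 100 vs 105": (98.46, 13.42, 10, 106.87, 3.13, 10),
    "n=100, means 100 vs 105": (97.96, 9.91, 100, 105.26, 9.72, 100),
    "n=10, means 100 vs 150": (98.51, 11.27, 10, 151.52, 6.96, 10),
}

results = {label: pooled_t(*stats) for label, stats in examples.items()}
for label, (t, df) in results.items():
    print(f"{label}: t({df}) = {t:.2f}")
```

Getting a p value from t needs the t distribution's CDF, which the Python standard library doesn't provide. If SciPy is available, `scipy.stats.ttest_ind_from_stats` takes the same six summary numbers and returns both the statistic and the p value.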

The issue in the first example was never the small sample size, but the power of the analysis. Power refers to the ability of a statistical analysis to detect an effect that truly exists. Increasing the sample size is one way to increase statistical power, but there are others, such as increasing the effect size, which is essentially what we did in the final example. In the real world, you can do this by, for example, administering higher doses of an experimental drug.
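A rough way to put numbers on this is to approximate the power of a two-group comparison from the effect size and the group size. The sketch below uses a normal approximation to the two-sided t-test, so it slightly overstates power for very small groups; treat it as a back-of-envelope estimate rather than a proper power analysis. The effect sizes come from the populations above: means of 100 vs 105 with SD 10 is d = 0.5, and 100 vs 150 is d = 5.

```python
import math
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * math.sqrt(n_per_group / 2)   # expected size of the test statistic
    # Probability of landing beyond either critical value.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


power_small_n = approx_power(0.5, 10)     # first example: d = 0.5, 10 per group
power_large_n = approx_power(0.5, 100)    # second example: same effect, 100 per group
power_big_effect = approx_power(5.0, 10)  # third example: d = 5, 10 per group

for label, p in [("d=0.5, n=10", power_small_n),
                 ("d=0.5, n=100", power_large_n),
                 ("d=5.0, n=10", power_big_effect)]:
    print(f"{label}: power ~ {p:.2f}")
```

On this approximation the first study had only about a one-in-five chance of detecting the difference, while either a bigger sample or a bigger effect pushes power well above 90%.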

Conversely, large sample sizes can be a cause for concern, but they are never criticised because they are seen as inherently better. In this example, we have two populations with identical means (100) and standard deviations (10), from which we randomly sample 2000 cases each. This gives us M=100.35 SD=9.90 vs M=99.95 SD=10.09. This difference is statistically significant, t(3998) = 2.204, p =.028.

Meanwhile, if we only used 10 cases from each population, the result is not significant, t(18) = 1.088, p =.29. Thus, in this example, the smaller sample gives us the true answer while the large sample gives us a spurious, chance result.

What larger sample sizes do is let us estimate population parameters (eg population mean height) with more precision. In the above example, if we use 20 cases, we can say with 95% confidence that the population mean is between 94.24 and 103.55 (we ‘know’ the true mean is 100, because we generated the population, but in ‘real life’ we wouldn’t know). With 4000 cases, we can say with 95% confidence that the true mean is between 99.69 and 100.31. If you’re testing for a difference (eg ‘does drug 1 have fewer side effects than drug 2?’), small sample sizes are often fine, and won’t lead to clinically unimportant differences (eg 0.02% fewer side effects) being shown as statistically significant. If you want to estimate the number of side effects exactly, a larger sample size is better.
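The precision gain follows directly from the standard error shrinking with the square root of the sample size. As a sketch, assuming the population SD of 10 is known and using the normal critical value, the half-width of a 95% interval for the mean is:

```python
import math
from statistics import NormalDist


def ci_half_width(sd, n, confidence=0.95):
    """Half-width of a confidence interval for a mean (normal approximation, known SD)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sd / math.sqrt(n)


half_width_20 = ci_half_width(10, 20)
half_width_4000 = ci_half_width(10, 4000)

print(f"n=20:   +/- {half_width_20:.2f}")
print(f"n=4000: +/- {half_width_4000:.2f}")
```

The interval quoted above for 20 cases is a little wider than this gives, presumably because it was built from the sample SD and a t critical value rather than the known population SD. Note that going from 20 to 4000 cases is a 200-fold increase in sample size, but only about a 14-fold gain in precision.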

The general advice in statistics is to use as small a sample size as you can get away with, calculated from the effect size you are expecting, the significance level you have chosen, and so on. Adding yet more cases is a known method of ‘fudging’ results – minute, irrelevant differences can pass significance thresholds by chance alone, which is harder in smaller samples, because passing the threshold in a small sample requires a larger effect size.
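That calculation can be sketched in a few lines. This version assumes a two-sided comparison of two equal-sized groups and uses the textbook normal-approximation formula, n per group = 2 * ((z_alpha + z_power) / d)^2. Dedicated power software uses the t distribution and will usually ask for one or two more cases per group. The default of 80% power is a common convention, not something taken from the examples above.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate cases needed per group to detect effect size d (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided significance threshold
    z_power = z.inv_cdf(power)           # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} per group")
```

The smaller the effect you expect, the more cases you need, and the rise is steep: halving the expected effect size roughly quadruples the required sample.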