Collective Fraud in Plain Sight: Bad Statistics in Health Science

The Boomer Times
Feb 23, 2021

A disclaimer to be made: I do not think that any of this is even intentional. The errors are far too basic to be part of a meticulous plan for “liberal indoctrination.” In fact, they are so basic that people may not even believe me.

Statistical Illiteracy

Misinterpretation of statistics is a pervasive phenomenon in health behavior intervention research, and the people who do this sort of research are not confined to a single topic; individuals with a poor sense of statistics have publications of all kinds. Bad interpretations are baked into the cake by researchers’ preconceived beliefs, which might have nothing to do with health behavior change.

A letter to the editors of the Journal of the American Medical Association illustrates bad interpretation of data (Johnson et al., 1997). The statistical power of a study with a sample size of 100 to detect an effect size of 0.35 standard deviations, using a two-tailed t-test at a 0.05 significance level, is only 0.42. Small samples are more likely to land far from the true parameter than large ones are.
See Cugelman et al., 2011 for an improper interpretation of this chart: the authors ignored statistical power when rating the quality of studies, even though statistical power is part of the Downs and Black instrument that they used for those quality ratings!
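
For readers who want to check that 0.42 figure themselves, here is a minimal sketch using the statsmodels library, assuming the sample of 100 splits evenly into two groups of 50 (the exact design is not spelled out above):

```python
# Power of a two-tailed independent-samples t-test at alpha = 0.05
# to detect d = 0.35 with 50 participants per group (assumed 50/50 split of N = 100).
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower().power(
    effect_size=0.35,
    nobs1=50,
    alpha=0.05,
    alternative="two-sided",
)
print(round(power, 2))  # ~0.41 (close to the 0.42 above, which likely uses a normal approximation)
```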

For example, in the 5th edition of the textbook “Health Behavior Change: Research, Theory, and Practice,” the authors point out that health behavior change interventions with smaller groups tend to be more effective. While theoretically possible, no explanation is given for why this might be the case beyond a citation to the disputed belief that smaller class sizes are more effective in education. There is enough evidence from both American and European data to contradict that assertion about class sizes (Schanzenbach, 2014; Filges et al., 2018). A causal inference should not be made about the relationship between class size and the success of educational instruction, and even if there were one, it should not be extrapolated to intervention group sizes.

If we are going to be honest, which is the more parsimonious explanation: that smaller sample sizes produce larger effects as a result of sampling error and publication bias, or that there is a causal interaction between group size and the effectiveness of an intervention? It is incredibly difficult to make any rational argument that the latter is more likely on its face, given that the former has recently been observed in other fields.

How to Interpret Effect Sizes and Odds Ratios

Effect sizes are a measure of how much two groups of people differ on the variable being measured. The most common method is a standardized mean difference in Cohen’s d units: Cohen’s d is the difference between the two group means divided by the standard deviation. This expresses the difference between groups in terms of standard deviations, which allows comparisons across unrelated variables. For example, the effect size for the difference in height between men and women within the same ethnic group is normally about 1.7 standard deviations, or 1.7 “Cohen’s d units.”
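
As a quick illustration of the arithmetic, here is a minimal sketch; the heights below are made-up numbers chosen only to show the mechanics, not data from any study cited here:

```python
# Cohen's d: difference in group means divided by the pooled standard deviation.
import numpy as np

def cohens_d(group_a, group_b):
    a, b = np.asarray(group_a, dtype=float), np.asarray(group_b, dtype=float)
    pooled_var = (((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                  / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical heights (cm) for two small groups of men and women.
print(round(cohens_d([170, 185, 175, 180, 168], [160, 172, 158, 167, 162]), 2))
# ~1.8 with these invented numbers, in the ballpark of the ~1.7 figure mentioned above
```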

The conventional interpretation of Cohen’s d is 0.2 = small effect, 0.5 = medium, 0.8 = large. Not all studies use this method, however; sometimes studies look for a correlation between variables, and sometimes they look at a difference in probability expressed as an odds ratio. If the variable you are looking at is “being six feet tall,” the odds ratio for American men vs. women is about 14.5, meaning the odds of being six feet tall are roughly 14.5 times higher for men.
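
For reference, an odds ratio is just the odds of the outcome in one group divided by the odds in the other. A minimal sketch with hypothetical proportions (not real survey data):

```python
# Odds ratio from two proportions: odds = p / (1 - p) for each group.
def odds_ratio(p_group1, p_group2):
    return (p_group1 / (1 - p_group1)) / (p_group2 / (1 - p_group2))

# Hypothetical: the outcome occurs in 55% of one group vs. 50% of the other.
print(round(odds_ratio(0.55, 0.50), 2))  # ~1.22 from only a 5-point gap in probability
```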

Correlations, standardized mean differences, and odds ratios can each be converted from one to another algebraically, and many people in these fields seem to have forgotten this or never learned it. I never learned how to convert them in school, and I took a university class in health research statistics.

The formulas to convert these numbers might look a little scary to the average person who has rarely used the number pi outside of calculating the dimensions of a circle, but there are websites that can do it for you. Whenever you read a study that gives odds ratios, you should try to think of them in terms of effect sizes.
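
For anyone who wants to do the conversion without a website, here is a sketch of one common set of approximations (the logit-based conversions found in most meta-analysis texts, assuming roughly equal group sizes):

```python
import math

def or_to_d(odds_ratio):
    """Odds ratio -> Cohen's d (logit method)."""
    return math.log(odds_ratio) * math.sqrt(3) / math.pi

def d_to_or(d):
    """Cohen's d -> odds ratio."""
    return math.exp(d * math.pi / math.sqrt(3))

def d_to_r(d):
    """Cohen's d -> correlation coefficient (assumes equal group sizes)."""
    return d / math.sqrt(d ** 2 + 4)

print(round(or_to_d(1.81), 3))  # ~0.327, an odds ratio that comes up later in this piece
print(round(d_to_or(0.08), 3))  # ~1.156, the source of a "15.6%" framing discussed later
```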

The Anti-Racism Industry Epitomizes Odds Ratio Abuse

Let’s use an example that people have heard some variant of. A study finds that, after 100 mock trials with an all-white jury, white Americans are 10% more likely to find an African American man guilty if he has dreadlocks, 10% more likely if he has tattoos, and 10% more likely if he has a gold chain on. This is how the media is going to frame odds ratios because it is easily understood: an odds ratio of 0.9 means 10% less likely, 1.1 means 10% more likely, and an odds ratio of 2.0 means 100% more likely. Most Americans, especially those on the left, will read this and see nothing wrong with drawing conclusions about people being racist from this data. The reason we cannot draw conclusions from this data is that the statistical power is too low to detect an effect this small.

Statistical power is the probability of finding an effect if there is one. If you have a low probability of finding an effect, then finding one is likely the result of sampling error. This is why there is often a negative correlation between the size of the sample and the effect size detected. Increasing the sample size reduces what is known as the standard error of the mean (commonly referred to as a margin of error). Another way to increase power is to increase the effect size; however, it is hard to increase an effect size without being dishonest, so increasing the sample size is the best way to do it.
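
The square-root relationship is why bigger samples pay off so quickly; a quick illustration with arbitrary numbers:

```python
# Standard error of the mean shrinks with the square root of the sample size.
import math

sd = 10.0                                   # arbitrary standard deviation
for n in (25, 100, 400):
    print(n, round(sd / math.sqrt(n), 2))   # SEM: 2.0, 1.0, 0.5; quadrupling n halves it
```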

The desired statistical power is .80; in practice it is typically lower, and .50 is actually pretty good. When the odds ratio of 1.1 is converted to an effect size, it comes out to 0.053 standard deviations, and with a sample the size of the mock-trial example, the statistical power of this hypothesis test is only going to be about 0.05.
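
Putting the conversion and the power calculation together for the mock-trial example (the 50/50 split of the 100 trials between conditions is my assumption, since the design is only sketched above):

```python
import math
from statsmodels.stats.power import TTestIndPower

d = math.log(1.1) * math.sqrt(3) / math.pi          # odds ratio 1.1 -> ~0.053 SD
power = TTestIndPower().power(effect_size=d, nobs1=50,
                              alpha=0.05, alternative="two-sided")
print(round(d, 3), round(power, 2))                 # ~0.053, with power barely above the 0.05 floor
```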

Revisiting the Story of Ronald Reagan and HIV

In almost every health science classroom that delves into history, there are two things that get talked about: “Tuskegee” (which is 80% myth, 10% fact, 10% performative activism) and how awful Ronald Reagan was on the HIV/AIDS front.

The central thesis of the widely held position is:

“We would have a radically different HIV/AIDS situation if Republicans and Ronald Reagan just acted better.”

There are two major presuppositions being made that are not exempt from scrutiny.

  1. Public policy has an influence on the sexual behaviors of individuals.
  2. If the government does have influence on the sexual behaviors of individuals, HIV-related behavior is one of the areas where people can be influenced by the government.

A review of 18 meta-analyses on the malleability of sexual behaviors via interventions found an effect size of 0.08 standard deviations (Johnson et al., 2010). This can be portrayed as “interventions increase behaviors that prevent STDs and unwanted pregnancies by 15.6%”; however, for the statistical reasons mentioned earlier, this is bad practice.
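
That “15.6%” is simply the 0.08 effect size pushed back through the odds-ratio conversion shown earlier; a one-line check:

```python
import math
print(round(math.exp(0.08 * math.pi / math.sqrt(3)), 3))  # ~1.156, i.e. "15.6% more likely"
```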

Looking into the Johnson et al., 2010 review by specific topic, there were 13 meta-analytic papers related to HIV-related sexual risk behavior interventions. One would think that HIV-related behavior changes would be easier to induce than others, but it turns out not so much: when the reviews related to HIV interventions are examined, the effect size does not change much at all.

The effect size for HIV interventions was a bit higher; however, there appears to be minor bias. A trim-and-fill analysis moves the needle a little, but not by a statistically significant amount.

The effect size does convert to an odds ratio of 1.177, which can be phrased as an 18% change and plugged into models that project major long-term results. The reality is that there is an extremely high probability that this effect is almost entirely a result of low statistical power and publication bias.

As far as how the malleability of sexual behaviors via intervention stacks up against other intervention targets, it is far and away the least malleable field in general.

Johnson et al., 2010

One thing to consider is the correlation between the number of studies and the meta-analytic effect size. It is equally possible that none of these interventions are all that effective in general.

Global Ineffectiveness of Condom Interventions In General

When I found out that people do not really listen all that much to being told to use barrier contraceptives, my personal intuition was that people in America are less inclined to listen because oral contraceptives are available and because people are generally not interested in partners they suspect may have an STD.

Surprisingly, it appears that I was wrong. In Charania et al., 2011 there was a considerable disparity between the results in terms of STD effects vs. behavior change effects, which raises the question: what makes you think that people are telling the truth? Beyond that, the data is essentially fraudulent in the sense that it is not really generalizable. An odds ratio of 1.81 is equivalent to an effect size of 0.327 Cohen’s d units, which is pretty good, but how they got this number makes it useless for talking about how malleable the behavior of the general population, particularly men, really is.

There is a kicker: the considerable positive effects are an artifact of overusing studies on “CSWs” (commercial sex workers).

If we do assume that people are telling the truth, we must ask who is changing their behavior. Of the 20 studies, 6 are American and 14 are international, and the majority of the international studies are on female commercial sex workers.

The US odds ratio of 1.41 is equal to an effect size of 0.189 Cohen’s d units, a small effect. When the 4 studies looking at youth are evaluated independently, there is no statistically significant effect.

In the international studies that were not targeting commercial sex workers, the effect size was non-significant and equivalent to 0.123 Cohen’s d units. A headline can point to the odds ratio of 2.09 as “over double” to make a medium effect (0.406 Cohen’s d units) seem like a remarkable find.

Another issue is that if you read some of the intervention studies synthesized, a lot of the women were self-selected! Not only are the female commercial sex workers a non-representative sample for talking about “global interventions” broadly, they are not even representative of female sex workers from their own countries!

Socioeconomics and Chronic Disease: Probably Hollow, Lots of Focus in Classes

When groups are declared to be “at risk,” it often means that they have an increased probability of contracting an illness, having complications, or being diagnosed with a chronic disease. Frequently, the effect size is actually really small, and the label is just a way to ruffle some feathers politically, because healthcare is probably one of the most polarizing issues in America.

In America, studies looking for a causal influence of socioeconomic status on obesity have been mostly hollow. A meta analysis found an odds ratio of just 1.22 in American studies (Kim & von dem Knesebeck, 2017). That is an effect size of just 0.110 and not a statistically significant difference from zero.

In Canada, a paper with a sample size of over 27,000 looked into physical inactivity and found the odds ratio of high- to low-income groups to be just 1.15 after controlling for age (Lemstra et al., 2015), which is an awfully small effect size; it is equivalent to just 0.077 Cohen’s d units.

Given the sample size of the Canadian paper, that is probably a true effect; however, it is just correlational. The researchers also found an effect size equivalent to 0.36 Cohen’s d units for heart disease between the high- and low-income groups. That is a reasonable concern that should be investigated further, especially given the way their healthcare system works.

In America, it is very easy to look at data and blame things on income; however, doing so rests on dubious merits. A huge problem is that there is an incentive to do this! You will get promotions and lots of grant money if you do exactly what you should not do (commit the sociologist’s fallacy).

It is true that the correlation between income and obesity at the state level was not always what it is today (Bentley et al., 2018).

The hypothesis of social multiplier effects is possible; however, it is still just a hypothesis. In fact, the researchers themselves seem to disprove the very thing that they claim to provide evidence of.

Within the Bentley et al., 2018 paper, they include a graph that shows the correlation, year by year, between the natural log of county-level median income and the obesity rate. This correlation is stagnant and not likely to be causal.

Left: correlation between obesity and the natural log of income, by state and by county, for each year. Right: correlation between the natural log of income and T2D, sorted by sex (the authors did not indicate why no curve for men exists; presumably it harms their case).

For diabetes, the authors broke the data up by gender at the county level. The correlation is also weaker at this more granular level, and there is nothing for males. In the past, data within a county has been broken down by census tract: a county in Washington had a very high level of income inequality, and researchers examined the prevalence of diabetes there by census tract. The correlation was only -0.18, even though the range of median incomes ran from under $20,000 to well over $100,000. The fact that these correlations get weaker rather than stronger as the data gets more granular calls into question what the relationship between the variables really is (Drewnowski et al., 2014).

At the most granular level, the individual, there is not a whole lot of modern American evidence going either way. There was a study by the highly controversial sociologist and evolutionary psychologist Satoshi Kanazawa that controlled for childhood IQ, childhood SES, father’s BMI, mother’s BMI, BMI at 16, and educational attainment to predict obesity at age 51. Using all of these variables, the effect size produced was 0.787 Cohen’s d units, but the majority of the explained variance came from BMI at age 16.

This was a UK cohort, so it may not be totally transferable to American society; however, his findings are relevant insofar as the UK is a Western nation and a healthcare system really cannot do much to prevent obesity.

While the heading of his regression indicates that he interprets the data differently than I do, column 3 of his table shows that not a lot of variance can be explained by the given variables once you exclude BMI at age 16. The Cox & Snell pseudo R-square of 0.05 converts to a medium effect size (d = 0.459); however, the variance explained is largely a function of controlling for the parents’ BMI, which does have some degree of genetic influence (Kanazawa, 2013). In fact, since IQ and education are good enough to explain a very considerable share of adult socioeconomic status, the 2nd column (d = 0.263) should be noted as well, and since the R² ratio of IQ to education exceeds 2:1, it should be asked what percentage of that is nurture vs. nature.
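
The R-squared-to-d step can be checked the same way; a minimal sketch that treats the Cox & Snell pseudo R-square as if it were ordinary variance explained (itself an approximation):

```python
import math

def r_squared_to_d(r_squared):
    r = math.sqrt(r_squared)             # variance explained -> correlation
    return 2 * r / math.sqrt(1 - r ** 2)

print(round(r_squared_to_d(0.05), 3))    # ~0.459
```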

In Conclusion…

Most of the stuff in health behavior intervention research rests on statistics that are far too weak to support the conclusions drawn from it.
