
I have a sample of about 4,000 $r$ (that is, Pearson correlation), $t$, or $F$ tests reported in psychology journals. These tests were drawn randomly from a larger dataset of about 500,000 statistical tests extracted from ~33,000 articles in 132 psychology journals.

For each statistical test I have the following data:

– Test statistic value

– Category of test statistic ($t$, $F$, or $r$)

– Degrees of freedom (both degrees of freedom in the case of an $F$-test)

– Estimate of standard error (since tests with missing degrees of freedom were excluded, I can generate this from the test statistic and degrees of freedom)

– Reported p-value

– Whether the reported p-value is consistent with the test statistic and degrees of freedom (with inconsistency likely indicating a reporting error)

– Year of publication (ranging from 1980 to 2019, though weighted towards more recent articles)

– Journal of publication

– Pre-registration status (either the article underwent some sort of pre-registration, or not)

– Classification of the statistical test as either “central” or “peripheral”

That last point relates to a classification of whether the statistical test was central to the main aims of the article, or whether it was instead peripheral to those aims (e.g. a statistical test done in the course of assumption-checking). These judgments were made by human raters, who have been shown to have good reliability/validity in relation to this task.

All the test statistics are converted to Fisher Z-transformed correlation coefficients, so that they may be compared.
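To make the conversion step concrete, here is a minimal sketch of one common way to do it. The formulas are assumptions for illustration, not necessarily the ones used to build this dataset: $r = t/\sqrt{t^2 + df}$ for $t$-tests, $r = \sqrt{df_1 F/(df_1 F + df_2)}$ for $F$-tests (the square root of partial eta squared, which loses the sign of the effect), and a standard error of $1/\sqrt{N-3}$ with $N = df + 2$ assumed. The function names are hypothetical.

```python
import math

def to_r(kind, value, df_error=None, df_effect=1):
    """Convert a test statistic to a correlation-type effect size.

    Assumed conversions (illustrative, not taken from the dataset):
      r: used as is
      t: r = t / sqrt(t^2 + df_error)
      F: r = sqrt(df_effect * F / (df_effect * F + df_error))  # sign is lost
    """
    if kind == "r":
        return value
    if kind == "t":
        return value / math.sqrt(value ** 2 + df_error)
    if kind == "F":
        return math.sqrt(df_effect * value / (df_effect * value + df_error))
    raise ValueError(f"unknown statistic type: {kind!r}")

def fisher_z(r):
    """Fisher Z-transform of a correlation (the inverse hyperbolic tangent)."""
    return math.atanh(r)

def se_fisher_z(df_error):
    """Approximate standard error of Fisher's z, 1 / sqrt(N - 3).

    Assumes N = df_error + 2 (as for a correlation or a two-group t-test),
    so the standard error is 1 / sqrt(df_error - 1).
    """
    return 1.0 / math.sqrt(df_error - 1)

# Hypothetical example: t(20) = 2.0 and F(1, 20) = 4.0 describe the same effect.
z_from_t = fisher_z(to_r("t", 2.0, df_error=20))
z_from_f = fisher_z(to_r("F", 4.0, df_error=20, df_effect=1))
```

With these conversions, a $t$ value and the corresponding $F = t^2$ with one numerator degree of freedom map to the same absolute effect size, which is what makes the three test types comparable.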

There are two main research questions of interest.

- Are “central” effect sizes declining over time? From a prior analysis (in which no distinction could be made between central and peripheral tests) we already suspect that overall test statistics are slightly declining over time.
- Are statistical reporting errors more common in central tests, or peripheral tests?

I’d originally planned to address these questions using two multilevel models:

- A multilevel regression in which tests are nested inside journals, and the outcome variable is the test effect size (Fisher Z-transformed correlation coefficient). Predictors would be statistic type ($F$, $t$, $r$), central/peripheral status, and year of publication.
- A multilevel logistic regression in which tests are nested inside journals, and the outcome variable is the probability the test contains a reporting error. Predictors would be statistic type ($F$, $t$, $r$) and central/peripheral status.
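
One way to write these two models down, as a sketch only: the notation is my own, with $i$ indexing tests and $j$ indexing journals, $r$ taken as the reference category for statistic type, and journal-level random intercepts $u_j$ and $v_j$ assumed.

$$z_{ij} = \beta_0 + \beta_1\,\text{central}_{ij} + \beta_2\,\text{year}_{ij} + \beta_3\,\text{type}^{(t)}_{ij} + \beta_4\,\text{type}^{(F)}_{ij} + u_j + \varepsilon_{ij}, \qquad u_j \sim N(0, \tau^2),\quad \varepsilon_{ij} \sim N(0, \sigma^2)$$

$$\operatorname{logit} \Pr(\text{error}_{ij} = 1) = \gamma_0 + \gamma_1\,\text{central}_{ij} + \gamma_2\,\text{type}^{(t)}_{ij} + \gamma_3\,\text{type}^{(F)}_{ij} + v_j, \qquad v_j \sim N(0, \omega^2)$$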

It’s been suggested to me that I should instead be doing “a multilevel meta-regression”. This is not a concept I was previously familiar with, but looking at the Cochrane Handbook, I read:

> Meta-regressions usually differ from simple regressions in two ways. First, larger studies have more influence on the relationship than smaller studies, since studies are weighted by the precision of their respective effect estimate. Second, it is wise to allow for the residual heterogeneity among intervention effects not modelled by the explanatory variables. This gives rise to the term ‘random-effects meta-regression’, since the extra variability is incorporated in the same way as in a random-effects meta-analysis.

It wasn’t obvious to me how either of those things would be relevant in my context.

Regarding the first research question (effect sizes over time), I get that weighting large-$N$ studies more heavily makes sense if the meta-analysis is interested in the size of the underlying effects being studied by psychologists. However, if the meta-analysis is only interested in assessing the effect sizes psychologists report over time, I don’t see why large-$N$ studies should be weighted more heavily.
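
To illustrate what is at stake, here is a minimal sketch comparing an unweighted slope of effect size on year with a precision-weighted one. It is only a single-level weighted least-squares slope, not a multilevel meta-regression, and the numbers are made up for illustration. The weights assume the variance of Fisher's $z$ is $1/(N-3)$, so each test gets weight $N-3$.

```python
def weighted_slope(x, y, w=None):
    """Slope of a (weighted) least-squares line of y on x.

    With w=None every observation gets weight 1, which is ordinary least squares.
    """
    if w is None:
        w = [1.0] * len(x)
    total = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total
    num = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    return num / den

# Hypothetical data: four tests, the most recent one from a very large sample.
years = [1990, 2000, 2010, 2019]
z = [0.40, 0.35, 0.30, 0.10]
n = [30, 40, 50, 2000]

# Inverse-variance weights, assuming Var(z) = 1 / (N - 3).
weights = [ni - 3 for ni in n]

unweighted = weighted_slope(years, z)          # every test counts equally
weighted = weighted_slope(years, z, weights)   # the large-N test dominates
```

In the unweighted version each reported test contributes equally, which matches the "what do psychologists report" reading of the question; in the weighted version the trend is driven mostly by the large-sample tests.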

Regarding the second research question (statistical reporting errors), I don’t see why large $N$ studies should be weighted higher.

What analysis should I be doing?
