What Is Propensity Matching and How Can We Improve Validity of Causality Claims?

Introduction

It is generally hard to run randomized trials in performance environments, especially with higher-level athletes. The randomized trial is the gold-standard method for minimizing confounders between control and treatment groups (although, as I will demonstrate, confounders can still sneak in), so that the effects can be attributed to the treatment and not to something else (confounders). This allows greater certainty in making causal claims (treatment A causes effect B).

But can we do the same with observational studies and data? It is harder, since we are not controlling (by design) all the potential confounders that might influence the effects, which reduces our confidence in causal claims. Observational studies are easier to run, though. Much easier. But can we do something about it? Reading Data Analysis Using Regression and Multilevel/Hierarchical Models, which is a must-have book by the way, gave me a couple of ideas/solutions that I want to explore in this post. So yes, this is a playbook, and I hope it will give you some food for thought and practical tools for data analysis.

Data set

For the purpose of this playbook, I created a data set representing a hypothetical randomized trial of a novel periodization method. We are interested in the effect of the new periodization method on the increase in back squat 1RM.

We have 100 subjects, evenly split into control and treatment groups, whose back squat 1RM is measured before and after the treatment. Subject characteristics include training age, body weight, and pre-test 1RM. By randomizing, we want to make sure that the control and treatment groups are not significantly different in those characteristics.

Here is the data generation script:

# Set random number generator seed so you can reproduce the scores
set.seed(1982)

athletesNumber <- 100

# Generate group (Control/Treatment)
group <- gl(n = 2, k = 1, length = athletesNumber,
            labels = c("Control", "Treatment"))

# Generate training age
ageTraining <- round(rnorm(n = athletesNumber, mean = 10, sd = 2), 0)

# Generate body weight
bodyWeight <- round(rnorm(n = athletesNumber, mean = 80, sd = 10), 0)

# Pre-test squat 1RM
preTestSquat1RM <- round(1.5 * ageTraining + 1.5 * bodyWeight +
                         rnorm(n = athletesNumber, mean = 0, sd = 10), 0)

# Intervention training volume (arbitrary units); the treatment group
# receives higher volume on average -- this is the lurking confounder
trainingVolume <- round(100 + ifelse(group == "Treatment",
                                     15 + rnorm(n = athletesNumber, mean = 0, sd = 15),
                                     0 + rnorm(n = athletesNumber, mean = 0, sd = 8)), 0)

# Post-test squat 1RM: depends on pre-test 1RM, training volume,
# and a small true treatment effect (+4)
postTestSquat1RM <- round(1.1 * preTestSquat1RM +
                          0.15 * trainingVolume +
                          ifelse(group == "Treatment", 4, 0) +
                          rnorm(n = athletesNumber, mean = 0, sd = 8), 0)

# Difference
difference <- postTestSquat1RM - preTestSquat1RM

# Percent change (as a proportion)
change <- round(difference / preTestSquat1RM, 2)

# Save the data in a data frame
studyData <- data.frame(ageTraining,
                        bodyWeight,
                        preTestSquat1RM,
                        group,
                        trainingVolume,
                        postTestSquat1RM,
                        difference,
                        change)

Let’s plot the data. Each dot represents a subject (the dots are “jittered” for easier viewing). The vertical line and dot represent the mean ± 1 SD.

suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(reshape2))

# Convert to long format for plotting
studyDataLong <- melt(studyData, id.vars = "group")

gg <- ggplot(studyDataLong, aes(x = group, y = value, color = group)) +
      geom_jitter(alpha = 0.3, width = 0.4) +
      stat_summary(geom = "pointrange", fun.data = "mean_sdl",
                   fun.args = list(mult = 1), # mean +/- 1 SD (mean_sdl defaults to 2 SD)
                   size = 0.4, color = "black") +
      facet_wrap(~variable, scales = "free") +
      theme_bw() +
      theme(legend.position = "none")
print(gg)

As can be seen from the graph, there is not much difference between the control and treatment groups in training age, body weight, and pre-test 1RM. So far so good. There is some difference in training volume (arbitrary units) and, of course, in the training effects: post-test 1RM, difference, and percent change.

Let’s perform a t-test on each parameter to check whether the differences between the control and treatment groups are statistically significant:

# Column 4 is the group factor, so it is excluded from the tests
lapply(studyData[-4], function(x) t.test(x ~ studyData$group))
## $ageTraining
## 
##  Welch Two Sample t-test
## 
## data:  x by studyData$group
## t = -0.049943, df = 97.924, p-value = 0.9603
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.8147054  0.7747054
## sample estimates:
##   mean in group Control mean in group Treatment 
##                   10.02                   10.04 
## 
## 
## $bodyWeight
## 
##  Welch Two Sample t-test
## 
## data:  x by studyData$group
## t = -1.4167, df = 96.951, p-value = 0.1598
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.146431  1.026431
## sample estimates:
##   mean in group Control mean in group Treatment 
##                   80.04                   82.60 
## 
## 
## $preTestSquat1RM
## 
##  Welch Two Sample t-test
## 
## data:  x by studyData$group
## t = -1.4886, df = 97.773, p-value = 0.1398
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10.685656   1.525656
## sample estimates:
##   mean in group Control mean in group Treatment 
##                  134.84                  139.42 
## 
## 
## $trainingVolume
## 
##  Welch Two Sample t-test
## 
## data:  x by studyData$group
## t = -5.3087, df = 75.164, p-value = 1.081e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -18.483205  -8.396795
## sample estimates:
##   mean in group Control mean in group Treatment 
##                  100.38                  113.82 
## 
## 
## $postTestSquat1RM
## 
##  Welch Two Sample t-test
## 
## data:  x by studyData$group
## t = -2.6192, df = 97.628, p-value = 0.01022
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -17.577058  -2.422942
## sample estimates:
##   mean in group Control mean in group Treatment 
##                  165.02                  175.02 
## 
## 
## $difference
## 
##  Welch Two Sample t-test
## 
## data:  x by studyData$group
## t = -3.4723, df = 91.573, p-value = 0.0007896
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.520321 -2.319679
## sample estimates:
##   mean in group Control mean in group Treatment 
##                   30.18                   35.60 
## 
## 
## $change
## 
##  Welch Two Sample t-test
## 
## data:  x by studyData$group
## t = -2.6219, df = 82.393, p-value = 0.01041
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.052760283 -0.007239717
## sample estimates:
##   mean in group Control mean in group Treatment 
##                  0.2262                  0.2562

We can also perform a bootstrapped effect size (ES; Cohen’s d) calculation:

suppressPackageStartupMessages(library(bootES))

lapply(studyData[-4], function(x) {
  bootES(data.frame(param = x, group = studyData$group),
         data.col = "param",
         group.col = "group",
         contrast = c("Control", "Treatment"),
         effect.type = "cohens.d")
})
## $ageTraining
## 
## User-specified lambdas: (Control, Treatment)
## Scaled lambdas: (-1, 1)
## 95.00% bca Confidence Interval, 2000 replicates
## Stat        CI (Low)    CI (High)   bias        SE          
## 0.010       -0.400      0.416       -0.007      0.204       
## 
## 
## $bodyWeight
## 
## User-specified lambdas: (Control, Treatment)
## Scaled lambdas: (-1, 1)
## 95.00% bca Confidence Interval, 2000 replicates
## Stat        CI (Low)    CI (High)   bias        SE          
## 0.283       -0.128      0.691       0.004       0.210       
## 
## 
## $preTestSquat1RM
## 
## User-specified lambdas: (Control, Treatment)
## Scaled lambdas: (-1, 1)
## 95.00% bca Confidence Interval, 2000 replicates
## Stat        CI (Low)    CI (High)   bias        SE          
## 0.298       -0.122      0.681       0.011       0.206       
## 
## 
## $trainingVolume
## 
## User-specified lambdas: (Control, Treatment)
## Scaled lambdas: (-1, 1)
## 95.00% bca Confidence Interval, 2000 replicates
## Stat        CI (Low)    CI (High)   bias        SE          
## 1.062       0.627       1.455       0.020       0.207       
## 
## 
## $postTestSquat1RM
## 
## User-specified lambdas: (Control, Treatment)
## Scaled lambdas: (-1, 1)
## 95.00% bca Confidence Interval, 2000 replicates
## Stat        CI (Low)    CI (High)   bias        SE          
## 0.524       0.110       0.948       0.013       0.212       
## 
## 
## $difference
## 
## User-specified lambdas: (Control, Treatment)
## Scaled lambdas: (-1, 1)
## 95.00% bca Confidence Interval, 2000 replicates
## Stat        CI (Low)    CI (High)   bias        SE          
## 0.694       0.286       1.117       0.011       0.208       
## 
## 
## $change
## 
## User-specified lambdas: (Control, Treatment)
## Scaled lambdas: (-1, 1)
## 95.00% bca Confidence Interval, 2000 replicates
## Stat        CI (Low)    CI (High)   bias        SE          
## 0.524       0.089       0.950       0.016       0.215

As can be seen from the ES estimates, the differences between control and treatment in training age, body weight, and pre-test squat 1RM are very small, and their CIs include zero (hence they are non-significant). That is not the case with training volume, post-test 1RM, difference, and percent change.

Analysis

As researchers and coaches, we might be interested in the effects of this novel periodization method on the changes in back squat 1RM (in the population; inference or generalization). To do this we can perform ANOVA or linear regression. I prefer the linear regression approach (the results are the same):

model1 <- lm(difference ~ group, studyData)
summary(model1)
## 
## Call:
## lm(formula = difference ~ group, data = studyData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15.18  -6.60  -0.60   5.82  16.40 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      30.180      1.104  27.343  < 2e-16 ***
## groupTreatment    5.420      1.561   3.472 0.000769 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.805 on 98 degrees of freedom
## Multiple R-squared:  0.1096, Adjusted R-squared:  0.1005 
## F-statistic: 12.06 on 1 and 98 DF,  p-value: 0.0007694
confint(model1)
##                    2.5 %    97.5 %
## (Intercept)    27.989665 32.370335
## groupTreatment  2.322398  8.517602

And we can conclude that the effect of this novel periodization method is statistically significant.
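
To back up the claim that ANOVA gives the same answer, here is a quick sketch of the equivalent one-way ANOVA; with only two groups, its F-test is mathematically identical to the regression F-test above (F = 12.06, p = 0.00077):

# One-way ANOVA on the same data; with two groups this is
# equivalent to the regression above
summary(aov(difference ~ group, data = studyData))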

But what we didn’t take into account is the training volume. We can plug training volume into the regression formula as an interaction, but before that let’s plot a scatterplot of training volume against the difference (change) in 1RM:

gg <- ggplot(studyData, aes(x = trainingVolume,
                            y = difference,
                            color = group,
                            fill = group)) +
      geom_smooth(alpha = 0.1, method = "lm") + 
      geom_point(alpha = 0.6) +
      theme_bw() +
      theme(legend.position="top")
print(gg)

The groups do seem to overlap, although the treatment group has a much bigger spread and a higher mean. Both groups seem to have a slope (effect) of the same size, which suggests that the treatment itself may NOT be what is driving the changes in 1RM.

Let’s do a regression analysis and see whether the effects of training volume are significant. First we perform a linear regression regardless of group (i.e., pooled):

model2 <- lm(difference ~ trainingVolume, studyData)
summary(model2)
## 
## Call:
## lm(formula = difference ~ trainingVolume, data = studyData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.3498  -5.6979   0.2872   5.3213  13.8602 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     6.97126    5.70225   1.223    0.224    
## trainingVolume  0.24201    0.05278   4.585 1.34e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.505 on 98 degrees of freedom
## Multiple R-squared:  0.1766, Adjusted R-squared:  0.1682 
## F-statistic: 21.02 on 1 and 98 DF,  p-value: 1.34e-05
confint(model2)
##                     2.5 %     97.5 %
## (Intercept)    -4.3446661 18.2871856
## trainingVolume  0.1372666  0.3467435

This is the control group:

model3 <- lm(difference ~ trainingVolume, subset(studyData, group == "Control"))
summary(model3)
## 
## Call:
## lm(formula = difference ~ trainingVolume, data = subset(studyData, 
##     group == "Control"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1996  -7.4195   0.3462   5.6633  14.0805 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)      6.6622    14.6587   0.454    0.652
## trainingVolume   0.2343     0.1455   1.610    0.114
## 
## Residual standard error: 8.639 on 48 degrees of freedom
## Multiple R-squared:  0.05123,    Adjusted R-squared:  0.03147 
## F-statistic: 2.592 on 1 and 48 DF,  p-value: 0.114
confint(model3)
##                       2.5 %     97.5 %
## (Intercept)    -22.81116459 36.1354697
## trainingVolume  -0.05830778  0.5268841

And for the treatment group:

model4 <- lm(difference ~ trainingVolume, subset(studyData, group == "Treatment"))
summary(model4)
## 
## Call:
## lm(formula = difference ~ trainingVolume, data = subset(studyData, 
##     group == "Treatment"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.2643  -4.7499  -0.0554   3.8668  12.8624 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)    14.60671    6.33869   2.304  0.02557 * 
## trainingVolume  0.18444    0.05517   3.343  0.00161 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.089 on 48 degrees of freedom
## Multiple R-squared:  0.1888, Adjusted R-squared:  0.1719 
## F-statistic: 11.18 on 1 and 48 DF,  p-value: 0.001614
confint(model4)
##                     2.5 %     97.5 %
## (Intercept)    1.86191093 27.3515104
## trainingVolume 0.07350768  0.2953781

What if we plug in training volume as an interaction with group (pretty much an ANCOVA)? Will it decrease the effect of the treatment?

model5 <- lm(difference ~ trainingVolume*group, studyData)
summary(model5)
## 
## Call:
## lm(formula = difference ~ trainingVolume * group, data = studyData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1996  -5.5512   0.1535   5.5368  14.0805 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                    6.66215   12.68145   0.525   0.6006  
## trainingVolume                 0.23429    0.12589   1.861   0.0658 .
## groupTreatment                 7.94456   14.87760   0.534   0.5946  
## trainingVolume:groupTreatment -0.04985    0.14295  -0.349   0.7281  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.473 on 96 degrees of freedom
## Multiple R-squared:  0.2002, Adjusted R-squared:  0.1752 
## F-statistic: 8.011 on 3 and 96 DF,  p-value: 8.068e-05
confint(model5)
##                                      2.5 %     97.5 %
## (Intercept)                   -18.51032362 31.8346288
## trainingVolume                 -0.01561124  0.4841876
## groupTreatment                -21.58725017 37.4763664
## trainingVolume:groupTreatment  -0.33360203  0.2339114

As can be seen, the model performs better (lower residual standard error), but now the effect of the treatment is no longer statistically significant.
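
Since model1 (group only) is nested within model5 (group, training volume, and their interaction), we can also compare the models formally. Here is a quick sketch using a nested F-test and AIC:

# Nested F-test: does adding training volume (and its interaction
# with group) significantly improve on the group-only model?
anova(model1, model5)

# Information criteria tell a similar story (lower is better)
AIC(model1, model5)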

Propensity score matching

Another approach, besides plugging covariates into the regression together with the group variable, is matching.

Sometimes randomized assignment doesn’t result in equal groups. This is even more common in observational studies. What we are looking to do is “match” the individuals, or in other words find pairs of individuals (or observations) across the two groups (control and treatment) that have similar characteristics (covariates). Usually, covariates are pre-intervention variables, but as seen above I’ve used training volume as a covariate as well.

How do we do this? One method is to use “propensity scores”. A propensity score is nothing more than the probability of being in the treatment group. It is usually estimated with logistic regression, where we predict each individual’s group from their covariates.

Let’s do it with our data, using training age, body weight, and pre-test 1RM in this case:

# Logistic regression: estimate the probability of being in the
# Treatment group (the second factor level) from the covariates
propensityModel1 <- glm(group ~ ageTraining + bodyWeight + preTestSquat1RM,
                        studyData,
                        family = "binomial")

studyData$propensityScore1 <- predict(propensityModel1, type = "response")

If we plot the distribution of propensity scores across groups, we want to see overlap. Overlap means that individuals have the same/similar probability of being in either group based on their covariates. That is exactly what we want from a randomized trial.

gg <- ggplot(studyData, aes(x = propensityScore1, fill = group)) +
      geom_density(alpha = 0.5) +
      theme_bw()

print(gg)

As can be seen, there is an overlap. We can perform a t-test on the propensity scores to check whether the groups differ:

t.test(propensityScore1 ~ group, studyData)
## 
##  Welch Two Sample t-test
## 
## data:  propensityScore1 by group
## t = -1.6148, df = 96.045, p-value = 0.1096
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.057181903  0.005879697
## sample estimates:
##   mean in group Control mean in group Treatment 
##               0.4871744               0.5128256

As can be seen, the difference is not statistically significant. This means that, at least with respect to these covariates, the randomization worked.
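
Besides the t-test, a common balance diagnostic in the matching literature is the standardized mean difference (SMD) of each covariate, with values below roughly 0.1 usually taken as acceptable balance. Here is a minimal sketch (the smd helper below is my own, not from a package):

# Standardized mean difference: difference in group means divided by
# the pooled standard deviation
smd <- function(x, group) {
  m <- tapply(x, group, mean)
  s <- tapply(x, group, sd)
  as.numeric((m["Treatment"] - m["Control"]) /
             sqrt((s["Treatment"]^2 + s["Control"]^2) / 2))
}

sapply(studyData[c("ageTraining", "bodyWeight", "preTestSquat1RM")],
       smd, group = studyData$group)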

But what if we enter training volume into the equation?

propensityModel2 <- glm(group ~ ageTraining + bodyWeight +
                                preTestSquat1RM + trainingVolume,
                       studyData,
                       family = "binomial")

studyData$propensityScore2 <- predict(propensityModel2, type = "response")

gg <- ggplot(studyData, aes(x = propensityScore2, fill = group)) +
      geom_density(alpha = 0.5) +
      theme_bw()

print(gg)

There is much less overlap. This is troublesome, because it means that our groups are no longer comparable once training volume is taken into account. Let’s perform the t-test:

t.test(propensityScore2 ~ group, studyData)
## 
##  Welch Two Sample t-test
## 
## data:  propensityScore2 by group
## t = -5.4355, df = 84.328, p-value = 5.214e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3174002 -0.1473701
## sample estimates:
##   mean in group Control mean in group Treatment 
##               0.3838074               0.6161926

And, as expected, the propensity scores now differ significantly between groups.

What can we do? Well, we can perform matching. Matching tries to find pairs of subjects (one from each group) that are most similar in propensity score. There are multiple algorithms for doing this; I will use the MatchIt package.

suppressPackageStartupMessages(library(MatchIt))

# Convert group to 0 = Control, 1 = Treatment, as required by the matchit function
studyData$treatment <- ifelse(studyData$group == "Treatment", 1, 0)

m <- matchit(treatment ~ ageTraining + bodyWeight +
                         preTestSquat1RM + trainingVolume,
             data = studyData,
             method = "nearest",  # nearest-neighbor matching on the propensity score
             discard = "both")    # drop units outside the region of common support
print(m)
## 
## Call: 
## matchit(formula = treatment ~ ageTraining + bodyWeight + preTestSquat1RM + 
##     trainingVolume, data = studyData, method = "nearest", discard = "both")
## 
## Sample sizes:
##           Control Treated
## All            50      50
## Matched        31      31
## Unmatched      18       0
## Discarded       1      19
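
MatchIt also ships with its own balance diagnostics, so we don’t have to roll our own (output omitted here):

# Balance summary before and after matching, including standardized
# mean differences for each covariate
summary(m)

# Visual check of propensity scores for matched and unmatched units
plot(m, type = "jitter")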

Let’s estimate the propensity scores again after matching and plot them:

matched <- match.data(m)

propensityModel3 <- glm(group ~ ageTraining + bodyWeight +
                                preTestSquat1RM + trainingVolume,
                       matched,
                       family = "binomial")

matched$propensityScore <- predict(propensityModel3, type = "response")

gg <- ggplot(matched, aes(x = propensityScore, fill = group)) +
      geom_density(alpha = 0.5) +
      theme_bw()

print(gg)

As can be seen, there is an overlap now. Let’s plot the distributions as we did at the beginning, but now using the matched subjects:

# Convert to long format for plotting (columns 1-8 are the original variables)
matchedLong <- melt(matched[1:8], id.vars = "group")

gg <- ggplot(matchedLong, aes(x = group, y = value, color = group)) +
      geom_jitter(alpha = 0.3, width = 0.4) +
      stat_summary(geom = "pointrange", fun.data = "mean_sdl",
                   fun.args = list(mult = 1), # mean +/- 1 SD
                   size = 0.4, color = "black") +
      facet_wrap(~variable, scales = "free") +
      theme_bw() +
      theme(legend.position = "none")
print(gg)

As it seems now, training volume is much more similar between the groups than before matching.
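
We can confirm this impression with a quick t-test on the matched data (output omitted):

# After matching, training volume should no longer differ
# significantly between the groups
t.test(trainingVolume ~ group, matched)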

We can plot the relationship of training volume to change scores as we did before:

gg <- ggplot(matched, aes(x = trainingVolume,
                            y = difference,
                            color = group,
                            fill = group)) +
      geom_smooth(alpha = 0.1, method = "lm") + 
      geom_point(alpha = 0.6) +
      theme_bw() +
      theme(legend.position="top")
print(gg)

As can be seen, they pretty much overlap, which is the goal of matching.

Let’s do linear regression to judge the effects:

matchedModel1 <- lm(difference ~ group, matched)
summary(matchedModel1)
## 
## Call:
## lm(formula = difference ~ group, data = matched)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.9677  -4.7742  -0.6935   5.0323  14.0323 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      30.968      1.398  22.156   <2e-16 ***
## groupTreatment    2.226      1.977   1.126    0.265    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.782 on 60 degrees of freedom
## Multiple R-squared:  0.02069,    Adjusted R-squared:  0.004373 
## F-statistic: 1.268 on 1 and 60 DF,  p-value: 0.2646
confint(matchedModel1)
##                   2.5 %    97.5 %
## (Intercept)    28.17186 33.763626
## groupTreatment -1.72817  6.179783

If we compare it to the unmatched data:

model1 <- lm(difference ~ group, studyData)
summary(model1)
## 
## Call:
## lm(formula = difference ~ group, data = studyData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15.18  -6.60  -0.60   5.82  16.40 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      30.180      1.104  27.343  < 2e-16 ***
## groupTreatment    5.420      1.561   3.472 0.000769 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.805 on 98 degrees of freedom
## Multiple R-squared:  0.1096, Adjusted R-squared:  0.1005 
## F-statistic: 12.06 on 1 and 98 DF,  p-value: 0.0007694
confint(model1)
##                    2.5 %    97.5 %
## (Intercept)    27.989665 32.370335
## groupTreatment  2.322398  8.517602

We can see that there is a huge difference: matching subjects yielded a much smaller (and non-significant) effect of the treatment. The apparent effect was largely driven by the difference in training volume, not by the intervention itself.
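
If you prefer effect sizes, the bootstrapped Cohen’s d from earlier can be re-run on the matched data in exactly the same way (output omitted):

# Bootstrapped Cohen's d for the difference score in the matched sample
bootES(data.frame(param = matched$difference, group = matched$group),
       data.col = "param",
       group.col = "group",
       contrast = c("Control", "Treatment"),
       effect.type = "cohens.d")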

Let’s compare the models with the interaction term. Here is the matched data set:

matchedModel2 <- lm(difference ~ group*trainingVolume, matched)
summary(matchedModel2)
## 
## Call:
## lm(formula = difference ~ group * trainingVolume, data = matched)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.0647  -4.7075  -0.0358   4.7694  12.9007 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                    -2.0131    16.4268  -0.123   0.9029  
## groupTreatment                 12.7240    21.8811   0.582   0.5632  
## trainingVolume                  0.3218     0.1597   2.015   0.0486 *
## groupTreatment:trainingVolume  -0.1056     0.2114  -0.500   0.6191  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.506 on 58 degrees of freedom
## Multiple R-squared:  0.1194, Adjusted R-squared:  0.07382 
## F-statistic: 2.621 on 3 and 58 DF,  p-value: 0.05925
confint(matchedModel2)
##                                       2.5 %     97.5 %
## (Intercept)                   -34.894971896 30.8688176
## groupTreatment                -31.075906613 56.5238280
## trainingVolume                  0.002047561  0.6415819
## groupTreatment:trainingVolume  -0.528698476  0.3174280

And here is our original, unmatched data set:

model5 <- lm(difference ~ trainingVolume*group, studyData)
summary(model5)
## 
## Call:
## lm(formula = difference ~ trainingVolume * group, data = studyData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1996  -5.5512   0.1535   5.5368  14.0805 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                    6.66215   12.68145   0.525   0.6006  
## trainingVolume                 0.23429    0.12589   1.861   0.0658 .
## groupTreatment                 7.94456   14.87760   0.534   0.5946  
## trainingVolume:groupTreatment -0.04985    0.14295  -0.349   0.7281  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.473 on 96 degrees of freedom
## Multiple R-squared:  0.2002, Adjusted R-squared:  0.1752 
## F-statistic: 8.011 on 3 and 96 DF,  p-value: 8.068e-05
confint(model5)
##                                      2.5 %     97.5 %
## (Intercept)                   -18.51032362 31.8346288
## trainingVolume                 -0.01561124  0.4841876
## groupTreatment                -21.58725017 37.4763664
## trainingVolume:groupTreatment  -0.33360203  0.2339114

The coefficients are pretty similar, but the effect of training volume is even higher in the matched data set.

Conclusion

Researchers usually try to randomize correctly, but sometimes the effect is not due to the treatment itself, but to a difference between the groups before the treatment (pre-treatment) or during the treatment (for example, training volume), which should be similar or the same so we can judge the effects of the thing we actually care about (for example, the novel periodization). The solutions are to include covariates in the model and/or to perform matching. Sometimes the number of subjects is so small that matching is out of the question (but it is a really interesting tool for observational data, where we might have many more subjects/observations).

Hence, the next time you read that a new periodization scheme or any other treatment is statistically significant, make sure to read the methods and statistical analysis sections as well.

I hope this example gave you some ideas and practical tools you can use.
