# Estimating the treatment effect {#sec-estimating}
Once an adequate specification has been found, it comes time to estimate the treatment effect and its standard error. Because the goal of propensity score analysis is to mimic the qualities of a randomized trial, methods for estimating effects in randomized trials can be used in the conditioned data, with some special attention required for characterizing uncertainty. The outcome type (e.g., binary, count, survival time) and estimand (e.g., risk difference, incidence ratio, hazard ratio) determine the specific method used to estimate the effect, but they generally follow the same structure. Below, we'll describe how to estimate the effect and how to compute its standard error or confidence interval.
## The effect estimate
The simplest way to compute the treatment effect after matching or weighting is to compute the weighted mean of the outcome in each treatment group and contrast them to form the estimand of choice. For example, for a binary outcome of death by 6 months after propensity score weighting for the ATE, we can compute the weighted proportion of deaths in the treated group and the weighted proportion of deaths in the control group, which give us the expected potential outcomes under treatment and control for the full sampled population. We can compute the risk ratio by taking the ratio of these weighted proportions. Equally simple would be to run a weighted log-binomial regression of death by 6 months on the treatment and use the exponentiated coefficient on treatment as the estimated risk ratio.
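The weighted-proportion contrast just described can be sketched in a few lines. This is a toy example with made-up arrays; `w` stands in for whatever weights result from the matching or weighting step:

```python
import numpy as np

a = np.array([1, 1, 1, 0, 0, 0])              # treatment indicator
y = np.array([1, 0, 1, 0, 0, 1])              # death by 6 months (binary outcome)
w = np.array([1.2, 0.8, 1.5, 1.1, 0.9, 2.0])  # illustrative ATE weights

# Weighted proportion of deaths in each group = expected potential outcomes
p_treated = np.average(y[a == 1], weights=w[a == 1])
p_control = np.average(y[a == 0], weights=w[a == 0])

# Contrast the expected potential outcomes as a risk ratio
risk_ratio = p_treated / p_control
```

The same two weighted means could instead be contrasted as a risk difference or odds ratio; the weighting step does not change.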
### Adjusting for covariates
Further adjustment for covariates in the outcome model is a good idea as it improves precision at little to no cost and often reduces bias (though less so if the matching or weighting was effective)[^estimating-1]. However, it is important to do so using a specific procedure that does not distort the target estimand. This procedure is called "g-computation" [@snowdenImplementationGComputationSimulated2011], and its weighted variant is weighted g-computation [@gabrielInverseProbabilityTreatment2024; @vansteelandtInvitedCommentaryGComputation2011], which will be the focus of this section.
[^estimating-1]: You may also hear about the potential for "double-robustness" [@danielDoubleRobustness2018; @kangDemystifyingDoubleRobustness2007], i.e., that the estimate is consistent if either the propensity score model or outcome model are correct, giving you two chances to get it right. I won't discuss this perspective because the perspective I take is that the conditioning should do all the work of balancing the covariates and removing bias, and the purpose of the outcome model is primarily to increase precision. Truly doubly-robust estimators involve different procedures, different modes of inference, and different priorities in assessment than traditional propensity score analysis.
**Do not use the coefficient on treatment as an estimate of the treatment effect when covariates are included in the model**. This coefficient generally does not correspond to a valid marginal estimand and its interpretation depends on correct specification of the outcome model (and if we could do that, we wouldn't need propensity score analysis in the first place). G-computation is a method of computing a marginal treatment effect from a model that preserves the estimand, is agnostic to the form of the model, and doesn't require the model to be correctly specified to see the benefits of covariate adjustment.
Generally, g-computation works as follows:
1. Fit a model for the outcome given the treatment, covariates, and their interaction.
2. Generate predicted values from this model for all units after setting their treatment status to treated, regardless of their actual treatment status; these are the estimated potential outcomes under treatment.
3. Generate predicted values from this model for all units after setting their treatment status to control, regardless of their actual treatment status; these are the estimated potential outcomes under control.
4. Compute the mean of the estimated potential outcomes under treatment and under control; these are the expected potential outcomes under treatment and control.
5. Contrast the expected potential outcomes to arrive at a marginal treatment effect estimate.
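The five steps above can be sketched with `statsmodels`. The data here are simulated and the variable names (`a`, `x`, `y`) are illustrative; a logistic outcome model is used, and the contrast reported is the marginal risk difference:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({"x": rng.normal(size=n)})
df["a"] = rng.binomial(1, 1 / (1 + np.exp(-df["x"])))
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * df["a"] + df["x"]))))

# Step 1: outcome model with treatment, covariate, and their interaction
fit = smf.logit("y ~ a * x", data=df).fit(disp=0)

# Steps 2-3: predict for ALL units with treatment set to 1, then to 0
p1 = fit.predict(df.assign(a=1))  # estimated potential outcomes under treatment
p0 = fit.predict(df.assign(a=0))  # estimated potential outcomes under control

# Steps 4-5: average and contrast to get a marginal risk difference
rd = p1.mean() - p0.mean()
```

For a weighted analysis, the only change in the sketch would be fitting the outcome model with the matching or weighting weights (e.g., via `freq_weights` in a `statsmodels` GLM) before taking the same predictions and contrast.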
When weights are included, we need to adjust the procedure slightly: the outcome model should be fit as a weighted model[^estimating-2]. This ensures the balancing properties of the weights are employed.
[^estimating-2]: When sampling weights are used, or when the estimated weights target an estimand other than the ATE, ATT, or ATC, the weights must also be used when computing the weighted means of the estimated potential outcomes in each group to ensure the estimate reflects the correct target population.
One of the primary benefits of g-computation is that the model and the estimand are completely divorced from each other, so one can fit the right model for the data regardless of which contrast is to be reported. For example, for a binary outcome, one can fit a logistic regression model but compute a risk difference using g-computation. Traditional approaches that rely on using the coefficient on treatment require the model to correspond to the desired contrast, which can yield bias if the model is not right for the data (e.g., a linear model for a binary outcome, which can produce predictions beyond 0 and 1).
Another benefit of g-computation is that the procedure is exactly the same no matter what model is used for the outcome, how that model is parameterized (i.e., whether polynomials or interactions are included in the model), or how the data was conditioned (i.e., matching or weighting). This obviates the need for considering the interpretability of the original model or preprocessing the data to ensure coefficients have valid interpretations, such as using contrast coding or centering. This also opens the door to using outcome models that are traditionally challenging to interpret (e.g., probit regression instead of logistic regression for binary outcomes, splines to capture nonlinear covariate effects, or multi-part models like zero-inflated Poisson models for count outcomes).
### Survival outcomes
Survival outcomes often require additional considerations because of censoring and the nature of the outcome. The most common method for describing the effect of a treatment on a survival outcome is the hazard ratio, which corresponds to the exponentiated coefficient in a Cox proportional hazards model. Hazard ratios carry a number of problems: the hazard ratio is confounded, even in randomized trials, after a single event occurs [@hernanHazardsHazardRatios2010]; the hazard ratio is noncollapsible, meaning its value differs when considering individuals vs. the population; and the proportional hazards assumption is almost never satisfied, making the Cox model coefficient correspond to an ambiguously weighted average of the non-proportional time-specific hazards [@stensrudWhyTestProportional2020a]. Other estimands have been proposed that avoid the problems with hazard ratios and Cox models and are often computed from the Kaplan-Meier estimate of the survival function, such as the restricted mean survival time [@pakInterpretabilityCancerClinical2017] or the risk difference at pre-specified follow-up intervals [@cafriVarianceEstimationRisk2023].
For many of these methods, weighted versions are available that can incorporate weights resulting from matching or weighting. Including covariates in the outcome model while maintaining a marginal estimand is often less straightforward, and methods for doing so are less developed or accessible, so we recommend simply using the weighted versions with treatment as the sole predictor or grouping variable.
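To make the weighted survival-function approach concrete, here is a minimal hand-rolled weighted Kaplan-Meier sketch on toy data, contrasting the risk of death by a fixed horizon between groups; `w` stands in for the weights from matching or weighting (set to 1 here):

```python
import numpy as np

def weighted_km(time, event, w, t_eval):
    """Weighted Kaplan-Meier survival probability at t_eval."""
    surv = 1.0
    for t in np.sort(np.unique(time[event == 1])):
        if t > t_eval:
            break
        at_risk = w[time >= t].sum()                    # weighted number at risk
        deaths = w[(time == t) & (event == 1)].sum()    # weighted events at t
        surv *= 1 - deaths / at_risk
    return surv

# Toy data: follow-up time, event indicator (1 = death), treatment, weights
time = np.array([2., 3., 3., 5., 4., 6., 7., 9.])
event = np.array([1, 1, 0, 0, 1, 0, 1, 0])
a = np.array([1, 1, 1, 1, 0, 0, 0, 0])
w = np.ones(8)

s1 = weighted_km(time[a == 1], event[a == 1], w[a == 1], t_eval=6)
s0 = weighted_km(time[a == 0], event[a == 0], w[a == 0], t_eval=6)
risk_diff = (1 - s1) - (1 - s0)  # difference in risk of death by t = 6
```

With treatment as the sole grouping variable, as recommended above, the weights are the only place the propensity score analysis enters this calculation.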
## Confidence intervals and standard errors
Care needs to be taken when computing confidence intervals (CIs) and standard errors (SEs), as these may need to take into account multiple sources of variation, including from estimation of the propensity scores or weights, the matching, and the outcome. For some methods, analytic formulas for SEs have been developed, but for others we have to rely on approximations or computationally intense methods. When using g-computation, the treatment effect does not correspond to a coefficient in the model, so we also need to take into account how the SE of the treatment effect is derived from the SEs of the model coefficients.
There are three broad methods we can use for computing CIs and SEs after propensity score analysis: asymptotically correct standard errors using M-estimation, (cluster-) robust SEs, and bootstrap CIs, which we describe below. See @tbl-se-ci for a summary of which method should be used in different circumstances.
### Asymptotically correct standard errors using M-estimation
M-estimation can be used with weighting methods and involves jointly estimating the parameters of the weighting model (e.g., the coefficients in the logistic regression model for the propensity score) and the parameters of the outcome model (i.e., the coefficients on treatment and the covariates), which adjusts the SEs of the treatment effect estimate to reflect both sources of uncertainty. Most theoretical developments on the estimation of asymptotically correct standard errors for propensity score-weighted treatment effect estimates rely on M-estimation theory [@luncefordStratificationWeightingPropensity2004; @reifeisVarianceTreatmentEffect2020; @gabrielInverseProbabilityTreatment2024]. We refer curious readers to @stefanskiCalculusMEstimation2002 for an introduction. This method can only be used with weighting and only when the weights are estimated using a parametric model (i.e., not a machine-learning model). These SEs tend to perform best in large samples [@austinBootstrapVsAsymptotic2022].
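A bare-bones sketch of the M-estimation idea for an IPW ATE, on simulated data: the per-unit estimating functions stack the logistic propensity score equations with the IPW mean equations, and the sandwich variance propagates the propensity score uncertainty into the effect SE. All data and variable names are illustrative:

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
expit = lambda z: 1 / (1 + np.exp(-z))
A = rng.binomial(1, expit(0.3 * X[:, 0] - 0.2 * X[:, 1]))
Y = 1.0 * A + X[:, 0] + rng.normal(size=n)   # true ATE = 1
Xd = np.column_stack([np.ones(n), X])        # propensity score design matrix

def estfun(theta):
    """n x 5 per-unit estimating functions: PS score equations + IPW means."""
    beta, mu1, mu0 = theta[:3], theta[3], theta[4]
    e = expit(Xd @ beta)
    score = (A - e)[:, None] * Xd            # logistic score equations
    g1 = A * Y / e - mu1                     # IPW mean under treatment
    g0 = (1 - A) * Y / (1 - e) - mu0         # IPW mean under control
    return np.column_stack([score, g1, g0])

# Solve the PS score equations, then the IPW means in closed form
beta_hat = root(lambda b: ((A - expit(Xd @ b))[:, None] * Xd).mean(0),
                x0=np.zeros(3)).x
e_hat = expit(Xd @ beta_hat)
theta_hat = np.concatenate([beta_hat, [(A * Y / e_hat).mean(),
                                       ((1 - A) * Y / (1 - e_hat)).mean()]])

# Sandwich variance (1/n) A^{-1} B A^{-T}, A = -E[d psi/d theta], B = E[psi psi']
p, eps = 5, 1e-6
Amat = np.empty((p, p))
for j in range(p):
    tp, tm = theta_hat.copy(), theta_hat.copy()
    tp[j] += eps
    tm[j] -= eps
    Amat[:, j] = -(estfun(tp).mean(0) - estfun(tm).mean(0)) / (2 * eps)
psi = estfun(theta_hat)
Bmat = psi.T @ psi / n
V = np.linalg.inv(Amat) @ Bmat @ np.linalg.inv(Amat).T / n

ate = theta_hat[3] - theta_hat[4]
grad = np.array([0., 0., 0., 1., -1.])
se = float(np.sqrt(grad @ V @ grad))
```

Treating the weights as fixed would drop the score-equation rows from the stack; keeping them is exactly what distinguishes this SE from the robust SE described next.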
### Robust and cluster-robust SEs
Also known as sandwich SEs (due to the form of the formula for computing them), heteroscedasticity-consistent SEs, or Huber-White SEs, robust SEs are an adjustment to the usual maximum likelihood or ordinary least squares SEs that are robust to violations of some of the assumptions required for usual SEs to be valid [@mackinnonHeteroskedasticityconsistentCovarianceMatrix1985]. Robust SEs have been shown to be conservative (i.e., yield overly large SEs and wide confidence intervals) for estimating the ATE after some forms of weighting [@robinsMarginalStructuralModels2000], though they can be either conservative or not for other weighting methods and estimands, such as for the ATT [@reifeisVarianceTreatmentEffect2020] or for entropy balancing [@chanGloballyEfficientNonparametric2016]. Robust SEs treat the estimated weights as if they were fixed and known, ignoring uncertainty in their estimation. Although they are quick and simple to estimate, they should be used with caution, and the asymptotically correct SEs or bootstrap (described below) should be preferred in most cases.
A version of robust SEs known as cluster-robust SEs [@liangLongitudinalDataAnalysis1986] can be used to account for dependence between observations within clusters (e.g., matched pairs). @abadieRobustPostMatchingInference2020 demonstrate analytically that cluster-robust SEs are generally valid after matching, whereas regular robust SEs can over- or under-estimate the true sampling variability of the effect estimator depending on the specification of the outcome model (if any) and degree of effect modification. Several simulation studies have further confirmed the validity of cluster-robust SEs after matching [e.g., @austinUseBootstrappingWhen2014; @austinTypeErrorRates2009; @gayatPropensityScoreApplied2012; @wanMatchedUnmatchedAnalyses2019].
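In practice, requesting cluster-robust SEs usually amounts to passing the cluster membership to the outcome model fit. A minimal `statsmodels` sketch with simulated pair-matched data, where `pair_id` stands in for the pair IDs produced by a matching step:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_pairs = 200
# Toy matched sample: each pair contributes one treated and one control unit
df = pd.DataFrame({
    "pair_id": np.repeat(np.arange(n_pairs), 2),
    "a": np.tile([1, 0], n_pairs),
})
df["y"] = 0.5 * df["a"] + rng.normal(size=2 * n_pairs)

# Cluster-robust SEs with matched pairs as the clusters
fit = smf.ols("y ~ a", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["pair_id"]}
)
se_a = fit.bse["a"]
```

With no covariates in the model, the coefficient on `a` here is just the difference in means, so no g-computation step is needed; with covariates, the same `cov_type` arguments supply the coefficient covariance that the delta method (below) consumes.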
To compute robust SEs after g-computation, a method known as the delta method is used; this is a way to compute the SEs of the derived quantities (the expected potential outcomes and their contrast) from the variance of the coefficients of the outcome models. For nonlinear models (e.g., logistic regression), the delta method is only an approximation subject to error (though in many cases this error is small and shrinks in large samples). Because the delta method relies on the variance of the coefficients from the outcome model, it is important to correctly estimate these variances.
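One generic way to carry out the delta method is numerically: express the g-computation estimate as a function of the outcome-model coefficients, take its gradient by finite differences, and propagate the coefficient covariance through it. A sketch on simulated data (variable names illustrative; the design matrix is built by hand to match the `y ~ a * x` formula's column order):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1500
df = pd.DataFrame({"x": rng.normal(size=n)})
df["a"] = rng.binomial(1, 0.5, size=n)
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-(0.4 * df["a"] + df["x"]))))

fit = smf.logit("y ~ a * x", data=df).fit(disp=0)

def design(d):
    # Columns in the formula's order: Intercept, a, x, a:x
    return np.column_stack([np.ones(len(d)), d["a"], d["x"], d["a"] * d["x"]])

def g(beta):
    """G-computation risk difference as a function of the coefficients."""
    expit = lambda z: 1 / (1 + np.exp(-z))
    p1 = expit(design(df.assign(a=1)) @ beta)
    p0 = expit(design(df.assign(a=0)) @ beta)
    return p1.mean() - p0.mean()

beta_hat = fit.params.values
V = fit.cov_params().values           # coefficient covariance matrix

# Gradient of the risk difference w.r.t. the coefficients (central differences)
eps = 1e-5
grad = np.empty(len(beta_hat))
for j in range(len(beta_hat)):
    bp, bm = beta_hat.copy(), beta_hat.copy()
    bp[j] += eps
    bm[j] -= eps
    grad[j] = (g(bp) - g(bm)) / (2 * eps)

rd = g(beta_hat)
se = float(np.sqrt(grad @ V @ grad))  # delta-method SE of the risk difference
```

Swapping in a robust or cluster-robust coefficient covariance for `V` (e.g., from a fit with `cov_type="cluster"`) is how the previous subsection's adjustments feed into this SE.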
### Bootstrapping
Bootstrapping is a technique used to simulate the sampling distribution of an estimator by repeatedly drawing samples with replacement and estimating the effect in each bootstrap sample [@efronBootstrapMethodsStandard1986; @carpenterBootstrapConfidenceIntervals2000]. From the bootstrap distribution, SEs and CIs can be computed in several ways, including using the standard deviation of the bootstrap estimates as the SE estimate or using the 2.5 and 97.5 percentiles as 95% CI bounds. Although @abadieFailureBootstrapMatching2008 found analytically that the bootstrap is inappropriate for matching, simulation evidence has found it to be adequate in many cases [@hillIntervalEstimationTreatment2006; @austinUseBootstrappingWhen2014; @austinEstimatingEffectTreatment2017].
Traditionally, bootstrapping involves performing the entire estimation process in each bootstrap sample, including propensity score estimation, matching or weighting, and effect estimation. For weighting, the traditional bootstrap works very well and is generally the best method to use for estimating CIs [@chanGloballyEfficientNonparametric2016; @austinBootstrapVsAsymptotic2022]. For matching, this method may be conservative in some cases (i.e., the resulting intervals are wider than necessary to achieve nominal coverage) [@austinUseBootstrappingWhen2014]. More accurate intervals have been found when using the matched/cluster bootstrap described by @austinUseBootstrappingWhen2014 and @abadieRobustPostMatchingInference2020. The cluster bootstrap involves sampling matched pairs/strata of units from the matched sample (i.e., after the matching is already done) and estimating the effect within each sample composed of the sampled pairs.
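The cluster bootstrap is simple to sketch: resample matched pairs (not individual units) with replacement and re-estimate the effect in each resample. The toy data below stand in for outcomes from an already-matched sample:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 200
# Toy matched data: column 0 = treated outcome, column 1 = control outcome
pairs = np.column_stack([
    0.5 + rng.normal(size=n_pairs),   # treated outcomes
    rng.normal(size=n_pairs),         # matched control outcomes
])

ests = []
for _ in range(1000):
    idx = rng.integers(0, n_pairs, size=n_pairs)  # resample PAIRS, not units
    boot = pairs[idx]
    ests.append(boot[:, 0].mean() - boot[:, 1].mean())

se = float(np.std(ests, ddof=1))            # bootstrap SE estimate
ci = np.percentile(ests, [2.5, 97.5])       # percentile 95% CI
```

In a full analysis, the effect estimated inside the loop would be whatever weighted estimator or g-computation procedure is used on the original matched sample; only the resampling unit changes.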
| Method | Use for SEs/CIs | Notes |
|--------------------|-------------------|---------------------------------|
| Pair matching w/o replacement and full matching | Cluster bootstrap or cluster-robust SEs | |
| Pair matching w/ replacement | Robust SEs | Traditional bootstrap performs well in simulations, but analytically is invalid; some analytic methods exist for special cases |
| Other matching methods (including propensity score subclassification) | Bootstrap or robust SEs | |
| Propensity score weighting for the ATE or ATO | Bootstrap or asymptotically correct SEs | Bootstrap best used in small samples; robust SEs can be used, too, but are conservative |
| Propensity score weighting for the ATT | Bootstrap or asymptotically correct SEs | Bootstrap best used in small samples; robust SEs can be conservative or anti-conservative and should not be used |
: Conditioning methods and corresponding recommended methods for estimating SEs and CIs {#tbl-se-ci}