
# Planning the Analysis {#sec-planning}

The "fun" part of propensity score analysis is actually running the code on data, but there are critical decisions that need to be made before doing so. One needs to decide whether propensity score analysis is appropriate for answering the question of interest, what quantity is to be estimated, and whether the assumptions for the analysis can be met. The next sections describe making these decisions.

## Types of questions that can be answered

There are several ways to ask causal questions: one is to ask what are the causes of observed variation in outcomes (e.g., why certain patients experience remission and others don't), and another is to ask what the causal effect of one variable is on an outcome (e.g., whether taking aspirin reduces heart attack risk). Propensity score analysis is only appropriate for answering the second type of question, i.e., "What is the effect of a treatment on an outcome?" It is not useful for discovering predictors of an outcome, for developing clinical prediction models, for identifying which variables are important drivers of the outcome, etc. To perform propensity score analysis, one must have a **single** well-defined treatment.

The quantity propensity score analysis is best suited to estimate is the *total effect* of the treatment on the outcome. The total effect refers to an effect that ignores any intermediate pathways. For example, a drug might affect survival by stimulating the immune response in a patient; propensity score analysis can help answer the question of whether the treatment affects the immune response or whether the treatment affects survival, but it cannot answer the mechanistic question of whether the effect of treatment on survival is due to its effect on the immune response. That type of question is a "mediation" question, and though propensity score analysis can play a role in such an analysis, in its basic and most common form, it is not able to answer such questions.

Propensity score analysis is sometimes used in disparities research, i.e., to identify whether a disparity exists on a single individual-level dimension (e.g., sex or race) after controlling for differences in relevant variables between these groups. For example, one might notice that wages differ between men and women in a certain field; propensity score analysis can be used to create groups of men and women that look similar on variables that might be relevant to explaining away that disparity, such as age, experience, education, etc., so that the only explanation left for the disparity is a gender-based bias by the employer.

This type of analysis has several problems and will not be discussed further here (see @huberCausalPitfallsDecomposition2015 for a discussion of these problems). However, it does illustrate that propensity score analysis can be used to create comparable groups no matter what those groups are to be used for; the analysis itself is agnostic to how those groups will be compared and whether that comparison represents a valid causal effect or disparity. It is the assumptions behind the analysis that allow for causal inference; these assumptions are described later in this section.

## Choosing an estimand

An **estimand** is simply the quantity being estimated, i.e., the quantity of interest from a study. Although one might simply say that the estimand is the treatment effect, there are nuances to an estimand that are critical to articulate to be able to validly interpret the resulting estimate and to be able to make choices in the analysis that target the estimand. The concept of an estimand is also present in randomized trials, and many of the considerations for these estimands also apply to those in observational studies, though there are some additional considerations. @kahanEliminatingAmbiguousTreatment2023 and @hanDefiningEstimandsClinical2023 provide nice overviews of the components of an estimand in clinical trials.

Some components of an estimand common to observational studies and randomized trials include who the effect is estimated for in the presence of noncompliance (e.g., just those who actually received the treatment or all those who were assigned treatment), how the effect is measured (e.g., as a risk ratio, hazard ratio, or odds ratio), the time scale of the outcome (e.g., death at 12 months or 24 months), and how intercurrent events (e.g., death before a non-death clinical endpoint) are incorporated. Two components of an estimand that are particularly important to consider in propensity score analysis are whether the effect is to be marginal or conditional and for which subset of the study population the effect is meant to generalize; we focus on these in this tutorial, but that is not a reason to ignore the others.

### Marginal and conditional effects

A marginal effect is a comparison between the expected potential outcome under treatment and the expected potential outcome under control. This is the same quantity estimated in randomized trials without blocking or covariate adjustment and is particularly useful for quantifying the overall effect of a policy or population-wide intervention. A conditional effect is the comparison between the expected potential outcomes under treatment and under control within strata defined by covariates. This is useful for identifying the effect of a treatment for an individual patient or a subset of the population.
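The distinction can be written compactly in potential-outcomes notation. The symbols are assumptions introduced here for illustration: $Y(1)$ and $Y(0)$ are a unit's potential outcomes under treatment and control, and $X$ is a set of covariates defining the strata. On the difference scale, one common way to write the two kinds of effect is:

$$
\begin{aligned}
\text{Marginal:} \quad & E[Y(1)] - E[Y(0)] \\
\text{Conditional:} \quad & E[Y(1) \mid X = x] - E[Y(0) \mid X = x]
\end{aligned}
$$

Other contrasts, such as ratios, follow the same pattern: the marginal effect averages over the covariate distribution before contrasting, whereas the conditional effect contrasts within a covariate stratum.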

Although conditional effects are often more useful for clinical decision-making, they require far stricter assumptions about the relationship between confounders and the outcome and are not well suited for propensity score analysis. Estimating conditional effects either involves performing the analysis within subgroups of the sample (which can dramatically shrink the available data and yield imprecise estimates) or using an outcome model that presupposes a very specific and unrealistic functional form (e.g., using the coefficient on treatment in a logistic regression model with confounders included). We will focus instead on marginal effects, which can be estimated using the full dataset and don't require such assumptions.
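To see why the coefficient on treatment in a logistic regression is a conditional rather than a marginal effect, consider the following minimal sketch. It assumes a hypothetical outcome model with made-up coefficients (`b0`, `bA`, `bX`) and a single binary covariate that is independent of treatment, so there is no confounding at all; none of these values come from real data.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    return p / (1.0 - p)

# Hypothetical outcome model: logit P(Y = 1 | A, X) = b0 + bA * A + bX * X
b0, bA, bX = -1.0, 1.0, 2.0
p_x1 = 0.5  # assumed P(X = 1); X is independent of treatment

def marginal_risk(a):
    """Average the stratum-specific risks over the distribution of X."""
    risk_x0 = expit(b0 + bA * a)
    risk_x1 = expit(b0 + bA * a + bX)
    return (1 - p_x1) * risk_x0 + p_x1 * risk_x1

# Conditional odds ratio: identical in both strata of X by construction
conditional_or = math.exp(bA)

# Marginal odds ratio: contrast of the population-averaged risks
marginal_or = odds(marginal_risk(1)) / odds(marginal_risk(0))

print(f"conditional OR = {conditional_or:.2f}, marginal OR = {marginal_or:.2f}")
```

Under these assumed values the conditional odds ratio is roughly 2.72 and the marginal odds ratio roughly 2.23, even though the covariate does not confound anything. The two are simply different estimands, which is why the choice between them should be made before the analysis.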

### The target population

The target population refers to the population to which the effect is meant to generalize. Selecting this population is critical in propensity score analysis because it determines how specific steps of the analysis proceed and how to interpret them. There are four common estimands that can be targeted in propensity score analysis:

-   The average treatment effect in the population (ATE) - the average difference in outcomes between a scenario in which all units were treated and a scenario in which no units were treated. This is useful for determining universal policies or broadly understanding the effect of a treatment (e.g., Does this treatment work on average?).

-   The average treatment effect in the treated (ATT) - the average difference between the observed outcomes for the treated units and the outcomes the treated units would have had had they not been treated. It can be interpreted as the effect of withholding treatment from those who would otherwise receive it. This is useful for estimating the effects of potentially dangerous exposures or experimental procedures that would be given to patients like those currently receiving it.

-   The average treatment effect in the control (ATC) - the average difference between the observed outcomes for the untreated units and the outcomes the untreated units would have had had they been treated. It can be interpreted as the effect of expanding treatment to those who would not otherwise receive it.

-   The average treatment effect in the overlap (ATO) - the average difference in outcomes for those in an "overlap" population (i.e., a subpopulation of units approximately equally likely to be treated or not) were they to receive treatment and were they not to receive treatment. Although the scope of the ATO is limited and sometimes vague (i.e., because there are many ways to statistically define the overlap population), these estimates tend to be the least biased, most precise, and most resistant to biases due to unobserved variables operating in the extremes.

These estimands, a guide on how to choose among them, and the specific techniques that can be used to estimate them are described in detail in @greiferChoosingEstimandWhen2021.
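One way to see how the target population shapes the analysis is through the weights that each estimand implies when weighting is used. The sketch below uses commonly cited weighting formulas, which are not given in this section and should be treated as an assumption here; the function name `estimand_weights` and the toy values are hypothetical.

```python
import numpy as np

def estimand_weights(treat, ps, estimand="ATE"):
    """Balancing weights implied by each estimand.

    treat: 0/1 treatment indicator; ps: propensity scores P(A = 1 | X).
    """
    treat = np.asarray(treat, dtype=float)
    ps = np.asarray(ps, dtype=float)
    if np.any((ps <= 0) | (ps >= 1)):
        raise ValueError("propensity scores must lie strictly between 0 and 1")
    if estimand == "ATE":  # both groups weighted to resemble the full sample
        return treat / ps + (1 - treat) / (1 - ps)
    if estimand == "ATT":  # treated units left as-is; controls reweighted
        return treat + (1 - treat) * ps / (1 - ps)
    if estimand == "ATC":  # controls left as-is; treated units reweighted
        return treat * (1 - ps) / ps + (1 - treat)
    if estimand == "ATO":  # units with extreme propensity scores downweighted
        return treat * (1 - ps) + (1 - treat) * ps
    raise ValueError(f"unknown estimand: {estimand}")

# Toy data: two treated units followed by two control units
treat = np.array([1, 1, 0, 0])
ps = np.array([0.8, 0.5, 0.5, 0.2])

for name in ("ATE", "ATT", "ATC", "ATO"):
    print(name, estimand_weights(treat, ps, name))
```

The same propensity scores yield different weights, and therefore a different weighted sample, depending on the estimand chosen. This is one reason the estimand needs to be fixed before any modeling begins.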

## Meeting assumptions

There are a number of assumptions that are critical to being able to interpret the effects estimated using propensity score analysis as causal. Other methods for estimating causal effects may involve different assumptions; the assumptions listed here apply only to propensity score analysis and other methods of adjustment that rely on adjusting for observed variables, such as regression adjustment [@matthayAlternativeCausalInference2020]. If these assumptions are violated, the link between the statistical quantity estimated by propensity score analysis and the causal quantity desired by the researcher is broken.

Propensity score analysis can only be used to estimate causal effects when these assumptions are met. In this sense, propensity score analysis is not a "causal inference method"; it is a method of estimating a statistical quantity (the adjusted association between the treatment and outcome), which can only be interpreted as a causal effect when these assumptions are met. The key assumptions are **satisfaction of the backdoor criterion**, **positivity**, and **consistency**. These are in addition to other assumptions that underlie the methods involved, such as assumptions about missing data if any are present and assumptions about correct measurement of the confounders, treatment, and outcome.

### The backdoor criterion

The backdoor criterion is that there are no open "backdoor paths" from the treatment to the outcome. A backdoor path is a noncausal path connecting the treatment to the outcome through a common cause of the treatment and outcome. Satisfaction of the backdoor criterion means that, conditional on the set of variables to be adjusted for, there are no open backdoor paths from the treatment to the outcome, and the only association between the treatment and the outcome is due to the causal effect of the treatment on the outcome. In addition, no variables that induce bias have been adjusted for; these include variables caused by the treatment or the outcome. @vanderweelePrinciplesConfounderSelection2019 provides a clear guide on how to meet this assumption.

Satisfaction of the backdoor criterion is also known as the assumption of "strong ignorability" [@rosenbaumCentralRolePropensity1983], "conditional exchangeability" [@hernanEstimatingCausalEffects2006], "selection on observables", or, simply, no unmeasured confounding. Meeting this assumption requires a researcher to have collected a sufficient set of variables that closes all backdoor paths without opening any biasing paths. This assumption is often considered hard to meet when treatment has not been randomly assigned or the assignment mechanism is unknown, which is why methods that rely on this assumption when used in observational studies are often viewed with suspicion and why claiming causality from observational studies is dangerous.
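For readers who prefer a formal statement, conditional exchangeability is often written in potential-outcomes notation. The symbols are assumptions introduced here for illustration: $Y(a)$ is the potential outcome under treatment level $a$, $A$ is the treatment received, and $X$ is the adjustment set.

$$
Y(a) \perp\!\!\!\perp A \mid X \quad \text{for } a \in \{0, 1\}
$$

In words, within levels of $X$, the treatment a unit received carries no information about what its outcome would have been under either treatment.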

There are a few strategies for meeting this assumption. One is to include all measured variables in the analysis, hoping that the set of measured variables is sufficient to satisfy the criterion. This is known as the "kitchen sink" approach. When the temporal ordering of the variables is clear (i.e., it can be assured that all of the variables to be adjusted for precede treatment), this can be an effective strategy, especially if many variables jointly act as proxies for possibly unmeasured confounders [@brookhartConfoundingControlHealthcare2010a]. It is critical that none of the variables adjusted for could even possibly be affected by the treatment or outcome. Another, more principled strategy is to draw a causal diagram, known as a directed acyclic graph (DAG), that represents what is known about the causal system under study and can be used to select the specific variables that are and are not necessary to close all backdoor paths [@greenlandCausalDiagramsEpidemiologic1999]. Either way, researchers must be prepared to justify why they adjusted for the variables they did and why adjustment for these variables is sufficient to satisfy the backdoor criterion.
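A small simulation can make the consequence of an open backdoor path concrete. The sketch below assumes a hypothetical causal system with a single confounder `x` that affects both treatment and outcome, and a true treatment effect of 1. All values are made up, and the true propensity score is used for weighting only because it is known in simulated data; in practice it would have to be estimated.

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 100_000
true_effect = 1.0

# Hypothetical causal system: x -> a, x -> y, a -> y
x = rng.normal(size=n)
ps = 1 / (1 + np.exp(-x))   # true propensity score P(A = 1 | x)
a = rng.binomial(1, ps)     # treatment depends on the confounder
y = true_effect * a + 2.0 * x + rng.normal(size=n)

# Unadjusted contrast: the backdoor path a <- x -> y is left open
unadjusted = y[a == 1].mean() - y[a == 0].mean()

# Inverse probability weighting (ATE weights) closes the path through x
w = a / ps + (1 - a) / (1 - ps)
adjusted = (np.average(y[a == 1], weights=w[a == 1])
            - np.average(y[a == 0], weights=w[a == 0]))

print(f"true effect = {true_effect:.2f}")
print(f"unadjusted  = {unadjusted:.2f}")
print(f"adjusted    = {adjusted:.2f}")
```

The unadjusted difference in means is far from the true effect, whereas the weighted contrast lands close to it. Had `x` been unmeasured, no amount of weighting on other variables would have recovered the true effect, which is the practical meaning of "no unmeasured confounding".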

### Positivity

Positivity is the assumption that all units are eligible to be either treated or untreated [@westreichInvitedCommentaryPositivity2010]. The idea is that if some units are ineligible to be treated, it doesn't make sense to try to infer what would have happened to them had they been treated.
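Formally, and with notation assumed here for illustration ($A$ for the treatment and $X$ for the variables being adjusted for), positivity is commonly stated as:

$$
0 < P(A = 1 \mid X = x) < 1 \quad \text{for every } x \text{ that occurs in the target population}
$$

That is, no covariate profile guarantees treatment or rules it out.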

Positivity is an assumption about treatment assignment, but there is an empirical version of it often known as "overlap" [@fogartyDiscreteOptimizationInterpretable2016]. Overlap is the extent to which the distributions of covariates in the treated and untreated groups overlap with each other. Even if positivity holds (i.e., all patients in the study population are theoretically eligible for either treatment), it may be that in the sample, there are individual profiles that are absent from one group [@westreichInvitedCommentaryPositivity2010]. In the absence of good overlap, it will be challenging or impossible to use propensity score analysis to make the treatment groups resemble each other on the measured confounders; one is forced either to extrapolate inferences about the groups or to change the target population to one with some overlap (e.g., by targeting the ATO) [@kingDangersExtremeCounterfactuals2006].
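A crude way to inspect overlap in a sample is to compare the range of estimated propensity scores in each group. The sketch below is only one of many possible diagnostics and assumes propensity scores have already been estimated; the function name `common_support` and the toy values are hypothetical.

```python
import numpy as np

def common_support(ps, treat):
    """Range of propensity scores shared by both groups.

    Returns the lower and upper bounds of the shared range and a boolean
    array marking which units fall inside it.
    """
    ps = np.asarray(ps, dtype=float)
    treat = np.asarray(treat).astype(bool)
    lower = max(ps[treat].min(), ps[~treat].min())
    upper = min(ps[treat].max(), ps[~treat].max())
    inside = (ps >= lower) & (ps <= upper)
    return lower, upper, inside

# Toy data: four control units followed by four treated units
ps = np.array([0.05, 0.2, 0.4, 0.6, 0.3, 0.5, 0.8, 0.95])
treat = np.array([0, 0, 0, 0, 1, 1, 1, 1])

lower, upper, inside = common_support(ps, treat)
print(f"shared range: [{lower}, {upper}]")
print(f"units outside the shared range: {int((~inside).sum())} of {len(ps)}")
```

Units outside the shared range have no counterparts in the other group. Dropping them or downweighting them changes the population to which the estimate applies, so such a step should be treated as a change of estimand rather than a neutral cleaning step. Range checks also miss sparse regions inside the shared range, so they complement rather than replace an inspection of the full distributions.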

### Consistency

Consistency is the assumption that there are no unmeasured versions of treatment, i.e., that the treatment values are well-defined [@hernanDoesObesityShorten2008]. Consistency might be violated if there are multiple doses the treatment could take but it is only measured as its presence or absence [@coleConsistencyStatementCausal2009a]. A component of consistency is the stable unit treatment value assumption (SUTVA), which requires that the treatment statuses of other individuals do not affect the outcomes of a given individual [@rosenbaumInterferenceUnitsRandomized2007]. This would also constitute a different "version" of treatment; for example, being given a vaccine when no other patients are vaccinated is a different version of the treatment from being given a vaccine when all other patients are vaccinated (assuming the patients can interact with each other) [@coleConsistencyStatementCausal2009a].
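For a binary treatment, consistency is often summarized by the statement that the observed outcome equals the potential outcome under the treatment actually received. With notation assumed here for illustration ($A$ for the treatment received, $Y$ for the observed outcome, and $Y(1)$ and $Y(0)$ for the potential outcomes):

$$
Y = A\,Y(1) + (1 - A)\,Y(0)
$$

The expression is only meaningful if $Y(1)$ and $Y(0)$ are each a single well-defined quantity for every unit, which is exactly what hidden versions of treatment or interference between units would undermine.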

When treatment versions are identifiable (e.g., a measured dose), there are extensions of propensity score analysis that can be used for multi-category or continuous treatments [@imaiCausalInferenceGeneral2004]. Methods have also been developed for estimating causal effects in the presence of interference, a SUTVA violation [@tchetgenCausalInferencePresence2012]. These more advanced topics will not be covered in this guide.

## Summary

Propensity score analysis is not a general-purpose causal inference method; it is a statistical method that can be used to answer a specific type of question, i.e., the total effect of a treatment on an outcome. The research question must be articulated clearly with an estimand specified prior to the analysis. The estimand must consist not only of the components that are common to randomized trials (e.g., the effect measure, the time scale of the outcome, etc.) but also of the components that are more specific to the analysis of observational studies, which include whether the effect is marginal or conditional and to which target population the effect is meant to generalize. These choices are described clearly by @kahanEliminatingAmbiguousTreatment2023 and @greiferChoosingEstimandWhen2021, which are highly recommended.

Propensity score analysis requires certain assumptions for the effect estimate to be validly interpreted as causal, which include satisfaction of the backdoor criterion, positivity, and consistency [@hernanEstimatingCausalEffects2006]. These assumptions must be assessed by appealing to substantive knowledge about the causal system under study and the variables that are available to the researcher. Satisfaction of the backdoor criterion requires that there is no unmeasured confounding and no variables caused by the treatment or outcome are adjusted for. Positivity requires that all units are eligible to receive either treatment. Consistency requires that there are no unmeasured versions of treatment, including versions defined by the treatment status of other patients in the study. These assumptions and the choices required to satisfy and assess them are described in @vanderweelePrinciplesConfounderSelection2019, @westreichInvitedCommentaryPositivity2010, and @hernanDoesObesityShorten2008, which are highly recommended.