1. Introduction to confounding and Directed Acyclic Graphs (DAGs)
Introduction
Data coming from a clinical trial is always a mixture of signal and noise. In order to analyse the data, one of the most important tasks is to maximise the signal and minimise the noise. But what is this noise? Well, it actually has two components.
The first component is random error. We have random error because we are working with a sample rather than the whole population. We cannot escape random error, but we can reduce it by increasing the sample size. This is also mentioned in the sample size module, and the module on sample surveys.
The second component is systematic error, or bias. Unfortunately, increasing the sample size does not reduce bias – we just end up with an even larger biased trial. So, we try to design trials to minimise bias, and measure and adjust for it if we can’t make it smaller by design. There are literally dozens of different types of bias that studies are prone to, and these can occur at any stage in the research process. The figure below shows the many different types of bias found in clinical trials.
However, arguably the most important is confounding bias, the topic of this module. Notably, although randomised controlled trials are free from confounding bias due to their design, they may still be subject to other types of bias.
Confounding
The word confounding comes from the Latin confundere, or Old French confondre, meaning to mix up. A confounder is a third variable that mixes up the association between exposure and outcome.
There are three common definitions of confounding, namely, the classical, collapsibility, and counterfactual. The classical definition is the one most commonly taught in textbooks of epidemiology.
Classical definition of confounding
The formal definition is: Bias of the estimated effect of an exposure on an outcome due to the presence of a common cause of the exposure and the outcome. This definition requires three things to be true:
- This common cause, or third variable, must be associated with the exposure. This is usually shown by demonstrating an imbalance in the confounding variable across the exposure groups;
- It must be a risk factor for the outcome in the unexposed population. This is often put much less precisely as having to be associated with the outcome;
- Finally, it cannot be on the causal pathway between exposure and outcome. If it is, then it is a mediating variable rather than a confounding variable.
An example of this classical approach is shown below.
Researchers found a strong association between the quantity of coffee consumed, and the incidence of bladder cancer. However, heavy consumers of coffee are more likely to smoke than light consumers of coffee, and smoking is a risk factor for bladder cancer. Hence smoking was muddling the association between coffee consumption and bladder cancer.
Collapsibility definition of confounding
A third variable is a confounder by the collapsibility definition if:
(a) the measure of association is homogeneous across the strata defined by the confounder; and
(b) the crude and common stratum-specific (adjusted) measures of association are unequal (this is called “lack of collapsibility”).
As an example, the table below shows the ten-year risk of developing lung cancer in workers exposed to a particular chemical compound.
| Outcome | Exposed | Not Exposed | Total |
| --- | --- | --- | --- |
| Cancer | 27 | 14 | 41 |
| No cancer | 48 | 67 | 115 |
| Total | 75 | 81 | 156 |
The relative risk of lung cancer given the exposure is: RR = (27 / 75) / (14 / 81) = 2.1
Now let’s see what happens when we stratify by smoking status:
Non-smokers
| Outcome | Exposed | Not Exposed | Total |
| --- | --- | --- | --- |
| Cancer | 1 | 2 | 3 |
| No cancer | 24 | 48 | 72 |
| Total | 25 | 50 | 75 |
The relative risk of lung cancer given the exposure for non-smokers is: RR = (1 / 25) / (2 / 50) = 1.0. In other words, the relative risk has reduced from 2.1 to 1.0.
Smokers
| Outcome | Exposed | Not Exposed | Total |
| --- | --- | --- | --- |
| Cancer | 26 | 12 | 38 |
| No cancer | 24 | 19 | 43 |
| Total | 50 | 31 | 81 |
The relative risk of lung cancer given the exposure for smokers is: RR = (26 / 50) / (12 / 31) = 1.3. The relative risk is now 1.3 compared to 2.1 in the original analysis. In other words, the relative risks in the two strata are very similar and close to 1, whereas the crude relative risk was 2.1: the measure of association is roughly homogeneous across the strata, yet the crude and stratum-specific measures are unequal. Smoking status is therefore clearly a confounder by the collapsibility definition.
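If you want to check the arithmetic, here is a minimal Python sketch that reproduces the crude and stratum-specific relative risks from the tables above (the helper function rr is purely illustrative):

```python
# Relative risk = risk in the exposed divided by risk in the unexposed
def rr(cases_exposed, total_exposed, cases_unexposed, total_unexposed):
    return (cases_exposed / total_exposed) / (cases_unexposed / total_unexposed)

print(round(rr(27, 75, 14, 81), 1))   # crude RR: 2.1
print(round(rr(1, 25, 2, 50), 1))     # non-smokers: 1.0
print(round(rr(26, 50, 12, 31), 1))   # smokers: 1.3
```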
Counterfactual definition of confounding
Suppose we have six patients with osteoarthritis of the knee, and we measure their level of knee pain. We then choose three at random, and give them 1000 mg Panadol. One hour later we measure knee pain levels in all six subjects. Note that this is a typical randomised controlled trial design. Here are the results:
| Subject | Took Panadol | Pain gone |
| --- | --- | --- |
| Alice | No | No |
| Ben | Yes | Yes |
| Charlie | Yes | No |
| David | Yes | Yes |
| Edward | No | Yes |
| Fred | No | No |
Ben was given the Panadol, and his pain disappeared. Does that mean that taking Panadol relieves pain? The answer is that we cannot tell from a single individual. If we knew what would have happened if Ben had not taken the Panadol, then we would have a better idea. This is called a counterfactual. In fact, apart from cross-over trials, for any individual, we only know the results of their exposure or non-exposure, and not the counterfactual.
That being the case, rather than looking at individuals, we must compare the average outcome of those exposed with the average outcome of those not exposed. In a randomised controlled trial, these groups are comparable because of the randomisation, with the only difference between them being the exposure. However, in observational studies, the non-exposed group may not be a good counterfactual population, thus introducing confounding.
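As a quick illustration, here is a tiny Python sketch that computes the two group averages from the Panadol table above and takes their difference as the estimated average effect of the exposure:

```python
# Outcomes from the table above: pain gone (1) or not (0)
took_panadol = {"Ben": 1, "Charlie": 0, "David": 1}
no_panadol   = {"Alice": 0, "Edward": 1, "Fred": 0}

risk_exposed   = sum(took_panadol.values()) / len(took_panadol)   # 2/3
risk_unexposed = sum(no_panadol.values()) / len(no_panadol)       # 1/3

# Estimated average effect of taking Panadol on pain relief
print(round(risk_exposed - risk_unexposed, 2))                    # 0.33
```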
Control of confounding
Confounding can be eliminated or minimised by careful choice of study design, or it can be adjusted for as part of the statistical analysis.
Minimising confounding at the design stage
When designing a study, there are three approaches we can use to minimise the possibility of confounding. These are randomisation, restriction and matching.
In a randomised controlled trial, randomisation maximises the chance of an even balance of potential confounding variables between study groups. This has the effect of breaking any link between exposure and any confounding variables. Importantly, this is true for both measured and unmeasured confounding variables. However, even in randomised controlled trials, there may still be an imbalance in confounding variables between study groups purely by chance, or because of small sample sizes.
We can try to ensure an even balance of a confounding variable across exposure categories by restricting study subjects to only those falling within specific levels of the confounding variable.
For example, an investigator might only select subjects of exactly the same age or same sex.
Clearly, the main problems with restricting recruitment into a study are the impact on sample size, and whether or not the results can be generalised. Also, this approach is a bit tricky when there are many confounding variables. Many drug trials are now using adaptive sampling methods which use sequential restriction to ensure an even balance of covariates.
Matching on confounding variables is commonly used in case-control studies, and sometimes in prospective studies. It can be either pairwise or frequency matching. It has implications for sample size in that often a suitable match cannot be found. Matched data often require a special type of statistical analysis to allow for the lack of independence of observations. Like restriction, matching is also problematic when there are many variables that need matching on.
Minimising confounding at the analysis stage
As long as bias can be measured, it can be adjusted for, and this is certainly true for confounding bias. There are two relatively easy statistical solutions to adjusting for confounding, stratified analysis and multivariable analysis.
In a stratified analysis, we use the same principle as in the stratified design. Suppose that gender is a confounding variable. We would split the data into two strata, males and females, and then evaluate the exposure-outcome association within each stratum. Then, within each stratum, the confounder cannot confound because it does not vary.
For example, below is a graph of the relationship between birth order and the rate of Down’s syndrome births.
Clearly, later order babies have a higher risk of being born with Down’s syndrome. However, mother’s age is a confounder here. Let’s repeat the above graph, but for different maternal age groups. Here we see that within each maternal age stratum, the relationship disappears.
We can obtain an adjusted effect measure by undertaking a stratified analysis using a Mantel-Haenszel approach, as sketched below. Stratified analysis works best when there are few strata (i.e. if only 1 or 2 confounders have to be controlled).
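As an illustration, the sketch below computes the Mantel-Haenszel adjusted relative risk for the chemical-exposure example from the collapsibility section, with the stratum counts entered by hand:

```python
# Each stratum: (cases exposed, total exposed, cases unexposed, total unexposed)
strata = {
    "non-smokers": (1, 25, 2, 50),
    "smokers":     (26, 50, 12, 31),
}

num = den = 0.0
for a, n1, b, n0 in strata.values():
    t = n1 + n0            # stratum total
    num += a * n0 / t      # Mantel-Haenszel numerator term
    den += b * n1 / t      # Mantel-Haenszel denominator term

print(round(num / den, 2))  # adjusted RR, about 1.3
```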
Multivariable adjustment of confounding
In multivariable analysis, we add all possible confounding variables into the regression model. This method used to be called multivariate analysis, but it has been changed to multivariable analysis to distinguish it from the situation where there are two or more dependent variables.
We now have regression models to handle most situations where multivariable modelling is required. However, the approach is not without its assumptions and limitations. Firstly, there should be an adequate number of observations for each parameter in the model to avoid over-fitting.
Secondly, if the covariates are highly correlated amongst themselves, we can run into ill-conditioning. One way round this, as we shall see shortly, is adjustment using the propensity score.
Another technique which is gaining in popularity is treatment-effects estimation. Here, as well as examining the association between exposure and outcome, we look at and use predictors of being in the exposed group. For a randomised controlled trial, this is clearly unnecessary, because randomisation guarantees an even spread of potential confounders between exposure groups. However, for observational studies, there are likely to be predictors of exposure. Here we describe four of the ways you can estimate treatment effects in Stata, though there are many more.
Propensity scores
Propensity scoring is a method of obtaining a single continuous smoothed summary score that can be used to control for a collection of confounding variables in a study. It is the probability that a person will be exposed given a set of observed covariates. The propensity score is usually obtained by logistic regression, where the dependent variable is whether or not the individual is in the exposed group, and the predictor variables are all the potential confounding variables. The predicted probability of being in the exposed group is the propensity score.
An assumption of the propensity score approach is that all confounding variables have been measured. In the setting of an observational cohort study, this is a problematic assumption when using exposure at study baseline. The propensity score can be used as a single covariate, or for a matched analysis.
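Here is a minimal Python sketch of the idea, using simulated data and the statsmodels library; the covariates age and smoker are hypothetical potential confounders:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: age and smoking influence the chance of being exposed
rng = np.random.default_rng(0)
n = 500
age = rng.normal(50, 10, n)
smoker = rng.binomial(1, 0.3, n)
exposed = rng.binomial(1, 1 / (1 + np.exp(-(0.05 * (age - 50) + 0.8 * smoker))))

# Logistic regression of exposure status on the potential confounders;
# the predicted probability of being exposed is the propensity score
X = sm.add_constant(np.column_stack([age, smoker]))
ps = sm.Logit(exposed, X).fit(disp=0).predict()
```

The resulting score ps could then be entered as a single covariate in the outcome model, or used to match exposed and unexposed subjects.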
Inverse probability weighting (IPW)
Inverse probability weighting also uses the propensity score, but in a very different way. Suppose that age is a confounding variable, and the average age of the exposed group is 50 years, but only 40 years in the non-exposed group. We first create a propensity score based on a single covariate, age.
If an individual in the exposed group has an age of 35 years, well below the expected mean age of 50 years, then their propensity score might be 0.2, in other words, there is a low predicted probability of them being in the exposed group.
The IPW for that individual would therefore be 1 divided by 0.2 or 5, so that individual would be represented in the data set 5 times. This would have the effect of lowering the mean age in the exposed group towards that of the non-exposed group, thus making the two groups more comparable.
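A minimal Python sketch of the same idea, again with simulated data and age as the only (hypothetical) confounder:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: older subjects are more likely to be exposed
rng = np.random.default_rng(1)
n = 2000
age = rng.normal(45, 10, n)
exposed = rng.binomial(1, 1 / (1 + np.exp(-(age - 45) / 5)))

# Propensity score from a logistic regression of exposure on age
ps = sm.Logit(exposed, sm.add_constant(age)).fit(disp=0).predict()

# Inverse probability weights: 1/ps for the exposed, 1/(1 - ps) for the unexposed
w = np.where(exposed == 1, 1 / ps, 1 / (1 - ps))

# After weighting, the mean age should be similar in the two groups
print(np.average(age[exposed == 1], weights=w[exposed == 1]),
      np.average(age[exposed == 0], weights=w[exposed == 0]))
```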
Regression adjustment (RA)
In regression adjustment, we create two regression models, one for each exposure or study group. In each model, the outcome is regressed on the covariates. We then use both models to predict the outcome for every subject, and average the difference between the two sets of predictions to get the adjusted treatment effect.
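Here is a minimal Python sketch of regression adjustment, assuming simulated data with a single confounder called age and a true treatment effect of 1.0:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: age confounds the exposure-outcome relationship
rng = np.random.default_rng(2)
n = 1000
age = rng.normal(50, 10, n)
exposed = rng.binomial(1, 1 / (1 + np.exp(-(age - 50) / 10)))
y = 1.0 * exposed + 0.05 * age + rng.normal(size=n)

X = sm.add_constant(age)
m1 = sm.OLS(y[exposed == 1], X[exposed == 1]).fit()  # outcome model in the exposed
m0 = sm.OLS(y[exposed == 0], X[exposed == 0]).fit()  # outcome model in the unexposed

# Predict both potential outcomes for every subject and average the difference
ate = np.mean(m1.predict(X) - m0.predict(X))
print(round(ate, 2))  # should be close to the true effect of 1.0
```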
Instrumental variable (IV)
Suppose X is the exposure variable, Y is the outcome measure, and Z is a set of confounding variables. An IV is one that is associated with X, not associated with Z, and associated with Y only through X. If you collect data on the IV and are willing to make some additional assumptions, then you can estimate the average effect of X on Y, regardless of whether you measured Z.
We can write the IV-Y association as a product of the IV-X and X-Y associations, and solve this equation for the X-Y association. Importantly, IVs can adjust for both measured and unmeasured covariates. IV analysis is usually undertaken by two-stage least squares. Studies using IVs are not common because of the difficulty of finding a suitable IV.
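A minimal sketch of two-stage least squares in Python with simulated data; z plays the role of the instrument, u is an unmeasured confounder, and the true effect of the exposure x on the outcome y is set to 0.5:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 5000
z = rng.normal(size=n)                 # instrument
u = rng.normal(size=n)                 # unmeasured confounder
x = 0.8 * z + u + rng.normal(size=n)   # exposure depends on the instrument and u
y = 0.5 * x + u + rng.normal(size=n)   # outcome; true causal effect of x is 0.5

# Naive regression of y on x is biased by the unmeasured confounder
naive = sm.OLS(y, sm.add_constant(x)).fit()

# Two-stage least squares by hand:
# stage 1: regress x on z and keep the fitted values
x_hat = sm.OLS(x, sm.add_constant(z)).fit().fittedvalues
# stage 2: regress y on the stage-1 fitted values
iv = sm.OLS(y, sm.add_constant(x_hat)).fit()

print(round(naive.params[1], 2), round(iv.params[1], 2))  # IV estimate close to 0.5
```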
Directed acyclic graphs (DAGs)
Although the name sounds scary, DAGs consist of just two elements, variables (or nodes in mathematical speak), and unidirectional arrows or paths. We first need to understand some of the rules of DAGs.
These graphs are called DIRECTED because arrows are only allowed to go in one direction.
One could argue that this is not very realistic because of things like reverse causation, but it does keep it simple, and ensures that things go forward over time.
The graphs are called ACYCLIC because no Outcome (O) can be a cause of its own Exposure (E), and by that route, a cause of itself.
A natural path between two variables is a sequence of arrows, regardless of their direction, that connects them.
Think of it as a path you can walk along, or a bridge. Here we see a natural path between A and Z.
A causal or directed path between two variables is a natural path in which all of the arrows point in the same direction.
For example, in the right hand corner there are 4 causal paths between A and Z. They are:
A->Z
A->B->Z
A->C->D->Z
A->B->C->D->Z
A->C<-B->Z is not a causal path because one of the arrows is pointing in the wrong direction – but it is a natural path.
In this same diagram, B and C are known as children of A.
A is the parent of B and C.
D and Z are descendants of A.
A is an ancestor of D and Z.
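If you would like to verify the path-counting, here is a small Python sketch using the networkx library, with the edge list inferred from the paths described above:

```python
import networkx as nx

# The example DAG: each edge points from cause to effect
g = nx.DiGraph([("A", "B"), ("A", "C"), ("A", "Z"),
                ("B", "C"), ("B", "Z"),
                ("C", "D"), ("D", "Z")])

# Causal (directed) paths follow the arrows all the way from A to Z
for path in nx.all_simple_paths(g, "A", "Z"):
    print(" -> ".join(path))   # prints the four causal paths listed above
```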
There are three types of natural paths between Exposure and Outcome, these are causal paths, confounding paths and colliding paths. Let’s look at some examples.
Causal path
We have already previously defined a causal path. Here we see Exposure (E) linked to Outcome (O) via the Mediating variable (C). Mediating variables are those which form part of a causal pathway.
As an example, sexual promiscuity is a risk factor for Human Papilloma Virus, which itself is a risk factor for cervical cancer. So in this case, HPV is a mediating variable.
Confounding path
In a confounding path, the confounding variable is a parent of both Exposure and Outcome.
This is the same example we saw earlier. In DAG terminology, Smoking status is a parent of Coffee consumption and Bladder cancer.
Colliding path
In a Colliding path, C is a child of both Exposure and Outcome, and is known as a collider.
Here we see that admission to hospital is a collider as it is a child of both Pneumonia and having an Ulcer.
Backdoor path
A backdoor path is a non-causal path from E to O.
It usually starts with an arrow pointing toward the Exposure. It is a path that would remain if we were to remove any arrows pointing away from E (these are the potentially causal paths from E, sometimes called frontdoor paths). Backdoor paths between E and O generally indicate common causes of E and O. The simplest possible backdoor path is the simple confounding situation seen above.
Open path
An open path is any causal or directed path between Exposure and Outcome and implies that Exposure is associated with Outcome.
In the above case, we are implying that Exposure causes Outcome. The path is called “open” in the sense that an association between Exposure and Outcome is possible. The path is also open if there are one or more mediating variables on the causal path.
Conditioning
If there is a backdoor path, this means that there are now two paths open: the direct causal path between Exposure and Outcome, and the indirect backdoor path through the confounding variable. This is what causes the bias.
We can BLOCK the backdoor path by conditioning on the confounding variable C. We show this by a square around the conditioned variable. We condition on C by any of the strategies we discussed earlier – for example, restriction, covariate adjustment, inverse probability weighting, etc.
Blocked path
A collider automatically blocks the path it lies on, so that the only open association between Exposure and Outcome is through the direct causal path.
However, if you condition on a collider, it has the effect of opening the indirect path, thus potentially introducing bias.
Path-specific terms
Notably, a collider (or a confounder) on one path need not be a collider (or confounder) on another path.
Here we see that C is a collider on the path A←B→C←D→Z, and a confounder on the path A←C→Z.
Summary
Here is a summary of what we have learnt so far.
A confounder is a common cause of Exposure and Outcome. The association between Exposure and Outcome may be biased through the backdoor path that is left open, but can be blocked by conditioning on the confounder. A collider is a common effect of Exposure and Outcome, and blocks the indirect path, so cannot cause bias. Conditioning on a collider, opens the backdoor path and may introduce bias.
DAGitty
There is a wonderfully-named website called DAGitty (http://dagitty.net/). It allows us to draw the DAG, and then tells us what we should adjust for. Below I have drawn a DAG from a study looking at the recovery of 146 patients after a stroke, and whether their recovery was better if they exercised regularly before the stroke. Exercise is whether or not they exercised regularly before the stroke, LOS is their length of hospital stay, Discharge is whether they were discharged back to their residence, Recovery is a composite summative score calculated from the Stroke Impact Scale, and PCS is the physical component score from the SF-36 quality of life scale measured post-stroke.
Note that DAGitty automatically colour codes the different natural paths:
Green for causal paths;
Red for biased paths;
Black for other paths.
When we ask DAGitty to analyse the above diagram, it tells us that in order to obtain the total effect of exercise on recovery (that is, both direct and indirect effects), we should condition only on Sex. If we want to obtain the direct effect of exercise on recovery, then we should condition on either Discharge and Sex, or LOS and Sex.
This is the end of the module. We do hope that it has given you a better appreciation of confounding and how to adjust for it.