# Correlation Vs Causation

December 19, 2022

By now we have probably all heard the old adage, “Correlation does not equal causation.” But what does this mean for the field of data science? Businesses often try to solve complex problems with machine learning, but machine learning is not always the best tool, especially for evaluating interventions. Machine learning has many applications, but the relationships its models find between variables are correlations, not causal relationships. So, if you just need an accurate output and don’t need to understand the underlying factors causing that output, machine learning may be for you! If, on the other hand, you are trying to evaluate a business decision or action and the impact it had on revenue or other key metrics, what you really want to understand is the causal relationship between your intervention and the resulting outcome. That kind of analysis is better suited to causal inference, which I will demo in this blog.

Let’s suppose that you work for the Volusia County government in Florida (the shark attack capital of the world). One of your tasks is to reduce the incidence of shark attacks on Volusia beaches. A data analyst is giving a presentation and shows a chart in which ice cream sales and shark attacks rise and fall together.

Additionally, the analyst has an algorithm that can predict shark attacks with ~95% accuracy using ice cream sales as one of the predictor variables. You wonder how knowing this information and having an algorithm helps you reduce the incidence of shark attacks. Furthermore, one of your coworkers exclaims, “That’s it! We should ban the sale of ice cream on our beaches! Clearly it is causing shark attacks!” Immediately, you are skeptical. It doesn’t seem like ice cream would have any impact on shark attacks. And your instincts are correct. Something else is occurring here.

The answer lies in confounding variables. Ice cream consumption and shark attacks both occur more often in warmer temperatures, since people buy ice cream and swim at the beach when it is warm outside. The confounding variable here is the outside temperature. A confounding variable is a variable you’re not investigating that influences both the thing you think is the cause and the outcome you are measuring, which makes the two look related even when neither affects the other. It is exactly the reason why correlation does not equal causation!
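To make the confounding story concrete, here is a minimal simulation sketch. All of the numbers are made up for illustration (they are not Volusia County data), and the variable names are assumptions. Temperature drives both ice cream sales and shark attacks, and ice cream has no effect on attacks at all. The two series should still come out strongly correlated, and the correlation should all but disappear once temperature is accounted for.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def residuals(ys, xs):
    """What is left of ys after removing a straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - my - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Made-up data: temperature drives both series; ice cream never affects sharks.
temperature = [rng.uniform(15, 35) for _ in range(5000)]
ice_cream_sales = [20 * t + rng.gauss(0, 50) for t in temperature]
shark_attacks = [0.1 * t + rng.gauss(0, 0.5) for t in temperature]

# Correlation as the analyst's chart would show it.
raw_corr = pearson(ice_cream_sales, shark_attacks)

# Correlation after removing the part of each series explained by temperature.
partial_corr = pearson(residuals(ice_cream_sales, temperature),
                       residuals(shark_attacks, temperature))

print(f"raw correlation:                    {raw_corr:.2f}")
print(f"correlation controlling for temp.:  {partial_corr:.2f}")
```

A prediction model trained on this data would happily use ice cream sales as a feature, which is why good predictive accuracy says nothing about what would happen if ice cream were banned.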

So, back to the issue at hand: how do we reduce the incidence of shark attacks? One hypothesis is that increasing the number of lifeguards on duty would allow sharks to be spotted sooner, so that we can get people out of the water before they are attacked. You want to know whether the increase in the number of lifeguards last summer was the reason shark attacks fell, because if it was, you could further reduce shark attacks by securing funding to hire more lifeguards. But how do we make sure that we account for possible confounding variables and that the observed decrease in shark attacks wasn’t due to chance? Enter causal inference to save the day!

There are multiple ways we can control for confounding variables. Three methods that are regularly used are:

- Back-door criterion
- Front-door criterion
- Instrumental variables

The back-door and front-door criteria come from Judea Pearl’s work on do-calculus, which you can read about in his book, “Causality: Models, Reasoning, and Inference.” Instrumental variables were introduced as early as 1928 by Philip Wright and are frequently used in econometrics.

**Back-door Criterion**

This method requires that there are no hidden confounding variables: every variable that influences both the intervention and the outcome must be observed in the data so that we can control for it. It’s not always possible to rule out every possible confounder, but with well-reasoned hypotheses about how the variables relate, we can be reasonably certain.
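As a rough illustration of back-door adjustment, the sketch below simulates data in which busy beach days (an assumed, observed confounder) both attract extra lifeguards and raise the attack risk. The data-generating numbers, the `busy` variable, and the continuous “attack risk” outcome are all hypothetical. Comparing days with and without extra lifeguards directly gives a misleading answer, while comparing within each level of the confounder and then averaging should recover the effect that was built into the simulation.

```python
import random

rng = random.Random(42)
TRUE_EFFECT = -1.0  # effect of extra lifeguards built into the fake data

rows = []
for _ in range(20000):
    busy = rng.random() < 0.5                              # observed confounder
    extra_guards = rng.random() < (0.8 if busy else 0.2)   # busy days get more guards
    attack_risk = 2.0 + 3.0 * busy + TRUE_EFFECT * extra_guards + rng.gauss(0, 1)
    rows.append((busy, extra_guards, attack_risk))

def mean(values):
    return sum(values) / len(values)

def diff_in_means(data):
    """Average outcome with extra guards minus average outcome without."""
    treated = [y for _, t, y in data if t]
    control = [y for _, t, y in data if not t]
    return mean(treated) - mean(control)

# Naive comparison: ignores that guarded days are mostly busy days.
naive = diff_in_means(rows)

# Back-door adjustment by stratification: compare like with like inside each
# confounder level, then weight each stratum by its share of the data.
adjusted = 0.0
for level in (False, True):
    stratum = [r for r in rows if r[0] == level]
    adjusted += len(stratum) / len(rows) * diff_in_means(stratum)

print(f"naive estimate:    {naive:+.2f}")
print(f"adjusted estimate: {adjusted:+.2f} (built-in effect {TRUE_EFFECT:+.2f})")
```

In this setup the naive estimate even has the wrong sign, suggesting that lifeguards increase attacks, simply because lifeguards are concentrated on the riskiest days.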

**Front-door Criterion**

With this method you can have a hidden confounding variable, as long as you have a third variable that mediates the effect of the intervention on the outcome and is not itself impacted by the confounding variable (Ex: level of alertness is a mediating variable between the intervention, lack of sleep, and the outcome, academic achievement).
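Here is a minimal sketch of the front-door idea using the sleep example, under strong simplifying assumptions: all relationships are linear, the mediator carries the entire effect, and the numbers are invented. A hidden confounder (think of it as stress) pushes sleep loss up and grades down, and the estimator never gets to see it. The effect is pieced together in two steps: the effect of sleep loss on alertness, times the effect of alertness on grades while holding sleep loss fixed.

```python
import random

rng = random.Random(1)
sleep_loss, alertness, grades = [], [], []
for _ in range(20000):
    u = rng.gauss(0, 1)                      # hidden confounder, never used below
    t = u + rng.gauss(0, 1)                  # intervention: sleep loss
    m = -0.5 * t + rng.gauss(0, 1)           # mediator: alertness, driven only by t
    y = 2.0 * m - 3.0 * u + rng.gauss(0, 1)  # outcome: grades
    sleep_loss.append(t)
    alertness.append(m)
    grades.append(y)

TRUE_EFFECT = -0.5 * 2.0  # sleep loss -> alertness -> grades

def centered(values):
    avg = sum(values) / len(values)
    return [v - avg for v in values]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

t, m, y = centered(sleep_loss), centered(alertness), centered(grades)

# Naive slope of grades on sleep loss: biased by the hidden confounder.
naive = dot(t, y) / dot(t, t)

# Step 1: effect of the intervention on the mediator (no confounding here).
a = dot(t, m) / dot(t, t)

# Step 2: effect of the mediator on the outcome, holding the intervention
# fixed (coefficient on m in a least-squares fit of y on m and t).
b = ((dot(t, t) * dot(m, y) - dot(m, t) * dot(t, y))
     / (dot(m, m) * dot(t, t) - dot(m, t) ** 2))

front_door = a * b

print(f"naive estimate:      {naive:+.2f}")
print(f"front-door estimate: {front_door:+.2f} (built-in effect {TRUE_EFFECT:+.2f})")
```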

**Instrumental Variables**

You can also have a hidden confounding variable as long as you have a third variable (the instrument) that is correlated with the intervention, affects the outcome only through the intervention, and is not impacted by the confounding variable (Ex: if you want to know the effect of classroom size on test scores, you would need a variable that is highly correlated with classroom size, has no impact on test scores except through classroom size, and is not impacted by the confounding variable, school funding and resources). These can be hard to come by.
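The instrumental variable logic can be sketched with the Wald estimator on simulated data. Everything here is an assumption for illustration: a binary instrument that nudges the intervention, a hidden confounder the estimator never sees, and a built-in effect of 2. The Wald estimate divides the instrument’s effect on the outcome by the instrument’s effect on the intervention.

```python
import random

rng = random.Random(7)
TRUE_EFFECT = 2.0  # effect of the intervention built into the fake data

instrument, treatment, outcome = [], [], []
for _ in range(40000):
    u = rng.gauss(0, 1)                   # hidden confounder
    z = 1 if rng.random() < 0.5 else 0    # instrument: independent of u
    t = 1.0 * z + u + rng.gauss(0, 1)     # intervention: moved by z and by u
    y = TRUE_EFFECT * t + 3.0 * u + rng.gauss(0, 1)  # z affects y only through t
    instrument.append(z)
    treatment.append(t)
    outcome.append(y)

def mean(values):
    return sum(values) / len(values)

def mean_where(values, flags, wanted):
    return mean([v for v, f in zip(values, flags) if f == wanted])

# Naive slope of outcome on intervention: biased by the hidden confounder.
t_bar, y_bar = mean(treatment), mean(outcome)
naive = (sum((t - t_bar) * (y - y_bar) for t, y in zip(treatment, outcome))
         / sum((t - t_bar) ** 2 for t in treatment))

# Wald estimator: (change in outcome per unit of instrument)
#               / (change in intervention per unit of instrument).
wald = ((mean_where(outcome, instrument, 1) - mean_where(outcome, instrument, 0))
        / (mean_where(treatment, instrument, 1) - mean_where(treatment, instrument, 0)))

print(f"naive estimate: {naive:.2f}")
print(f"Wald estimate:  {wald:.2f} (built-in effect {TRUE_EFFECT:.2f})")
```

Note that the instrument is correlated with the outcome in this data. What matters is that the only route from instrument to outcome runs through the intervention.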

**ATE and CATE**

At this point we should take a step back and understand ATE, CATE, and counterfactuals. Oftentimes, we don’t just want to know whether the intervention had a statistically significant effect on the outcome we are interested in. We also want to know the magnitude of that effect. In our shark attack example, we would want to know how many shark attacks we prevented by increasing the number of lifeguards. This is called the ATE, or the Average Treatment Effect. If we wanted to know how our intervention impacted different beaches, then we would use the Conditional Average Treatment Effect (CATE), which just tells us the average treatment effect for a subset of the population.

The ATE is calculated by taking the average difference between the outcome with the intervention and the outcome without the intervention. So, in this example, we would take the difference between the outcome of increasing lifeguards and the outcome of not increasing lifeguards. But if we can only give one intervention at a time (we can’t simultaneously increase and not increase lifeguards), how can we know the outcome of the intervention that the beach did not get? This is where counterfactuals come in. Counterfactuals are things that did not happen, but could have happened (Ex: Joe got the treatment and recovered in 10 days; the counterfactual outcome is what would have happened to Joe had he not gotten the treatment). I will not get into the weeds of how this is calculated, but at a high level the missing outcome is estimated using covariates.
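A simulation is the one place where both outcomes for the same unit can be seen side by side, which makes it a handy way to pin down the definitions. The sketch below invents two beaches, “A” and “B”, with different built-in effects (all names and numbers are hypothetical). It computes the ATE and CATE from the full set of potential outcomes, then shows what an analyst could estimate when only one outcome per unit is observed and the intervention was assigned by coin flip.

```python
import random

rng = random.Random(3)
EFFECT_BY_BEACH = {"A": -2.0, "B": -0.5}  # hypothetical per-beach effects

units = []
for i in range(10000):
    beach = "A" if i % 2 == 0 else "B"
    y0 = 5.0 + rng.gauss(0, 1)          # outcome WITHOUT extra lifeguards
    y1 = y0 + EFFECT_BY_BEACH[beach]    # outcome WITH extra lifeguards
    treated = rng.random() < 0.5        # coin-flip assignment
    units.append((beach, y0, y1, treated))

def mean(values):
    return sum(values) / len(values)

# With both potential outcomes in hand, the effects are simple averages.
true_ate = mean([y1 - y0 for _, y0, y1, _ in units])
true_cate = {
    name: mean([y1 - y0 for beach, y0, y1, _ in units if beach == name])
    for name in EFFECT_BY_BEACH
}

# In real data only one outcome per unit is observed; the other is the
# counterfactual. With random assignment, a difference in means estimates the ATE.
observed_treated = [y1 for _, _, y1, treated in units if treated]
observed_control = [y0 for _, y0, _, treated in units if not treated]
estimated_ate = mean(observed_treated) - mean(observed_control)

print(f"true ATE:      {true_ate:+.2f}")
print(f"true CATE:     A {true_cate['A']:+.2f}, B {true_cate['B']:+.2f}")
print(f"estimated ATE: {estimated_ate:+.2f}")
```

The difference in means works here only because assignment was random. When it is not, the missing outcomes have to be filled in with the help of covariates, which is what the estimation methods are for.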

**Estimates**

Once we know whether we need the ATE or the CATE and which method we are using to control for confounding variables, we can choose the method we will use to calculate our estimate. Typically, if our data has low dimensionality/complexity, we can use simple methods like matching, stratification, propensity score matching, inverse propensity weighting, and the Wald estimator. If we want to calculate the CATE or we have high-dimensional/complex data, we can use more advanced ML methods such as Double ML, Orthogonal Random Forests (OrthoForests), T-Learners, X-Learners, and Intent-to-Treat DRIV. Back-door methods we could use include linear regression, distance matching, propensity score stratification, propensity score matching, and propensity score weighting. Instrumental variable methods we could use include the Wald estimator and regression discontinuity. If the front-door criterion is met, we could use a two-stage regression. This is not an exhaustive list, just a set of potential methods we could use to calculate the ATE.
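To give a feel for one of the simple estimators, here is a minimal inverse propensity weighting sketch on invented data. The `summer` confounder, the probabilities, and the built-in effect are all assumptions. The propensity score is estimated as the share of treated units within each confounder level, and each unit is then weighted by the inverse of the probability of the treatment it actually received, so that the treated and untreated groups end up with the same mix of summer and non-summer days.

```python
import random

rng = random.Random(11)
TRUE_EFFECT = -1.5  # effect built into the fake data

rows = []
for _ in range(20000):
    summer = rng.random() < 0.4                        # observed confounder
    treated = rng.random() < (0.7 if summer else 0.3)  # treatment depends on it
    outcome = 1.0 + 2.0 * summer + TRUE_EFFECT * treated + rng.gauss(0, 1)
    rows.append((summer, treated, outcome))

# Step 1: propensity score = estimated P(treated | confounder level).
propensity = {}
for level in (False, True):
    flags = [t for s, t, _ in rows if s == level]
    propensity[level] = sum(flags) / len(flags)

# Step 2: weight each unit by 1 / P(the treatment it actually received).
w_treated = wy_treated = w_control = wy_control = 0.0
for summer, treated, outcome in rows:
    p = propensity[summer]
    if treated:
        w_treated += 1 / p
        wy_treated += outcome / p
    else:
        w_control += 1 / (1 - p)
        wy_control += outcome / (1 - p)

# Step 3: difference of the weighted average outcomes.
ipw_ate = wy_treated / w_treated - wy_control / w_control

print(f"IPW estimate: {ipw_ate:+.2f} (built-in effect {TRUE_EFFECT:+.2f})")
```

With more than a handful of covariates, the propensity score would usually come from a fitted model such as logistic regression rather than from simple group shares.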

**Refutation**

Once we have calculated the ATE or the CATE, we have one more step to perform. This is the refutation step. Refutation tests check the robustness of the estimate. This is essentially a validation test that looks for violations in our assumptions when we calculated our estimates. Some refutation tests we can do to check the strength of our causal relationship are:

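- Adding a random cause variable to see if that significantly changes the ATE/CATE (it should not)
- Replacing the intervention with a random (placebo) variable to see if the ATE/CATE drops to roughly zero (it should)
- Removing a random subset of the data to see if that significantly changes the ATE/CATE (it should not)

If these refutation tests come back insignificant (p-value above .05), meaning the re-estimated effect is not significantly different from what we would expect if our assumptions held, then the estimate is robust and we can be more confident that the intervention is causal.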
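Two of these checks can be sketched in a few lines. The example below uses invented data and a simple stratified estimator as a stand-in for whichever estimation method was chosen. It is an illustration of the idea, not a full significance test: it compares the re-estimated effects by eye rather than computing p-values.

```python
import random

rng = random.Random(5)
rows = []
for _ in range(20000):
    busy = rng.random() < 0.5                         # observed confounder
    treated = rng.random() < (0.8 if busy else 0.2)
    outcome = 2.0 + 3.0 * busy - 1.0 * treated + rng.gauss(0, 1)
    rows.append((busy, treated, outcome))

def mean(values):
    return sum(values) / len(values)

def stratified_ate(data):
    """Difference in means within each confounder level, weighted by size."""
    total = 0.0
    for level in (False, True):
        stratum = [r for r in data if r[0] == level]
        treated = [y for _, t, y in stratum if t]
        control = [y for _, t, y in stratum if not t]
        total += len(stratum) / len(data) * (mean(treated) - mean(control))
    return total

original = stratified_ate(rows)

# Placebo test: shuffle the treatment column so it carries no information.
# A sound estimate should collapse toward zero.
shuffled = [t for _, t, _ in rows]
rng.shuffle(shuffled)
placebo = stratified_ate([(z, t, y) for (z, _, y), t in zip(rows, shuffled)])

# Subset test: re-estimate on a random 80% of the data.
# A sound estimate should barely move.
subset = stratified_ate(rng.sample(rows, int(0.8 * len(rows))))

print(f"original estimate: {original:+.2f}")
print(f"placebo estimate:  {placebo:+.2f} (want: near zero)")
print(f"subset estimate:   {subset:+.2f} (want: near original)")
```

Passing these checks does not prove the causal story is right, but failing one is a clear sign that an assumption behind the estimate is off.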

**Summary**

For many of the problems that businesses face, causal inference is much better suited than a machine learning model. It allows us to identify and quantify a causal relationship between our intervention and the outcome we observe. Going a step beyond identifying correlations to identifying causal relationships can be a very impactful exercise. If we know that increasing lifeguards is not just *associated* with a reduction in shark attacks, but that it *causes* this reduction, we have a direct action we can take to achieve our goal of reducing shark attacks (increasing lifeguards).

**Interested in Causal Inference?**

Strive’s Data & Analytics team can help you identify causal relationships in your business. Want to know if the marketing strategy rolled out in Q3 caused an increase in customers and revenue in Q4? Would you like to know if implementing a more robust PTO policy could decrease employee churn? No matter what causal question you have, we are happy to help! Our team of data analysts and data scientists is uniquely positioned to help you take action that will allow your business to reach its goals and beyond! Let us uncover valuable insights that will help your company today!
