The Ultimate Literature Review for Causal Inference

As highlighted by the recent announcement of the 2021 Nobel Prize in economics, causal inference techniques and empirical studies based on observational data have come back to researchers’ attention. The announcement says:

This year’s Laureates – David Card, Joshua Angrist and Guido Imbens – have provided us with new insights about the labour market and shown what conclusions about cause and effect can be drawn from natural experiments. Their approach has spread to other fields and revolutionised empirical research.

In my previous article, I discussed what causal inference is, why we need it, and which methodologies are used to conduct it:

Causal Inference: What, Why, and How

Motivated by the Nobel Prize announcement and the laureates’ academic work, in this article I want to focus on the academic background and industry applications of different causal inference techniques. Causal inference is widely used in tech companies to support data-driven decisions. This article explores the potential of combining causal inference with machine learning algorithms, extending its use cases beyond academia. Rather than going into the technical details, this article serves as a high-level literature review, with links to blog posts and research papers that go deeper. I suggest you add this article to your reading list and read the linked articles that interest you from time to time.

Generally speaking, we can classify methodologies related to causal inference into three categories:

Experiment/Field study/Randomized Controlled Trial: Actively divide subjects into control and treatment groups at random to evaluate the causal link between the treatment and the outcome of interest. Randomness is key to ensuring trustworthy results.
Quasi-Experiment: Causal inference based on observational data. Since the randomness in the data may be intentional or unintentional, researchers need to choose methodologies carefully and check their specific assumptions. I will discuss different methodologies and their use cases in this article.
Natural Experiment: Empirical studies in which subjects are exposed to treatment and control conditions by natural events, such as a lottery draw, month of birth, or an immigration law. Since the randomness comes mainly from natural events that have already happened, researchers can analyze the causal link with observational data.

In the following sections, I will discuss the three categories in detail.

Part 1: Experiment

1. A/B Testing

The most common causal inference technique used in tech companies is A/B testing. There is nothing more powerful for aiding marketing decisions than running A/B tests among users. Tons of resources are out there teaching A/B testing. Specifically, check out:

Udacity free course: A/B testing by Google
Advanced A/B testing workshop by Elea McDonnell Feit with R code
Textbook: Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing

Knowing the basics about A/B testing is not enough to run the experiment. It is important to know whether the results you would get through the experiment are meaningful and trustworthy. As the saying goes:

"Bad test design is worse than no design"

There are certain areas to focus on before claiming solid results:

For a summary, check out the common pitfalls of A/B testing by Leihua Ye, Ph.D. researcher (he writes great articles about A/B testing on Medium; more are cited in the following sections):

8 Common Pitfalls of Running A/B Tests

How to choose the right sample size? The sample size should be set before running the experiment, based on the desired confidence level, power, effect size, and sample/population variance. The calculation can be done with online calculators. For more details on how sample size relates to these elements and the statistics behind it, check out my previous blog post:

How is Sample Size Related to Standard Error, Power, Confidence Level, and Effect Size?
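If you prefer to see the calculation rather than trust an online calculator, here is a minimal sketch of the textbook normal-approximation formula for comparing two conversion rates. The function name, the 10% baseline rate, and the 2-point minimum detectable effect are illustrative assumptions of mine, not numbers from any of the linked posts.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline_rate, min_detectable_effect, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided test of two proportions."""
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the confidence level
    z_beta = NormalDist().inv_cdf(power)           # critical value for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)       # sum of the two Bernoulli variances
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical inputs: 10% baseline conversion, detect a lift of 2 percentage points.
n = sample_size_per_group(baseline_rate=0.10, min_detectable_effect=0.02)
print(f"users needed per group: {n}")
```

Halving the minimum detectable effect roughly quadruples the required sample, which is why small expected lifts need long-running experiments.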

Another issue related to sample size: when should you stop your experiment? If your answer is "stop whenever I see that one version is significant", then you are making a common statistical mistake: p-hacking. In the A/B testing scenario, it is called peeking:

Unlocking Peeking in AB-Tests
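To get a feel for why peeking is a problem, here is a small simulation sketch. It runs A/A tests (so there is no true effect) and compares two stopping rules: declaring a winner at any interim look versus only at the planned end. The number of looks, the sample sizes, and the unit-variance normal outcomes are all assumptions I picked for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_peeking(n_sims=1000, n_per_group=500, n_looks=10, alpha=0.05, seed=0):
    """Return (false positive rate with peeking, false positive rate at the planned end)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    step = n_per_group // n_looks
    peeking_hits = 0  # significant at any interim look
    final_hits = 0    # significant at the final look only
    for _sim in range(n_sims):
        sum_a = sum_b = 0.0
        significant_at_any_look = False
        z = 0.0
        for look in range(1, n_looks + 1):
            for _ in range(step):
                sum_a += rng.gauss(0, 1)  # both groups share the same distribution
                sum_b += rng.gauss(0, 1)
            n = look * step
            # z-statistic for a difference in means with known unit variance
            z = (sum_a / n - sum_b / n) / sqrt(2 / n)
            if abs(z) > z_crit:
                significant_at_any_look = True
        peeking_hits += significant_at_any_look
        final_hits += abs(z) > z_crit
    return peeking_hits / n_sims, final_hits / n_sims


peek_rate, final_rate = simulate_peeking()
print(f"false positive rate with peeking: {peek_rate:.3f}")
print(f"false positive rate at planned end: {final_rate:.3f}")
```

Since the final look is also one of the looks, the peeking rate can never be lower than the planned-end rate, and in this setup it should come out well above the nominal 5%.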

However, it is possible to end an experiment early under certain circumstances. Check out this blog post from Netflix:

Improving Experimentation Efficiency at Netflix with Meta Analysis and Optimal Stopping

Run a randomness check: It is important to know whether the samples are truly randomly assigned, to ensure the differences we observe in the outcomes are driven purely by the treatment. Running an A/A test is very useful: we give the same treatment to both groups and expect the outcomes of the two groups to be the same. Check out these blog posts for more details:

What is an A/A Test? Why Should You Care? | VWO

An A/B Test Loses Its Luster If A/A Tests Fail

Network effect: a violation of SUTVA (the Stable Unit Treatment Value Assumption). When running an experiment, we assume each user is randomly assigned to a group and that one user receiving the treatment will not affect other users’ outcomes. Where this assumption fails, A/B testing becomes unattractive, as with dynamic pricing, promotions offered in a closed market, and social media related testing. Check out how Lyft and LinkedIn solve this issue:

Experimentation in a Ridesharing Marketplace

AB-Testing Challenges in Social Networks

Sample Ratio Mismatch (SRM): this happens when the ratio of samples between the treatment and control groups does not match your expectation, usually a 50/50 split. The mismatch can be problematic since it can bias the treatment effect. Check out this blog post for more details:

The essential guide to Sample Ratio Mismatch for your A/B tests
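A common way to check for SRM is a chi-square goodness-of-fit test on the group counts. Below is a minimal sketch using only the standard library. The function name and the example counts are hypothetical, and which p-value threshold should trigger an alert is a judgment call I leave to you.

```python
from math import erfc, sqrt


def srm_p_value(n_control, n_treatment, expected_control_share=0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) for sample ratio mismatch."""
    total = n_control + n_treatment
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # With 1 degree of freedom, the chi-square tail probability is erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


# Hypothetical counts from an experiment designed as a 50/50 split.
print(srm_p_value(50_000, 50_100))  # small imbalance
print(srm_p_value(50_000, 51_000))  # larger imbalance
```

A very small p-value suggests the split itself is broken, in which case the assignment or logging pipeline deserves a look before the treatment effect does.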

Dealing with multiple comparisons: the chance of a false positive is much higher when running multiple comparisons at the same time. Certain adjustments are recommended to correct for the bias this may induce:

Pitfalls of Multi-arm Experiments
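Two widely used adjustments are the Bonferroni correction and Holm's step-down procedure. The sketch below implements both on a list of p-values. The p-values are made up for illustration, and I am not claiming these are the adjustments the linked post recommends.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject hypothesis i only if p_i <= alpha / m, where m is the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm's step-down procedure: compare the k-th smallest p-value to alpha / (m - k)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# Hypothetical p-values from four treatment arms compared against one control.
p_values = [0.01, 0.02, 0.03, 0.005]
print("Bonferroni:", bonferroni(p_values))
print("Holm:      ", holm(p_values))
```

Both control the family-wise error rate, but Holm is never more conservative than Bonferroni, so it tends to keep more true discoveries.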

2. Other experiments

A/B testing, due to its assumptions, is limited to certain use cases, like web page design and user engagement studies. By adopting the same concept, we can extend experiments to many other applications. For example, Dube and Misra (2017) ran experiments on ZipRecruiter’s customers, showing different customers different prices to estimate their price elasticity. Compared with traditional machine learning algorithms that use historical data to estimate customers’ willingness to pay, field experiments give more reliable results, since we are aiming to measure the causal relationship between price and quantity. However, running a field experiment is more time-consuming and costly, especially on pricing. When running a standard experiment is not feasible, we look to other methodologies to get reliable results.

Part 2: When experiment is not possible: Quasi-Experiments

1. Difference in Differences (DID)

DID is usually used when there are pre-existing differences between the control and treatment groups. We utilize pre-experiment data to control for these baseline differences in the absence of any interventions. The table here summarizes DID:

While it is widely used in economics research papers, it is gaining recognition in industry. Check out this use case at Airbnb:

Experimentation & Measurement for Search Engine Optimization
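To make the DID idea concrete, here is a minimal sketch of the basic 2x2 estimate: the change in the treated group minus the change in the control group. The numbers are toy data I made up, and the estimate is only meaningful if the two groups would have followed parallel trends without the intervention.

```python
from statistics import mean


def did_estimate(control_pre, control_post, treated_pre, treated_post):
    """2x2 difference-in-differences estimate from four lists of outcomes."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    # Subtracting the control change removes the shared time trend.
    return treated_change - control_change


# Toy data: the treated group starts at a higher baseline than the control group.
effect = did_estimate(
    control_pre=[10, 12, 11], control_post=[12, 14, 13],
    treated_pre=[20, 22, 21], treated_post=[25, 27, 26],
)
print(f"DID estimate: {effect}")
```

Here the treated group rises by 5 and the control group by 2, so the estimate attributes the remaining 3 to the treatment, regardless of the baseline gap between the groups.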

2. Uplift Modeling

Uplift modeling is a combination of causal inference and machine learning. When facing the problem of preventing customers from churning, traditional machine learning models use a set of features to predict how likely a customer is to churn. For the same group of customers and the same features, uplift modeling measures the impact of marketing campaigns (the treatment) on customers’ likelihood of churning. Specifically, it divides customers into four categories:

Sure Things: Customers who will not churn, with or without the treatment
Lost Causes: Customers who will churn, with or without the treatment
Sleeping Dogs: Customers who will churn only if given the treatment (promotion, advertisement, etc.)
Persuadables: Customers who will stay only if given the treatment

Uplift modeling helps predict which category a customer will fall into, so that companies can target only Persuadable customers and make the best use of their resources. Resources on uplift modeling include:

Quick Introduction

A Quick Uplift Modeling Introduction

Churn Prevention

Enterprise Causal Inference: Beyond Churn Modeling

With Python code

Uplift Modeling

Political campaign

How uplift modeling helped Obama’s campaign – and can aid marketers

Other more comprehensive resources:

Causal Inference and Uplift Modeling Literature Review

Causal Machine Learning: Individualized Treatment Effects and Uplift Modeling
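As a heavily simplified illustration of the uplift idea, the sketch below computes uplift per customer segment as the retention rate of treated customers minus the retention rate of untreated ones. Real uplift models, like the ones in the resources above, predict this difference from customer features. The segment names and records are hypothetical, and the sketch assumes the treatment was randomly assigned and that every segment has customers in both arms.

```python
from collections import defaultdict


def uplift_by_segment(records):
    """records: iterable of (segment, treated, retained) tuples with boolean flags."""
    # For each segment, keep [retained count, total count] for each arm.
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for segment, treated, retained in records:
        arm = counts[segment][bool(treated)]
        arm[0] += int(retained)
        arm[1] += 1
    uplift = {}
    for segment, arms in counts.items():
        treated_rate = arms[True][0] / arms[True][1]
        control_rate = arms[False][0] / arms[False][1]
        uplift[segment] = treated_rate - control_rate
    return uplift


# Hypothetical campaign data: (segment, received promotion, stayed).
records = (
    [("new", True, True)] * 3 + [("new", True, False)] * 1
    + [("new", False, True)] * 1 + [("new", False, False)] * 3
    + [("loyal", True, True)] * 4 + [("loyal", False, True)] * 4
    + [("annoyed", True, True)] * 1 + [("annoyed", True, False)] * 3
    + [("annoyed", False, True)] * 3 + [("annoyed", False, False)] * 1
)
for segment, value in uplift_by_segment(records).items():
    print(f"{segment}: uplift = {value:+.2f}")
```

In this toy data, the "new" segment behaves like Persuadables (positive uplift), "loyal" like Sure Things (zero uplift), and "annoyed" like Sleeping Dogs (negative uplift), so only the first is worth targeting.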

3. Regression Discontinuity (RDD):

Regression discontinuity measures the treatment effect at a cutoff, under the assumption that subjects located just around the cutoff are quite alike, but are assigned to different groups as if at random:

This method originated in a research paper analyzing the effect of a scholarship program (Thistlethwaite and Campbell, 1960). Nobel Prize winner David Card applied a closely related idea, comparing similar units on either side of a boundary, to analyze the impact of a minimum wage increase on employment in two adjacent states, New Jersey and Pennsylvania, although that study is usually classified as a difference-in-differences design. In industry, Cohen et al. (2016) used RDD to estimate price elasticity for Uber customers using historical data. For more details, check out my blog post here:

An Introduction to Regression Discontinuity Design
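For intuition, here is a minimal sketch of a sharp RDD estimate: fit a separate straight line on each side of the cutoff within a bandwidth, and take the gap between the two fitted values at the cutoff. The data is synthetic with a known jump of 10, and the bandwidth is an arbitrary choice of mine. In practice, bandwidth selection matters a lot, and the sketch assumes there are enough observations on both sides.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def sharp_rdd_estimate(running, outcome, cutoff, bandwidth):
    """Jump in the fitted outcome at the cutoff, using a separate line on each side."""
    pairs = list(zip(running, outcome))
    # Center the running variable so each intercept is the fitted value at the cutoff.
    left = [(x - cutoff, y) for x, y in pairs if cutoff - bandwidth <= x < cutoff]
    right = [(x - cutoff, y) for x, y in pairs if cutoff <= x <= cutoff + bandwidth]
    left_intercept, _ = fit_line(*zip(*left))
    right_intercept, _ = fit_line(*zip(*right))
    return right_intercept - left_intercept


# Synthetic data: a smooth trend plus a jump of 10 for units at or above the cutoff.
running = list(range(101))
outcome = [0.5 * x + (10 if x >= 50 else 0) for x in running]
jump = sharp_rdd_estimate(running, outcome, cutoff=50, bandwidth=20)
print(f"estimated jump at the cutoff: {jump:.2f}")
```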

4. Instrumental Variables

An instrumental variable Z is a variable that is highly correlated with the independent variable X but affects the dependent variable Y only through X.

By measuring the effect of the instrumental variable Z on the outcome variable Y (Z~Y) and how strongly Z is correlated with the treatment X (Z~X), we can identify the causal link between X and Y (X~Y) without worrying about other confounding variables. For applications in tech companies and more details about this method, check out this awesome article by Roblox:

Causal Inference Using Instrumental Variables – Roblox Blog

and this tutorial:

Machine Learning Meets Instrumental Variables
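Here is a minimal sketch of the simplest IV estimator, the ratio of the Z~Y relationship to the Z~X relationship, on simulated data. The data-generating process is entirely my own assumption: an unobserved confounder drives both X and Y, the instrument only shifts X, and the true effect of X on Y is set to 2.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length lists."""
    a_bar = sum(a) / len(a)
    b_bar = sum(b) / len(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (len(a) - 1)


def iv_estimate(z, x, y):
    """Ratio estimator: relationship Z~Y divided by relationship Z~X."""
    return covariance(z, y) / covariance(z, x)


def naive_estimate(x, y):
    """Regression slope of Y on X, which ignores confounding."""
    return covariance(x, y) / covariance(x, x)


def simulate(n=20000, true_effect=2.0, seed=42):
    """Toy data where an unobserved confounder u drives both X and Y."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)      # unobserved confounder
        zi = rng.choice([0, 1])  # instrument: shifts X, unrelated to u
        xi = zi + u + rng.gauss(0, 1)
        yi = true_effect * xi + 3 * u + rng.gauss(0, 1)
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


z, x, y = simulate()
print(f"naive slope: {naive_estimate(x, y):.2f}")
print(f"IV estimate: {iv_estimate(z, x, y):.2f}")
```

The naive slope absorbs the confounder and overshoots, while the IV estimate should land close to the true effect of 2. The price is a noisier estimate, especially when the instrument is only weakly correlated with X.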

5. Synthetic Control:

Synthetic control is very useful when the treatment takes place at an aggregate level (rather than the individual level) and there is only one treated case and a few control cases. This blog post has more details:

Causal Inference Using Synthetic Control: The Ultimate Guide

6. Switch-backs:

As mentioned above, when there are network effects, we cannot run experiments at the individual level because users interact with each other. In this case, we can run the experiment and make the comparison at a more aggregated level, run experiments in the same market at different times, or both. Check out this example by DoorDash for more details:

Switchback Tests and Randomized Experimentation Under Network Effects at DoorDash
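The core of a switchback design is that the randomization unit is a (region, time window) pair rather than a user. Here is a minimal sketch of generating such a schedule. The region names and the 30-minute window length are placeholders of mine, not DoorDash's actual configuration.

```python
import random
from datetime import datetime, timedelta


def switchback_schedule(regions, start, n_windows, window_minutes=30, seed=0):
    """Randomly assign each (region, time window) unit to control or treatment."""
    rng = random.Random(seed)
    schedule = {}
    for region in regions:
        for k in range(n_windows):
            window_start = start + timedelta(minutes=k * window_minutes)
            schedule[(region, window_start)] = rng.choice(["control", "treatment"])
    return schedule


# Hypothetical regions and start time.
schedule = switchback_schedule(["region_a", "region_b"], datetime(2021, 10, 1, 8, 0), n_windows=4)
for (region, window_start), arm in sorted(schedule.items()):
    print(region, window_start.strftime("%H:%M"), arm)
```

Everyone in a region sees the same variant during a window, and outcomes are then compared across region-window units instead of across users.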

7. Local Average Treatment Effect (LATE):

The local average treatment effect was first introduced by Nobel Prize winners Guido Imbens and Joshua Angrist in 1994. It is especially useful when you cannot enforce the treatment on the subjects. For example, you can send email promotions to certain customers and put them in the treatment group, but you will never know for sure whether they have opened the email and seen the promotion unless they make a purchase. The subjects can be classified into four groups:

Always Takers: always take the treatment, whether assigned to the treatment or the control group
Never Takers: always refuse the treatment, whether assigned to the treatment or the control group
Compliers: take the treatment when assigned to the treatment group, and do not take it when assigned to the control group
Defiers: take the treatment when assigned to the control group, and do not take it when assigned to the treatment group

Usually we assume there are no defiers, and that always takers and never takers are present in equal shares in both groups (a consequence of random assignment), so they cancel out of the comparison. In an experiment where we randomly assign treatment, comparing the outcomes of the treatment and control groups only measures the intent-to-treat (ITT) effect, not the average treatment effect (ATE), because of the existence of always takers and never takers. In this case, we can calculate the local average treatment effect, which is the treatment effect on compliers:

LATE = ITT/share of compliers
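The formula above can be sketched in a few lines. Under the no-defiers assumption, the share of compliers is the take-up rate in the treatment group minus the take-up rate in the control group. The toy data is made up: 20 customers, where "took treatment" stands for having actually seen the promotion.

```python
from statistics import mean


def late_estimate(assigned, took_treatment, outcome):
    """LATE = ITT / share of compliers, assuming random assignment and no defiers."""
    treat = [i for i, a in enumerate(assigned) if a]
    control = [i for i, a in enumerate(assigned) if not a]
    # Intent-to-treat effect: compare outcomes by assignment, ignoring actual take-up.
    itt = mean(outcome[i] for i in treat) - mean(outcome[i] for i in control)
    # Share of compliers: difference in take-up rates between the two groups.
    complier_share = (mean(took_treatment[i] for i in treat)
                      - mean(took_treatment[i] for i in control))
    return itt / complier_share


# Toy data: 10 customers assigned to treatment, 10 to control.
assigned = [1] * 10 + [0] * 10
took_treatment = [1] * 6 + [0] * 4 + [1] * 1 + [0] * 9
outcome = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0] + [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
late = late_estimate(assigned, took_treatment, outcome)
print(f"LATE: {late:.2f}")
```

Here the ITT is 0.3 - 0.1 = 0.2 and the share of compliers is 0.6 - 0.1 = 0.5, so the effect on compliers is twice the ITT.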

As the econ joke says:

"It’s better LATE than nothing!"

Part 3: When experiment is not possible: Natural Experiments

A natural experiment is a technique empirical economists use to determine causal relationships. Unlike clinical trials or A/B tests, a natural experiment does not assign subjects to treatment and control groups and collect their results afterwards. Instead, it relies on natural events or other factors outside the control of the investigators. Natural experiments are widely used in academia:

David Card 1990: The Impact of the Mariel Boatlift on the Miami Labor Market
Joshua Angrist 1990: Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social Security Administrative Records
Angrist and Krueger 1991: Does Compulsory School Attendance Affect Schooling and Earnings?

Hopefully, with the recent exposure from the Nobel Prize announcement, we will get to see more applications in industry.

Part 4: What else about causal inference?

There is a lot more literature combining causal inference with machine learning and deep learning to improve algorithms, estimate heterogeneous treatment effects, and so on, all of which is definitely on my reading list:

Causal Tree & Causal Forest:

Causal Tree Learning For Heterogeneous Treatment Effect Estimation

Heterogeneous Treatment Effects: Professor Sekhon’s work

Not Merely Averages: Using Machine Learning to Estimate Heterogeneous Treatment Effects

Causal inference with deep learning:

Causal ML for Data Science: Deep Learning with Instrumental Variables

Causal inference with transfer entropy and graph neural network

To be continued…

That’s all. Which articles about causal inference are on your reading list? Leave a comment below!
