
Using Propensity Score Matching to Measure Downstream Causal Impact of an Event

Suppose a social media platform’s Ads analytics team wants to know: Does seeing a certain ad (or promoted post) cause users to convert or engage more? This causal question is tricky because users who see the ad might inherently differ from those who don’t. In practice, simply comparing conversion rates of exposed vs. unexposed users can be very misleading. Ad exposure is not randomly assigned – algorithms may show ads more to highly active users, or users self-select into seeing or clicking ads. As a result, “unobservable factors make exposure endogenous,” meaning there are hidden biases in who sees the ad. Ideally, we’d run a randomized controlled trial (RCT) (e.g. hold out a control group who never sees the ad) to measure the causal effect. But often RCTs aren’t feasible for broad ad campaigns. This is where propensity score matching (PSM) comes in – it’s basically a statistical way to create apples-to-apples comparisons when you can’t run a proper A/B test. 

In this article, we’ll walk through how a data scientist in a social media Ads division can use propensity scores to estimate the impact of ad exposure on user conversion. We’ll use a simulated dataset of users with information like age, prior engagement, and device type, and we’ll demonstrate how to: 

  • Estimate each user’s propensity (likelihood) of seeing the ad based on their characteristics. 
  • Match ad-exposed users to similar unexposed users using these propensity scores. 
  • Check covariate balance with a before-and-after comparison (including a table of covariate differences and a balance plot). 
  • Estimate the difference in conversion rates attributable to ad exposure on the matched sample. 
  • Discuss key assumptions, limitations, and where propensity score methods fit in the broader causal inference toolbox. 

Propensity Scores: Mimicking a Randomized Experiment 

In an RCT, random assignment of exposure would ensure the exposed and control groups are statistically equivalent (on both observed and unobserved factors) before the treatment. Propensity scores aim to mimic that balance using observational data. Formally, the propensity score is defined as the probability of treatment assignment (here, ad exposure) conditional on observed covariates. In plain terms, it’s each user’s predicted likelihood of seeing the ad given their profile (age, engagement, device, etc.). By matching or adjusting on this single score, we ideally achieve a situation as if we had randomized who sees the ad. Rosenbaum and Rubin (1983) showed that the propensity score is a “balancing score” – conditional on users having the same score, their observed covariates should be balanced between exposed and unexposed groups. 
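In symbols, writing T for the ad-exposure indicator and X for the observed covariates, the propensity score and the balancing property shown by Rosenbaum and Rubin are:

```latex
e(x) = \Pr(T = 1 \mid X = x), \qquad X \perp\!\!\!\perp T \mid e(X)
```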

How do we estimate propensity scores? The most common approach is to train a logistic regression model to predict the probability of receiving the treatment based on covariates. In our case, we’ll model the probability a user was shown the ad as a function of their attributes. (More complex machine learning models like random forests or gradient boosting could also be used for propensity estimation, especially if there are nonlinearities, but logistic regression is a good starting point.) Each user then gets a score between 0 and 1 – for example, a very active 25-year-old mobile user might have an 80% predicted chance of seeing the ad, whereas a less engaged older desktop user might have only 10%. 

Why not just control for the covariates directly? In principle, you could run a regression of conversion on ad exposure plus all the covariates. That is another valid approach, and in fact propensity score methods are asymptotically equivalent to regression adjustment under certain conditions. The advantage of propensity scores is primarily in diagnostics and study design: PSM forces you to check balance and overlap between groups before looking at outcomes. It helps illustrate whether you have comparable groups, whereas a straight regression might mask a lack of overlap or extrapolate into regions with no data. In short, propensity score matching tackles selection bias head-on by explicitly pairing or weighting users to create a pseudo-experiment. 

Data Setup: Simulating an Ad Exposure Scenario 

To make this concrete, let’s simulate a dataset for our social media platform scenario. Imagine we have 1,000 users with the following characteristics: 

  • Ad Exposure (treatment): A binary indicator of whether the user was exposed to a particular ad campaign (1 = saw the ad, 0 = did not see the ad). In our simulation ~30% of users get exposed, but importantly, this is not random. 
  • Age: User age in years (ranging from 18 to 65 in our simulated data). 
  • Prior Engagement: A score or count representing the user’s recent engagement on the platform. For example, this could be the number of posts/interactions last week on a 0–10 scale (0 = not engaged, 10 = highly engaged). 
  • Device: A categorical variable for primary device used (we’ll simplify to Mobile vs. Desktop). Let’s say about 70% of users use mobile. 

We construct the exposure in a biased way: we’ll assume the platform’s ad delivery algorithm tends to show the ad more to certain users. Specifically, younger and highly engaged users on mobile are more likely to be exposed to the ad. This reflects a real-world scenario: perhaps the ad campaign targets active mobile users, or active users simply spend more time on the platform and thus have more chances to see the ad. In our simulation, the probability of exposure is generated by a logistic model:

P(exposed = 1 | age, engagement, mobile) = 1 / (1 + exp(−(β₀ + β₁·age + β₂·engagement + β₃·mobile)))

with coefficients chosen such that higher engagement and mobile usage increase the odds of exposure, while age has a slight negative effect (older users are slightly less likely to see the ad). We won’t walk through every line of the simulation, but the short sketch below shows how it might be set up; the key point is that it intentionally builds in confounding: exposed and unexposed users will have different covariate profiles on average.
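As a rough sketch (the distributions and coefficient values below are illustrative assumptions chosen to match the scenario described, not parameters taken from the article):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

# Covariates (illustrative distributions)
age = rng.integers(18, 66, size=n)                     # 18-65 years
engagement = np.clip(rng.normal(5, 2, size=n), 0, 10)  # 0-10 engagement score
is_mobile = rng.binomial(1, 0.7, size=n)               # ~70% mobile users

# Biased (confounded) ad exposure: more likely for young, engaged, mobile users.
# Coefficients are assumptions chosen to give roughly 30% exposure overall.
logit_exposure = -1.5 - 0.02 * (age - 40) + 0.35 * (engagement - 5) + 0.8 * is_mobile
p_exposure = 1 / (1 + np.exp(-logit_exposure))
exposed = rng.binomial(1, p_exposure)

# Outcome: conversion depends on engagement (a confounder) plus a true +5pp ad effect.
p_convert = np.clip(0.05 + 0.015 * engagement + 0.05 * exposed, 0, 1)
converted = rng.binomial(1, p_convert)

df = pd.DataFrame({"age": age, "engagement": engagement, "is_mobile": is_mobile,
                   "exposed": exposed, "converted": converted})
```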

After simulating, we fit a logistic regression to estimate each user’s propensity score (using age, engagement, device as predictors and ad exposure as the target). This gives us a propensity score for every user – basically the model’s guess of how likely that user would be treated, given their traits. Now, before matching, it’s wise to check the distribution of propensity scores in the treated vs. control groups. This helps assess common support – do the groups have overlapping score ranges, or are they totally separated? If there is no overlap (e.g. all treated have higher scores than all control), then no amount of matching can salvage the comparison. In our data, we do see considerable overlap: many users have moderate propensity values regardless of actual exposure, though the exposed group skews higher on average. 
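Continuing from that simulated df, the propensity model and the common-support check might look like the following sketch, which also produces a figure along the lines of the one below:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

covariates = ["age", "engagement", "is_mobile"]

# Propensity model: P(exposed = 1 | covariates), fit by logistic regression
ps_model = LogisticRegression(max_iter=1000)
ps_model.fit(df[covariates], df["exposed"])
df["pscore"] = ps_model.predict_proba(df[covariates])[:, 1]

# Common-support check: compare score distributions for control vs. treated
bins = np.linspace(0, 1, 30)
plt.hist(df.loc[df["exposed"] == 0, "pscore"], bins=bins, alpha=0.5, label="Not exposed (control)")
plt.hist(df.loc[df["exposed"] == 1, "pscore"], bins=bins, alpha=0.5, label="Exposed (treated)")
plt.xlabel("Propensity score")
plt.ylabel("Number of users")
plt.legend()
plt.show()
```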

Fig: Propensity score distribution for users who were exposed to the ad (treated) vs. not exposed (control) 

The histogram above shows the propensity score distributions for the two groups. The blue bars (unexposed controls) are more concentrated at lower scores (left side), indicating many unexposed users had a low likelihood of being shown the ad. The orange bars (exposed group) skew more to the right – these users often had profile characteristics giving them a higher chance of exposure. Crucially, the two distributions overlap significantly in the middle range. This overlap means we should be able to find, for many treated users, at least one untreated user with a similar propensity score. Those are the matches that will form our balanced comparison set. (If there were exposed users with propensity scores higher than any control – an off-support region – we’d have to exclude those from the analysis because we have no comparable control for them.) 

Matching Exposed and Unexposed Users 

With propensity scores in hand, we proceed to match users who saw the ad with users who did not, aiming to pair individuals with similar scores. There are several matching strategies in practice: 

  • Nearest-neighbor Matching: for each treated user, find an untreated user with the closest propensity score. 
  • Caliper Matching: only match treated-control pairs if their score difference is below some threshold (caliper), discarding treated units that don’t have a close enough control. 
  • One-to-many Matching: matches each treated user with multiple controls (or vice versa) to utilize more data, often weighted in analysis. 
  • With or without replacement: controls could be reused for multiple treated matches (with replacement) or each control used at most once (without replacement). 

For simplicity, our example uses 1:1 nearest-neighbor matching without replacement: each ad-exposed user is matched to one unique unexposed user with the most similar propensity score. We ended up matching 316 exposed users to 316 unexposed users, and those 316 pairs form our matched sample (about 63% of the original 1,000 users). Users who didn’t get matched (e.g. some of the lowest-propensity controls and a few highest-propensity treated, if any) are set aside. This kind of matching trades off sample size for quality of comparison – we prefer to drop some data if it means the remaining pairs are apples-to-apples. 
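A greedy version of this 1:1 nearest-neighbor matching can be sketched in a few lines of pandas (dedicated packages such as R’s MatchIt implement more sophisticated variants; the caliper argument here is optional and corresponds to the caliper matching described above):

```python
# Greedy 1:1 nearest-neighbor matching on the propensity score, without replacement.
# Treated users are processed in order of their score; an optional caliper
# (maximum allowed score distance) discards treated users with no close control.
def nearest_neighbor_match(df, score_col="pscore", treat_col="exposed", caliper=None):
    treated = df[df[treat_col] == 1].sort_values(score_col)
    controls = df[df[treat_col] == 0].copy()
    pairs = []
    for t_idx, t_row in treated.iterrows():
        if controls.empty:
            break
        dist = (controls[score_col] - t_row[score_col]).abs()
        c_idx = dist.idxmin()
        if caliper is not None and dist[c_idx] > caliper:
            continue                      # no acceptable control for this treated user
        pairs.append((t_idx, c_idx))
        controls = controls.drop(c_idx)   # without replacement: each control used at most once
    return pairs

pairs = nearest_neighbor_match(df)
matched_idx = [i for pair in pairs for i in pair]
matched_df = df.loc[matched_idx]
```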

Now, the critical question – Did matching balance our covariates? We need to verify that in the matched sample, the exposed and control groups look similar in terms of age, engagement, and device. A common diagnostic is to examine the standardized mean difference (SMD) for each covariate before and after matching – essentially the difference in means between groups, scaled by the pooled standard deviation. As a rule of thumb, an absolute SMD below 0.1 is considered a negligible difference (i.e. good balance). We can also just look at the raw means/proportions to get an intuition. The table below summarizes our covariate balance: 

| Covariate | Exposed (Before Matching) | Unexposed (Before) | Exposed (After Matching) | Unexposed (After) |
|---|---|---|---|---|
| Age (years) | 38.4 | 42.4 | 38.4 | 39.5 |
| Prior Engagement (average) | 6.3 | 4.3 | 6.3 | 6.3 |
| Mobile Device (% users) | 79.7% | 64.8% | 79.7% | 73.7% |

Table: Covariate balance before and after propensity score matching.  
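The standardized mean differences behind this table (and the love plot below) can be computed with a small helper along these lines, using the pooled standard deviation of the two groups:

```python
import numpy as np

def smd(df, covariate, treat_col="exposed"):
    """Standardized mean difference: (mean_treated - mean_control) / pooled SD."""
    t = df.loc[df[treat_col] == 1, covariate]
    c = df.loc[df[treat_col] == 0, covariate]
    pooled_sd = np.sqrt((t.var() + c.var()) / 2)
    return (t.mean() - c.mean()) / pooled_sd

for cov in ["age", "engagement", "is_mobile"]:
    print(f"{cov}: before = {smd(df, cov):.3f}, after = {smd(matched_df, cov):.3f}")
```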

We can visualize the improvement in balance using a love plot (covariate balance plot). Below, each covariate’s imbalance is plotted as a point (the absolute standardized difference between groups) before and after matching: 

Fig: Standardized differences in covariates before vs. after matching 

As the love plot shows, propensity score matching achieved much better balance on the observed covariates. This gives us more confidence that when we compare outcomes between the matched exposed vs. unexposed users, we’re drawing a fair comparison that isn’t driven by pre-existing differences (at least not the observed ones we adjusted for). In our example, mobile device usage still has an absolute SMD around 0.14 post-match, a bit above the 0.1 target – this is a sign that our matching wasn’t perfect for that covariate. In practice, one might address this by trying a caliper (to force closer matches on propensity) or including device in a subsequent outcome regression as an additional adjuster (a technique sometimes called “double adjustment”). 
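A minimal sketch of that double-adjustment step, assuming the matched_df and simulated columns from the earlier sketches, might use an outcome regression on the matched sample:

```python
import statsmodels.formula.api as smf

# Outcome regression on the matched sample: the coefficient on `exposed` is the
# adjusted effect, with residual covariate differences (e.g. device) soaked up
# by the additional terms. A linear probability model keeps the coefficient
# directly interpretable in percentage points; a logit is a common alternative.
double_adj = smf.ols("converted ~ exposed + age + engagement + is_mobile",
                     data=matched_df).fit()
print(double_adj.params["exposed"])
```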

Outcome Analysis: Estimating the Ad’s Effect on Conversions 

Finally, we can measure the impact of ad exposure on the user outcome of interest – say conversion rate (perhaps the probability of clicking the ad or making a purchase). In our simulated data, we’ll assume a scenario where, on average, the ad does have a positive effect on conversion. To make it concrete, suppose the true causal effect is that the ad increases the conversion probability by 5 percentage points (we built this into the simulation). However, because exposure was confounded with engagement, a naïve comparison of conversion rates in the raw data would overstate the effect. Let’s see what the numbers look like: 

  • Unmatched data: Among all users who saw the ad, the conversion rate was 23.7%, compared to 12.9% for those who didn’t see the ad. That’s a +10.8-percentage point difference. If one naively took this at face value, you’d think the ad was hugely effective. But remember, the exposed group contained more highly engaged users, who were likely converting at higher rates even without the ad. 
  • Matched data: In the propensity score matched sample, the exposed users had a 23.7% conversion rate, while their matched unexposed counterparts had about a 15.2% conversion rate. That’s a +8.5-percentage-point lift attributable to the ad in the matched sample. This is notably lower than the naive 10.8 points, reflecting the fact that some of the originally observed gap was due to differences in user characteristics. We’re closer to the true effect (which we set at 5 percentage points in the simulation), though in this run our matched estimate is still a bit high, likely because the residual device imbalance and random noise can still bias the estimate upward. 
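The naive and matched comparisons above reduce to simple differences in group means; continuing with the df and matched_df from the earlier sketches (actual numbers will vary by simulation run):

```python
# Naive (unmatched) comparison: confounded by who gets shown the ad
naive = (df.loc[df["exposed"] == 1, "converted"].mean()
         - df.loc[df["exposed"] == 0, "converted"].mean())

# Matched comparison: exposed users vs. their propensity-matched controls
matched = (matched_df.loc[matched_df["exposed"] == 1, "converted"].mean()
           - matched_df.loc[matched_df["exposed"] == 0, "converted"].mean())

print(f"Naive difference:   {naive:+.3f}")
print(f"Matched difference: {matched:+.3f}")
```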

The key point is that propensity score matching moved us in the right direction – it reduced the bias in our estimate of the ad’s effect. By comparing only comparable users, we got a more realistic estimate of how much conversion uplift the ad exposure causes. In real analyses, you wouldn’t know the “ground truth” effect, but you would see that after matching, the exposed vs. control outcome difference changed (often it shrinks, as in our case). This gives you a sense that selection bias was indeed present and PSM helped adjust for it. 

One should also compute confidence intervals or perform statistical tests on the matched difference, but those details are beyond our scope here. Also note that matching controls to the treated group means we are estimating the Average Treatment Effect on the Treated (ATT) – the effect for the kind of users who actually saw the ad. If some exposed users had to be dropped for lack of common support, the estimate applies only to the matched treated users; in our case nearly all exposed users were matched, so it covers essentially the whole exposed population. Just keep in mind what population your causal estimate applies to. 

Assumptions and Limitations of Propensity Score Matching (PSM) 

PSM is a powerful technique, but it comes with important assumptions and limitations that any data scientist should be aware of: 

  • Observed Covariates Only (No Hidden Bias): Propensity scores can only account for variables you included and measured. This is often stated as the “no unmeasured confounders” or “conditional independence” assumption – essentially, you assume that after controlling for the observed covariates, treatment assignment is as good as random. If there’s some unmeasured factor strongly influencing both ad exposure and conversion (e.g. maybe only particularly savvy users both see the ad and convert), PSM can’t help you there. In our simulation, we included all the confounders in the model by design. In a real scenario, you must think hard about what variables might affect both exposure and outcome and make sure to include them in the propensity model. If you miss a big one, your causal estimates may still be biased. 
  • Model Specification: Even for observed covariates, you must specify the propensity model correctly (e.g. include appropriate interaction terms or nonlinear terms if needed). A mis-specified model might yield propensity scores that don’t fully balance the covariates. Diagnostics like checking each covariate’s balance (as we did) help to reveal if your model was adequate. If not, you may iterate on the model (add polynomial terms, interactions, or use a more flexible ML model) until balance is achieved. 
  • Common Support and Overlap: As noted earlier, PSM requires that for each treated unit, there are similar control units (and vice versa, if targeting ATE). If your treated and control populations are too different with little score overlap, matching will either drop many samples or fail to find good pairs. In such cases, you might restrict your inference to a narrower subgroup or conclude that observational data can’t answer this question without stronger assumptions. Always inspect propensity distributions and consider trimming off regions that lack overlap. 
  • Sample Size: You generally need a decent sample size to get reliable matches. If you only have a few hundred observations, matching algorithms might struggle, or your estimates might be very imprecise. In advertising measurement, where datasets are often large, this is usually less of an issue. 
  • Matching Choices and Data Use: The way you do matching can affect results. Using one control per treated (1:1) vs. 3:1 or 5:1 matching, with or without replacement, choosing a caliper – these are tuning parameters that involve trade-offs. For instance, allowing replacement means a particularly common type of control user might serve as match for multiple treated users (increasing precision but possibly overweighting that profile). A wider caliper (or no caliper) ensures more treated units get matched but with potentially worse quality matches. A tighter caliper improves match quality but at the cost of dropping more treated units. There is no one-size-fits-all; it requires some experimentation and domain judgment. The good practice is to report how many observations were dropped due to matching and test that different reasonable choices don’t wildly change the estimate. 
  • Residual Confounding: Even after matching, as we saw with the device variable, some imbalance can remain. One solution is “double adjustment” – i.e. after matching, you can run a regression on the matched sample to adjust for any residual differences. Because the matched sample is already balanced, this regression is less dependent on model extrapolation and can correct minor imbalances. Another solution is weighting: if exact balance isn’t achieved, you might apply a small weight to some observations to fine-tune balance. These are advanced steps, but worth noting if you aim for the best possible adjustment. 

In summary, PSM is a tool, not magic. It shines when you have rich data on confounders and a scenario where randomization isn’t available. It lets you approximate an experiment and visibly demonstrate that your treatment and control groups are comparable on observed features. However, it doesn’t eliminate all bias – especially bias due to unobserved factors – and it requires careful implementation and validation. If the groups are fundamentally too different, even the fanciest matching won’t save the day. In those cases, you either need to gather more data, identify an instrumental variable, or consider a different study design. 

Where Propensity Scores Fit in the Causal Inference Toolbox 

Propensity score matching is just one approach among many for causal inference with observational data. It addresses one specific problem: how to deal with selection bias on observables by balancing covariates between treated and control groups. It’s worth situating this method in the broader context: 

  • Randomized Controlled Trials (RCTs): Always the gold standard when feasible. If you can randomly hold out a set of users from seeing the ad (a true control group), do it! That directly solves the selection bias problem by design. Propensity score methods are generally a plan B for when RCTs or controlled experiments are not possible due to cost, ethics, or logistical constraints.
  • Other Propensity Score Methods: Matching is one way to use propensity scores, but you can also use them for stratification (e.g. divide users into propensity score quintiles and compare outcomes within each stratum), inverse probability weighting (IPW) (weigh each user by 1/(propensity) for treated or 1/(1–propensity) for controls to create a weighted pseudo-population), or as covariates in outcome regression (a form of doubly-robust adjustment). These all rely on the same underlying propensity model. Each method has its nuances – for instance, IPW can use all data but may yield large variance if some scores are very small or large, whereas matching discards some data but tends to improve covariate balance quite transparently.
  • Difference-in-Differences (DiD): If you have longitudinal data (before/after an intervention) for both treated and control groups, DiD is another technique to control for unobserved time-invariant differences by looking at changes over time. For example, if the ad campaign ran in April and you have user engagement in March (pre) and May (post) for those who saw vs. didn’t see the ad, DiD could be applied. It assumes trends would have been parallel without the treatment. This method answers a slightly different question (it needs time series data and a clear intervention period) and can complement propensity scores or be combined with them (e.g. propensity score matching plus DiD on matched pairs). 
  • Instrumental Variables (IV): If there’s a variable that affects exposure but not directly the outcome (and not through confounders), it can serve as an instrument to tease out causal effects. In advertising, for example, random ad server load or some quasi-random targeting rule might act as an instrument. IV methods relax the “no unmeasured confounders” assumption but introduce their own strong assumptions (exclusion restriction). Propensity scores don’t directly help with IV – it’s an alternate approach when you can find a valid instrument. 
  • Synthetic Controls and Geo Experiments: In cases of market-level or product-level interventions (not user-level), techniques like synthetic control (including Bayesian structural time series, etc.) are used. For instance, comparing regions where an ad campaign ran to similar regions where it didn’t, constructing a weighted combination of control regions to act as a counterfactual. These are more applicable to aggregate causal questions and again are separate from propensity scores (though conceptually also about finding comparable units). 
  • Modern Machine Learning Causal Methods: There is a growing field of causal ML – methods like causal forests, uplift modeling, and double/debiased machine learning. Some of these extend the propensity score concept (e.g. using ML to estimate propensity or to predict counterfactual outcomes). The key for a data scientist is to understand the assumptions each method makes. Propensity score matching is grounded in traditional statistics but is very interpretable and, as we saw, easy to visualize for stakeholders (you can literally show the before/after balance). 

In our social media Ads context, propensity score matching provides a straightforward way to answer, “What is the causal effect of ad exposure on conversion?” when you can’t run a perfect A/B test. It allowed us to use observational logs of who saw the ad and who didn’t and construct a fair comparison to estimate lift. When used properly, PSM can yield estimates close to those from an experiment – but when used naively, or if important confounders are omitted, it can still lead to the wrong conclusions. As one study on Facebook ads demonstrated, observational methods often failed to match experimental results even with many covariates, underscoring the need for robust techniques and careful validation. 

Conclusion 

Propensity score matching is a valuable tool in the data scientist’s arsenal for causal inference. In our example, it helped adjust for biases in ad exposure and gave a more credible estimate of the ad’s impact on user conversions than a raw comparison would have. The process involved formulating a propensity model, matching users, and rigorously checking balance – steps that mirror the scientific rigor of a randomized experiment as much as possible in an observational setting. 

We also highlighted that PSM is not a plug-and-play solution: it rests on assumptions of no hidden bias, requires sufficient data overlap, and only balances what you include in the model. It should be combined with domain knowledge (to choose covariates) and followed by transparent reporting of diagnostics – for example, always report covariate balance and how many users were dropped, so stakeholders can trust the analysis. 

In the broader landscape, propensity scores are one approach to causal analysis among many. In a social media company’s analytics team, one might use PSM for some questions, difference-in-differences for others, or experimentation whenever possible. The common goal is to get closer to true causation and away from mere correlation. By using methods like PSM thoughtfully, data scientists can provide insights such as “Our best estimate is that this ad campaign caused about an 8-9 percentage point increase in conversion rate among the targeted users,” with evidence that they’ve adjusted for major biases. This kind of causal insight is far more actionable than saying “converted users saw the ad more often” (which is confounded). 

In summary, propensity score matching allows us to approximate an RCT using observational data. It’s an excellent technique for anyone in analytics to understand, especially in fields like digital marketing where true experiments may be difficult to implement for every campaign. When you use PSM, be rigorous about your assumptions and checks. Used in the right circumstances, however, it can greatly enhance your ability to draw causal conclusions and make better data-driven decisions in a social media ads context and beyond. 

Author: Dharmateja Priyadarshi Uddandarao 

Dharmateja Priyadarshi Uddandarao is a distinguished data scientist and statistician whose work bridges the gap between advanced analytics and practical economic applications.  

