Suppose a social media platform’s Ads analytics team wants to know: Does seeing a certain ad (or promoted post) cause users to convert or engage more? This causal question is tricky because users who see the ad might inherently differ from those who don’t. In practice, simply comparing conversion rates of exposed vs. unexposed users can be very misleading. Ad exposure is not randomly assigned – algorithms may show ads more to highly active users, or users self-select into seeing or clicking ads. As a result, “unobservable factors make exposure endogenous,” meaning there are hidden biases in who sees the ad. Ideally, we’d run a randomized controlled trial (RCT) (e.g. hold out a control group who never sees the ad) to measure the causal effect. But often RCTs aren’t feasible for broad ad campaigns. This is where propensity score matching (PSM) comes in – it’s basically a statistical way to create apples-to-apples comparisons when you can’t run a proper A/B test.
In this article, we'll walk through how a data scientist in a social media Ads division can use propensity scores to estimate the impact of ad exposure on user conversion. We'll use a simulated dataset of users with information like age, prior engagement, and device type, and we'll demonstrate how to:
- simulate a user dataset in which ad exposure is confounded with user characteristics,
- estimate propensity scores with a logistic regression model,
- match exposed users to comparable unexposed users,
- check covariate balance and common support, and
- estimate the ad's effect on conversion from the matched sample.
In an RCT, random assignment of exposure would ensure the exposed and control groups are statistically equivalent (on both observed and unobserved factors) before the treatment. Propensity scores aim to mimic that balance using observational data. Formally, the propensity score is defined as the probability of treatment assignment (here, ad exposure) conditional on observed covariates. In plain terms, it’s each user’s predicted likelihood of seeing the ad given their profile (age, engagement, device, etc.). By matching or adjusting on this single score, we ideally achieve a situation as if we had randomized who sees the ad. Rosenbaum and Rubin (1983) showed that the propensity score is a “balancing score” – conditional on users having the same score, their observed covariates should be balanced between exposed and unexposed groups.
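In symbols, writing T for the ad-exposure indicator and X for the observed covariates, the definition and the balancing property can be written as:

$$ e(x) = \Pr(T = 1 \mid X = x), \qquad X \perp T \mid e(X) $$

The second statement is the balancing property: among users who share the same propensity score, the covariates no longer predict who was actually exposed.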
How do we estimate propensity scores? The most common approach is to train a logistic regression model to predict the probability of receiving the treatment based on covariates. In our case, we’ll model the probability a user was shown the ad as a function of their attributes. (More complex machine learning models like random forests or gradient boosting could also be used for propensity estimation, especially if there are nonlinearities, but logistic regression is a good starting point.) Each user then gets a score between 0 and 1 – for example, a very active 25-year-old mobile user might have an 80% predicted chance of seeing the ad, whereas a less engaged older desktop user might have only 10%.
Why not just control for the covariates directly? In principle, you could run a regression of conversion on ad exposure plus all the covariates. That is another valid approach, and in fact propensity score methods are asymptotically equivalent to regression adjustment under certain conditions. The advantage of propensity scores is primarily in diagnostics and study design: PSM forces you to check balance and overlap between groups before looking at outcomes. It helps illustrate whether you have comparable groups, whereas a straight regression might mask a lack of overlap or extrapolate into regions with no data. In short, propensity score matching tackles selection bias head-on by explicitly pairing or weighting users to create a pseudo-experiment.
To make this concrete, let's simulate a dataset for our social media platform scenario. Imagine we have 1,000 users with the following characteristics:
- Age (in years),
- Prior engagement (an activity score summarizing how active the user has been on the platform), and
- Device type (mobile vs. desktop).
We construct the exposure in a biased way: we'll assume the platform's ad delivery algorithm tends to show the ad more to certain users. Specifically, younger and highly engaged users on mobile are more likely to be exposed to the ad. This reflects a real-world scenario: perhaps the ad campaign targets active mobile users, or active users simply spend more time on the platform and thus have more chances to see the ad. In our simulation, the probability of exposure is generated by a logistic model:

$$ \Pr(\text{exposed} = 1 \mid X) = \operatorname{logistic}(\beta_0 + \beta_1\,\text{engagement} + \beta_2\,\text{mobile} + \beta_3\,\text{age}) $$
with coefficients chosen such that indeed higher engagement and mobile usage increase the odds of exposure, while age has a slight negative effect (older users slightly less likely to see the ad). We won’t go into the code here but suffice it to say our simulation intentionally builds in confounding: exposed and unexposed users will have different covariate profiles on average.
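For readers who want to experiment, a minimal sketch of this kind of data-generating process might look like the following. The distributions and coefficient values here are illustrative assumptions, not the exact ones behind the figures in this article:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000

# Covariates: age, prior engagement score, and device type (1 = mobile, 0 = desktop)
age = rng.normal(40, 12, n).clip(18, 75)
engagement = rng.gamma(shape=2.0, scale=2.5, size=n)   # right-skewed activity score
mobile = rng.binomial(1, 0.7, n)

# Confounded exposure: more engaged, mobile, and younger users see the ad more often
logit_exposure = -2.0 + 0.35 * engagement + 0.8 * mobile - 0.02 * age
exposed = rng.binomial(1, 1 / (1 + np.exp(-logit_exposure)))

# Outcome: conversion depends on the same covariates, plus a true +5 pp effect of the ad
p_convert = (0.05 + 0.015 * engagement + 0.03 * mobile + 0.05 * exposed).clip(0, 1)
converted = rng.binomial(1, p_convert)

df = pd.DataFrame({"age": age, "engagement": engagement, "mobile": mobile,
                   "exposed": exposed, "converted": converted})
```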
After simulating, we fit a logistic regression to estimate each user’s propensity score (using age, engagement, device as predictors and ad exposure as the target). This gives us a propensity score for every user – basically the model’s guess of how likely that user would be treated, given their traits. Now, before matching, it’s wise to check the distribution of propensity scores in the treated vs. control groups. This helps assess common support – do the groups have overlapping score ranges, or are they totally separated? If there is no overlap (e.g. all treated have higher scores than all control), then no amount of matching can salvage the comparison. In our data, we do see considerable overlap: many users have moderate propensity values regardless of actual exposure, though the exposed group skews higher on average.
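A sketch of this estimation step, continuing from the simulated `df` above (scikit-learn is used here, but statsmodels or any other logistic regression implementation works just as well):

```python
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

X = df[["age", "engagement", "mobile"]]
ps_model = LogisticRegression(max_iter=1000).fit(X, df["exposed"])
df["ps"] = ps_model.predict_proba(X)[:, 1]   # propensity score = P(exposed | covariates)

# Common-support check: compare the score distributions of the two groups
print(df.groupby("exposed")["ps"].describe()[["min", "25%", "50%", "75%", "max"]])

# Overlaid histograms of propensity scores (the figure below)
for group, label in [(0, "Unexposed (control)"), (1, "Exposed (treated)")]:
    plt.hist(df.loc[df.exposed == group, "ps"], bins=30, alpha=0.5, label=label)
plt.xlabel("Propensity score"); plt.ylabel("Number of users"); plt.legend(); plt.show()
```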
Fig: Propensity score distribution for users who were exposed to the ad (treated) vs. not exposed (control)
The histogram above shows the propensity score distributions for the two groups. The blue bars (unexposed controls) are more concentrated at lower scores (left side), indicating many unexposed users had a low likelihood of being shown the ad. The orange bars (exposed group) skew more to the right – these users often had profile characteristics giving them a higher chance of exposure. Crucially, the two distributions overlap significantly in the middle range. This overlap means we should be able to find, for many treated users, at least one untreated user with a similar propensity score. Those are the matches that will form our balanced comparison set. (If there were exposed users with propensity scores higher than any control – an off-support region – we’d have to exclude those from the analysis because we have no comparable control for them.)
With propensity scores in hand, we proceed to match users who saw the ad with users who did not, aiming to pair individuals with similar scores. There are several matching strategies in practice:
- Nearest-neighbor matching, pairing each treated user with the control whose score is closest (with or without replacement),
- Caliper matching, which rejects pairs whose scores differ by more than a chosen threshold,
- Stratification, which groups users into propensity-score bins and compares outcomes within bins, and
- Weighting approaches (such as inverse probability weighting), which reweight users by their scores instead of forming explicit pairs.
For simplicity, our example uses 1:1 nearest-neighbor matching without replacement: each ad-exposed user is matched to one unique unexposed user with the most similar propensity score. We ended up matching 316 exposed users to 316 unexposed users, and those 316 pairs form our matched sample (about 63% of the original 1,000 users). Users who didn’t get matched (e.g. some of the lowest-propensity controls and a few highest-propensity treated, if any) are set aside. This kind of matching trades off sample size for quality of comparison – we prefer to drop some data if it means the remaining pairs are apples-to-apples.
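Dedicated packages (R's MatchIt, for example) handle this step robustly, but a greedy 1:1 nearest-neighbor matcher without replacement is short enough to sketch here, continuing from the `df` with propensity scores above:

```python
treated = df[df.exposed == 1].sort_values("ps")
control = df[df.exposed == 0]

available = control.copy()
pairs = []
for _, t in treated.iterrows():
    if available.empty:
        break
    # Pick the unexposed user with the closest propensity score...
    j = (available["ps"] - t["ps"]).abs().idxmin()
    pairs.append((t.name, j))
    # ...and remove them from the pool (matching without replacement)
    available = available.drop(j)

matched = df.loc[[i for pair in pairs for i in pair]]
print(f"{len(pairs)} matched pairs, {len(matched)} users in the matched sample")
```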
Now, the critical question – Did matching balance our covariates? We need to verify that in the matched sample, the exposed and control groups look similar in terms of age, engagement, and device. A common diagnostic is to examine the standardized mean difference (SMD) for each covariate before and after matching – essentially the difference in means between groups, scaled by the pooled standard deviation. As a rule of thumb, an absolute SMD below 0.1 is considered a negligible difference (i.e. good balance). We can also just look at the raw means/proportions to get an intuition. The table below summarizes our covariate balance (a short code sketch for computing the SMDs follows the table):
| Covariate | Exposed (Before Matching) | Unexposed (Before) | Exposed (After Matching) | Unexposed (After) |
|---|---|---|---|---|
| Age (years) | 38.4 | 42.4 | 38.4 | 39.5 |
| Prior Engagement (avg. score) | 6.3 | 4.3 | 6.3 | 6.3 |
| Mobile Device (% of users) | 79.7% | 64.8% | 79.7% | 73.7% |
Table: Covariate balance before and after propensity score matching.
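The SMDs behind a table like this take only a few lines to compute. Here is one way to do it with pandas and NumPy, reusing the `df` and `matched` data frames from the earlier sketches (the `smd` helper is just an illustrative name):

```python
import numpy as np

def smd(data, covariate, treatment="exposed"):
    """Absolute standardized mean difference between treated and control groups."""
    t = data.loc[data[treatment] == 1, covariate]
    c = data.loc[data[treatment] == 0, covariate]
    pooled_sd = np.sqrt((t.var() + c.var()) / 2)
    return abs(t.mean() - c.mean()) / pooled_sd

for cov in ["age", "engagement", "mobile"]:
    print(f"{cov:>12}: SMD before = {smd(df, cov):.3f}, after = {smd(matched, cov):.3f}")
```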
We can visualize the improvement in balance using a love plot (covariate balance plot). Below, each covariate’s imbalance is plotted as a point (the absolute standardized difference between groups) before and after matching:
Fig: Standardized differences in covariates before vs. after matching
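A love plot is essentially a dot chart of those SMD values. A minimal matplotlib version, reusing the `smd` helper from the previous sketch, could look like this:

```python
import matplotlib.pyplot as plt

covariates = ["age", "engagement", "mobile"]
before = [smd(df, c) for c in covariates]
after = [smd(matched, c) for c in covariates]

plt.scatter(before, covariates, label="Before matching")
plt.scatter(after, covariates, label="After matching")
plt.axvline(0.1, linestyle="--", color="grey")   # the 0.1 rule-of-thumb threshold
plt.xlabel("Absolute standardized mean difference")
plt.legend()
plt.tight_layout()
plt.show()
```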
As the love plot shows, propensity score matching achieved much better balance on the observed covariates. This gives us more confidence that when we compare outcomes between the matched exposed vs. unexposed users, we’re drawing a fair comparison that isn’t driven by pre-existing differences (at least not the observed ones we adjusted for). In our example, mobile device usage still has an absolute SMD around 0.14 post-match, a bit above the 0.1 target – this is a sign that our matching wasn’t perfect for that covariate. In practice, one might address this by trying a caliper (to force closer matches on propensity) or including device in a subsequent outcome regression as an additional adjuster (a technique sometimes called “double adjustment”).
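One simple way to implement that "double adjustment" idea is an outcome regression on the matched sample, with the still-imbalanced covariate included as a regressor. A linear probability model is used in this sketch purely for readability:

```python
import statsmodels.formula.api as smf

# Regress conversion on exposure plus device within the matched sample
double_adj = smf.ols("converted ~ exposed + mobile", data=matched).fit()
print(double_adj.params["exposed"])   # exposure effect, as a difference in conversion probability
```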
Finally, we can measure the impact of ad exposure on the user outcome of interest – say conversion rate (perhaps the probability of clicking the ad or making a purchase). In our simulated data, we’ll assume a scenario where, on average, the ad does have a positive effect on conversion. To make it concrete, suppose the true causal effect is that the ad increases the conversion probability by 5 percentage points (we built this into the simulation). However, because exposure was confounded with engagement, a naïve comparison of conversion rates in the raw data would overstate the effect. Let’s see what the numbers look like:
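In code, both comparisons are one-liners (continuing from the earlier sketches; the exact values will depend on the simulation seed):

```python
naive = (df.loc[df.exposed == 1, "converted"].mean()
         - df.loc[df.exposed == 0, "converted"].mean())

matched_att = (matched.loc[matched.exposed == 1, "converted"].mean()
               - matched.loc[matched.exposed == 0, "converted"].mean())

print(f"Naive difference in conversion rates:         {naive:+.3f}")
print(f"Matched (ATT) difference in conversion rates: {matched_att:+.3f}")
```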
The key point is that propensity score matching moved us in the right direction – it reduced the bias in our estimate of the ad’s effect. By comparing only comparable users, we got a more realistic estimate of how much conversion uplift the ad exposure causes. In real analyses, you wouldn’t know the “ground truth” effect, but you would see that after matching, the exposed vs. control outcome difference changed (often it shrinks, as in our case). This gives you a sense that selection bias was indeed present and PSM helped adjust for it.
One should also compute confidence intervals or perform statistical tests on the matched difference, but those details are beyond our scope here. Additionally, if some exposed users had to be dropped due to no matches (lack of common support), you’d technically be estimating the Average Treatment Effect on the Treated (ATT). In our case, since almost all exposed were matched, ATT and ATE are about the same. Just keep in mind what population your causal estimate applies to.
PSM is a powerful technique, but it comes with important assumptions and limitations that any data scientist should be aware of:
- No unmeasured confounding: matching can only balance the covariates you observe and include in the model; hidden factors that drive both exposure and conversion will still bias the estimate.
- Common support: treated and control users must overlap in their propensity scores; users without comparable counterparts have to be dropped, which changes the population the estimate applies to.
- Model dependence: the quality of the matches depends on a reasonably specified propensity model, and misspecification can leave residual imbalance.
- Reduced sample size: discarding unmatched users trades statistical power for comparability.
- Balance must be verified, not assumed: diagnostics like SMDs and love plots are part of the method, not optional extras.
In summary, PSM is a tool, not magic. It shines when you have rich data on confounders and a scenario where randomization isn’t available. It lets you approximate an experiment and visibly demonstrate that your treatment and control groups are comparable on observed features. However, it doesn’t eliminate all bias – especially bias due to unobserved factors – and it requires careful implementation and validation. If the groups are fundamentally too different, even the fanciest matching won’t save the day. In those cases, you either need to gather more data, identify an instrumental variable, or consider a different study design.
Propensity score matching is just one approach among many for causal inference with observational data. It addresses one specific problem: how to deal with selection bias on observables by balancing covariates between treated and control groups. It's worth situating this method in the broader context:
- Regression adjustment and inverse probability weighting tackle the same observables problem, and doubly robust estimators combine an outcome model with a propensity model.
- Difference-in-differences exploits before/after data on treated and untreated groups when a parallel-trends assumption is plausible.
- Instrumental variables can handle unobserved confounding if a valid instrument for exposure exists.
- Randomized experiments (A/B tests and holdout groups) remain the gold standard whenever they are feasible.
In our social media Ads context, propensity score matching provides a straightforward way to answer, “What is the causal effect of ad exposure on conversion?” when you can’t run a perfect A/B test. It allowed us to use observational logs of who saw the ad and who didn’t and construct a fair comparison to estimate lift. When used properly, PSM can yield estimates close to those from an experiment – but when used naively, or if important confounders are omitted, it can still lead to the wrong conclusions. As one study on Facebook ads demonstrated, observational methods often failed to match experimental results even with many covariates, underscoring the need for robust techniques and careful validation.
Propensity score matching is a valuable tool in the data scientist’s arsenal for causal inference. In our example, it helped adjust for biases in ad exposure and gave a more credible estimate of the ad’s impact on user conversions than a raw comparison would have. The process involved formulating a propensity model, matching users, and rigorously checking balance – steps that mirror the scientific rigor of a randomized experiment as much as possible in an observational setting.
We also highlighted that PSM is not a plug-and-play solution: it rests on assumptions of no hidden bias, requires sufficient data overlap, and only balances what you include in the model. It should be combined with domain knowledge (to choose covariates) and followed by transparent reporting of diagnostics – for example, always report covariate balance and how many users were dropped, so stakeholders can trust the analysis.
In the broader landscape, propensity scores are one approach to causal analysis among many. In a social media company’s analytics team, one might use PSM for some questions, difference-in-differences for others, or experimentation whenever possible. The common goal is to get closer to true causation and away from mere correlation. By using methods like PSM thoughtfully, data scientists can provide insights such as “Our best estimate is that this ad campaign caused about an 8-9 percentage point increase in conversion rate among the targeted users,” with evidence that they’ve adjusted for major biases. This kind of causal insight is far more actionable than saying “converted users saw the ad more often” (which is confounded).
In summary, propensity score matching allows us to approximate an RCT using observational data. It’s an excellent technique for anyone in analytics to understand, especially in fields like digital marketing where true experiments may be difficult to implement for every campaign. When you use PSM, be rigorous about your assumptions and checks. Used in the right circumstances, however, it can greatly enhance your ability to draw causal conclusions and make better data-driven decisions in a social media ads context and beyond.
Dharmateja Priyadarshi Uddandarao is a distinguished data scientist and statistician whose work bridges the gap between advanced analytics and practical economic applications.