
Water Cooler Small Talk: Simpson’s Paradox

Is your data tricking you? What can you do about it?

STATISTICS

Image created by the author using GPT-4 / All other images created by the author unless specified otherwise

Water cooler small talk is a special kind of small talk, typically observed in office spaces around a water cooler. There, employees frequently share all kinds of corporate gossip, myths, and legends, inaccurate scientific opinions, indiscreet personal anecdotes, or outright lies. Anything goes. So, in my Water Cooler Small Talk posts, I discuss strange and usually scientifically invalid opinions I’ve overheard in the office that have left me speechless.


Here’s the water cooler moment of today’s post:

Let’s keep it simple – I just need one number that shows the big picture, the aggregated data. There is no need to overcomplicate things…

Sure, but what if the big picture is hiding the real story?🤷🏻 ‍♀️ What if things are in fact just complicated? Business users love the concept of ‘just one number’ – it’s simple, it’s clean, it’s easy to understand. Nonetheless, reality rarely aligns with ‘just one number’. More often than not, the real world is much more complex and nuanced, with layers and layers of information and details, and a single number can’t tell us much about what is really happening.

One of the most fascinating examples of aggregated data failing to tell the full story is Simpson's paradox, which I will explore in detail in this post.

So, buckle up, cause this one’s a ride! 🏇🏻


🍨 DataCream is a newsletter offering data-driven articles and perspectives on data, tech, AI, and ML. If you are interested in these topics subscribe here.

Maria Mouschoutzi, PhD – Medium


But is this really a paradox? 🤨

Simpson's paradox, named after statistician Edward H. Simpson, is a statistical phenomenon in which a trend that appears within multiple data groups reverses or disappears when the data are combined. While this is a rather broad definition covering a variety of cases, Simpson's paradox is fundamentally about how aggregated data can hide or misrepresent subgroup-level patterns, and it can pop up in any field where data is analyzed: medicine, sports, business, social sciences – anything with data. For instance, one of the most famous examples of Simpson's paradox is the UC Berkeley gender bias study, where aggregated admission rates appear to show bias against women, but analysis by department reveals a different story. Another indicative example is this medical study comparing two treatments for kidney stones, where the most effective treatment depends on the level of analysis we choose.

Much like the birthday paradox or the Monty Hall problem, Simpson's paradox is not really a paradox, but rather a veridical paradox. That is, there is no true contradiction in the underlying math and logic; nevertheless, the results feel deeply counterintuitive and self-contradictory. We inherently tend to trust aggregates, implicitly assuming that they provide a reliable representation of what is happening in the underlying data. When this assumption fails, things feel off. This is why Simpson's paradox matters – it is all too easy to rely on surface-level numbers, just because they seem right, without giving any extra thought to what is actually happening. Understanding Simpson's paradox isn't just an intellectual exercise. On the contrary, it has practical implications for data-driven decision-making, allowing us to avoid misleading conclusions and build better models.

Fundamentally, Simpson’s paradox occurs due to significant differences in the underlying subgroups of the data, either in terms of size or some other underlying characteristic. Broadly, the paradox can be attributed to two common causes:

  1. Imbalanced weights across subgroups of the data, skewing the aggregate results
  2. Confounding variables, affecting both independent and dependent variables

1. Aggregating imbalanced weights

Imbalanced data can result in Simpson's paradox when certain subgroups contribute disproportionately to the overall dataset, overshadowing the patterns of the smaller subgroups. In other words, some subgroups are over-represented, whereas others are under-represented, leading to a misleading aggregated view.

Imagine a company evaluating the productivity of two employees, John and Jane, based on the number of tasks they complete in two projects – Project 1 and Project 2.

No wonder Jane was furious in her performance review!🔥 Despite achieving a higher completion rate in both projects – 25.3% vs. 25.0% in Project 1 and 32.1% vs. 31.4% in Project 2 – the aggregated data suggest that John outperformed her overall (31.0% vs. 27.0%). How is this possible?

This contradiction arises because John handled a disproportionately large number of tasks in Project 2, where his completion rate was higher. Meanwhile, Jane’s stronger performance in each individual project is overshadowed by the weighting of the total number of tasks across projects. If we were to only look at the aggregated numbers, we’d be misled into thinking John was the stronger performer overall.

This is why disaggregating the data and analyzing subgroup-level patterns before drawing any conclusions is so important. After identifying such an issue, we can effectively address it by using normalized metrics. By normalizing the data, we balance the contributions of the different subgroups, ensuring that each subgroup is equally represented in the overall results.

Back to John and Jane's performance evaluation: we can choose to assign equal weight to both projects, regardless of the task distribution, simply by averaging the completion rates of Project 1 and Project 2. From this perspective, Jane performs better (28.7% vs. 28.2%), because her superior completion rates in both projects are valued equally. Nevertheless, this is not necessarily a more correct approach – it all comes down to what we are interested in measuring at the end of the day. What's critical to understand is that different aggregation methods – weighted vs. normalized, for example – yield different results, conclusions, and interpretations.
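To see the arithmetic concretely, here is a minimal sketch in Python. The task counts below are my own assumption, chosen only to approximately reproduce the completion rates quoted above – they are not actual figures from the example:

```python
import pandas as pd

# hypothetical task counts (assumed for illustration only);
# chosen so the completion rates roughly match the ones quoted in the text
tasks = pd.DataFrame({
    "employee":  ["Jane", "Jane", "John", "John"],
    "project":   ["Project 1", "Project 2", "Project 1", "Project 2"],
    "completed": [76, 32, 20, 377],
    "total":     [300, 100, 80, 1200],
})
tasks["rate"] = tasks["completed"] / tasks["total"]

# per-project rates: Jane is ahead in both projects
print(tasks.pivot(index="employee", columns="project", values="rate").round(3))

# weighted (pooled) rate: John appears to win overall
totals = tasks.groupby("employee")[["completed", "total"]].sum()
pooled = totals["completed"] / totals["total"]
print(pooled.round(3))  # Jane 0.270, John 0.310

# normalized rate: equal weight per project, and the ranking flips back to Jane
normalized = tasks.groupby("employee")["rate"].mean()
print(normalized.round(3))
```

Which aggregate is 'right' depends on the question: the pooled rate answers "what fraction of all assigned tasks got done?", while the normalized rate answers "how well does each person do on a typical project?".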


2. Correlation does not mean causation

Another cause of Simpson's paradox is our tendency to mix up correlation and causation. By definition, correlation measures the degree to which two variables move together. Causation, in contrast, means that a change in one variable directly produces a change in another.

Of course, a correlation may or may not reflect causation. From merely observing a correlation between X and Y, we cannot tell whether X causes Y, Y causes X, or a third, confounding variable Z causes both X and Y. To claim causation between X and Y we need more: a plausible explanation of the mechanism by which X causes Y, X chronologically preceding Y, assurance that there are no other confounding variables, and ideally experimental evidence.

However, most of these are rather a luxury when one needs to analyze a fixed, pre-existing dataset. In our quest to understand what is happening and construct a meaningful story, we love to see causation wherever we can and frivolously draw conclusions. Ultimately, it seems that we naturally care more about a story that makes sense, rather than a story that is true.

We can easily create some dummy data to further illustrate this in Python. Suppose we want to explore the relationship between the number of hours of remote work allowed per employee per week, and some kind of collaboration score. Let’s also assume that the employees belong to three distinct roles – Data Scientists, Project Managers, and Sales Representatives.

import numpy as np
import pandas as pd

np.random.seed(42)

# generate dummy data where collaboration depends on hours of remote work
group_a = pd.DataFrame({
    "Group": "Data Scientists",
    "Remote Work Hours": np.random.uniform(8, 10, 100),  
})
group_a["Collaboration Score"] = 2 + (2 * group_a["Remote Work Hours"]) + np.random.normal(0, 1.5, 100) 

group_b = pd.DataFrame({
    "Group": "Project Managers",
    "Remote Work Hours": np.random.uniform(7, 9, 100),
})
group_b["Collaboration Score"] = 6 + (2 * group_b["Remote Work Hours"]) + np.random.normal(0, 1.5, 100)

group_c = pd.DataFrame({
    "Group": "Sales Representatives",
    "Remote Work Hours": np.random.uniform(6, 8, 100),
})
group_c["Collaboration Score"] = 10 + (2 * group_c["Remote Work Hours"]) + np.random.normal(0, 1.5, 100)

# combine the three groups into a single dataframe
data = pd.concat([group_a, group_b, group_c], ignore_index=True)

We can then create a scatter plot of the remote work hours versus the collaboration score with Plotly.

import plotly.express as px

# scatter plot w single trendline
fig = px.scatter(
    data,
    x="Remote Work Hours",
    y="Collaboration Score",
    trendline="ols",  
    title="Collaboration Score vs. Remote Work Hours (Single Regression Line)"
)

fig.update_traces(marker=dict(color="blue"))

fig.update_layout(
    xaxis_title="Remote Work Hours (per week)",
    yaxis_title="Collaboration Score",
    height=600,
    width=1000
)

fig.show()

Ah, it is obvious! The more hours the employees work remotely the less they collaborate. We should completely ban working remotely and bring everyone back to the office! 👍

But, let’s also incorporate the employee job titles in the plot.

# scatter plot w multiple trendlines
fig = px.scatter(
    data,
    x="Remote Work Hours",
    y="Collaboration Score",
    color="Group",
    trendline="ols", 
    title="Collaboration Score vs. Remote Work Hours with Controlled Slopes"
)

fig.update_layout(
    xaxis_title="Remote Work Hours (per week)",
    yaxis_title="Collaboration Score",
    height=600,
    width=1000
)

fig.show()

Oops!

It seems that the correlation is the reverse of what we initially thought. Now that we have also incorporated the dimension of employee roles in our plot, it seems that the more hours the employees work remotely, the more they collaborate.

So, what is going on?

Seeing the first chart, we immediately notice the correlation between remote work hours and collaboration. Depending on our feelings towards remote work, we can very easily interpret this correlation as causation – more hours of remote work cause less collaboration among employees.

In the second chart, as the employee group is introduced with color coding, it becomes clear that it is a confounding variable. That is, the employee group is what causes both the number of hours of remote work, and the collaboration score. For instance, data scientist roles often require long hours of focused, independent and individual work, where collaboration is less critical. Thus, such a role naturally needs less collaboration and can take advantage of more hours of remote work. On the contrary, a sales representative role relies heavily on team interaction and in-person collaboration, often requiring in-office or in-field presence. As a result, such a role inherently needs higher collaboration and cannot work that many hours from home.

In general, this issue can be treated by identifying variables that may be influencing both the independent and dependent variables, and incorporating them into the analysis or model. By doing so, we can account for their impact and uncover the true relationship between the variables of interest.
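As a rough sketch of this idea (regenerating the dummy data from earlier so the snippet stands alone, and using a simple within-group 'demeaning' transformation as a stand-in for a full regression model with a group term):

```python
import numpy as np
import pandas as pd

np.random.seed(42)

# regenerate the dummy data: three roles with different baseline scores
# and different typical ranges of remote work hours
frames = []
for group, base, low, high in [
    ("Data Scientists", 2, 8, 10),
    ("Project Managers", 6, 7, 9),
    ("Sales Representatives", 10, 6, 8),
]:
    hours = np.random.uniform(low, high, 100)
    score = base + 2 * hours + np.random.normal(0, 1.5, 100)
    frames.append(pd.DataFrame(
        {"Group": group, "Remote Work Hours": hours, "Collaboration Score": score}
    ))
data = pd.concat(frames, ignore_index=True)

# pooled slope: one regression line through all employees, ignoring roles
pooled_slope = np.polyfit(
    data["Remote Work Hours"], data["Collaboration Score"], 1
)[0]

# within-group slope: subtract each role's mean from both variables, then refit;
# this removes the between-role differences, i.e. it controls for the confounder
centered = data.groupby("Group")[["Remote Work Hours", "Collaboration Score"]].transform(
    lambda s: s - s.mean()
)
within_slope = np.polyfit(
    centered["Remote Work Hours"], centered["Collaboration Score"], 1
)[0]

print(f"pooled slope: {pooled_slope:.2f}")   # negative: remote work seems to hurt collaboration
print(f"within slope: {within_slope:.2f}")   # positive: the opposite, once roles are accounted for
```

Adding the group as a covariate in a regression model achieves the same adjustment; either way, the sign of the estimated relationship flips once the confounder is in the model.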

Then again, why complicate things? Maybe we should just show the first plot and bring everyone back to the office. 🙃

On my mind

It’s easy to toss around some SQL, cook a few numbers, and then settle on ‘just one number‘, to justify the things we’ve already decided to believe. However, real life rarely is this simple and straightforward – reality is much more nuanced and complex, often requiring us to dig deeper to uncover meaningful insights and tell stories that may be inconvenient and challenge our assumptions.

In general, aggregating data removes subgroup distinctions and may hide meaningful differences and relationships that are visible at the subgroup level. This is why disaggregating the data should always be a first and non-negotiable step in any analysis. It allows us to analyze subgroup-level patterns that may otherwise be hidden before drawing any conclusions.

In the words of Prof. Jordan Ellenberg, 'there's no contradiction involved, just two different ways to think about the same data'. There is no need to choose between aggregated and disaggregated data – instead, we should incorporate both perspectives to achieve a more nuanced and accurate representation of reality. Ultimately, Simpson's paradox underscores the importance of context in data analysis. Data aggregation without an understanding of subgroup dynamics can hide important subgroup-level relationships and lead us to conclusions and decisions that don't align with what is really happening.


✨Thank you for reading!✨


Loved this post?

💌 Join me on Substack or LinkedIn ☕, or Buy me a coffee!

or, take a look at my other water cooler small talks:

Water Cooler Small Talk: Why Does the Monty Hall Problem Still Bother Us? 🐐🚗 – A look at the counterintuitive mathematics of game show puzzles (towardsdatascience.com)

Water Cooler Small Talk: The Birthday Paradox 🎂🎉 – A look at the counterintuitive mathematics of shared birthdays (towardsdatascience.com)


Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.
