Publish AI, ML & data-science insights to a global community of data professionals.

The Price of Gold: Is Olympic Success Reserved for the Wealthy?🥇

Analyzing 30 years of Olympic Games medals distribution and national wealth indicators

Image generated by the author using GPT-4o
Image generated by the author using GPT-4o

A lifetime ago, I was something of an athlete myself – when I was a little girl I did a little bit of rhythmic gymnastics. During the Athens 2004 Olympics, I remember sitting in front of the TV in awe for endless hours to watch each and every performance of rhythmic gymnastics, with Russia’s two athletes winning both gold and silver medals.

To this day, I still enjoy very much watching all kinds of gymnastics – I did enjoy every second of Simone Biles’s performances, winning a bunch of medals in Paris 2024 Olympics. Nevertheless, what didn’t ring a bell then, but does now, is how common it is to see rich and powerful countries winning a ton of medals in such sports events. It is really surprising when an athlete from a not-so-well-off country makes it to the medal podium. Think for instance of Algeria’s Imane Khelif in boxing, or Pakistan’s Arshad Nadeem in javelin throwing, both winning gold medals in their sports in a rather unexpected turn of events.

Is this all it takes to win an Olympic medal – being born in a wealthy and powerful country? Most definitely not – it certainly takes a lot of effort, dedication, pain, sacrificing, and tolerating a bunch of things, to the point a non-professional athlete cannot even imagine.

🍨 DataCream is a newsletter offering data-driven articles and perspectives on data, tech, AI, and ML. If you are interested in these topics subscribe here.

But does it help at all? Oh, I bet it does. Around 2010, when I was about 15 years old, I even got to be a member of the group rhythmic gymnastics Greek team. Since the Greek economy was in the middle of a major recession at the time, this was a rather grounding experience – the gymnastics federation owed more than a year of wage money to the coaches, there was no heating or air conditioning in the gym, supporting facilities like showers and changing rooms were a mess, and doctors and physical therapists helped the athletes literally out of the goodness of their hearts. As you can imagine, I didn’t get a single penny for my time there.💸 After spending some years in gymnastics hell, I was more than motivated to dedicate myself to getting a solid education and becoming a STEM girlie.

Your author on the rhythmic gymnastics mat, back in the day 🎀  (Image by author)
Your author on the rhythmic gymnastics mat, back in the day 🎀 (Image by author)

But not everyone gets a STEM education alternative, or any alternative at all. Not all athletes get to become superstars like Simone Biles or outliers like Arshad Nadeem. In fact, many athletes just get to work with what limited resources they have, and if they are truly phenomenal at their sport, they get some kind of commercial sponsorship to pay the bills. Nevertheless, what the majority gets in a country like Greece is just unfulfilled potential – what could have been but never really was or will be. Certainly, money cannot be a substitute for talent or hard work, but it really can help athletes in so many ways: coaching, facilities, healthcare, being a truly full-time athlete – anything really.

So in this post, I gathered some data about Olympic medals and country indicators, and tried to find out:

Is success in the Olympic Games correlated with national wealth indicators?

Let’s see! 🤸🏻 ‍♀️

Are Olympic champions born in wealth?

Setting up the environment 💻

As usually, the first thing we need to do is open a new Jupyter Lab notebook – I will be using Jupyter Lab because it plays out well with Plotly visualization library. Before everything else, we need to ensure that the essential libraries are installed. These will be scikit-learn (part of sklearn) for machine learning models, and Plotly for visualizations, and we can easily install them using pip:

pip install plotly scikit-learn

Next, we import these libraries into our Jupyter Lab notebook, along with pandas, sqlite3, numpy and math, which are going to assist in our analysis:

import sqlite3
import pandas as pd
import numpy as np
import math

import plotly.express as px
import plotly.graph_objects as go

from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

Getting the data together‍🍳

I will be utilizing this Kaggle dataset which contains info for the number of medals won per country for all Olympic Games from 1994 to 2024. The dataset is licensed under Apache 2.0, which allows commercial use.

The data is organized in a SQLite database with information for each Olympic Games event being stored in a separate table. We can easily access the database and the respective tables in Python by:

connection = sqlite3.connect('Olympic_Games_(1994-2024).db')
table_query = "SELECT name FROM sqlite_master WHERE type='table';"
tables = pd.read_sql_query(table_query, connection)

Nonetheless, the provided dataset does not include the info about whether each event was a Summer or Winter Olympic Games, thus I tried to add this info manually.

# Years for Summer and Winter Olympics
summer_years = {
'1996', '2000', '2004', '2008', '2012', '2016', '2020', '2024'
}
winter_years = {
 '1994', '1998', '2002', '2006', '2010', '2014', '2018', '2022'
}

… and then I combine everything into a single dataframe…

all_tables_list = []

# Loop through all tables and load each into a dataframe
for table_name in tables['name']:

    query = f"select * from {table_name}"
    df = pd.read_sql_query(query, connection)
    parts = table_name.split('_')
    city = parts[0]  
    year = next((part for part in parts if part.isdigit() and len(part) == 4), None)  # Find the first 4-digit year

    if year in summer_years:
        season = 'Summer'
    elif year in winter_years:
        season = 'Winter' 

    df['City'] = city
    df['Year'] = year
    df['Season'] = season
    df['initial_table_name'] = table_name

    # Append dataframe to the list
    all_tables_list.append(df)

combined_df = pd.concat(all_tables_list, ignore_index=True)
connection.close()

We can then take a glimpse into the dataframe we have just created by:

combined_df.head()
Image by author
Image by author

I will also utilize the GDP, GDP per capita, population size and Poverty Headcount Ratio per country as provided by the World Bank Open Data, licensed under CC-BY 4.0.

GDP (Gross Domestic Product) is the total monetary value of all goods and services produced within a country’s borders over a specific period, and is considered a key indicator of a country’s economy.

Respectively, GDP per capita represents the average economic output per person in a given country. In this way it provides an approximate measure of the standard of living and the economic productivity in a country, nonetheless, it does not account for income inequalities.

Poverty Headcount Ratio **** expresses the percentage of the population of a country living on less than a certain amount per day. In analyzing the Poverty Headcount Ratio, I will utilize indices at $2.15, $3.65, and $6.85 at 2017 international prices. These thresholds help to effectively distinguish between extreme poverty and the various levels of economic hardship faced by middle-income countries:

  • $2.15 per day: Represents extreme poverty, where individuals struggle to meet basic survival needs.
  • $3.65 per day: Reflects a higher level of poverty, often used to assess poverty in lower middle-income countries.
  • $6.85 per day: Indicates poverty among upper middle-income countries, capturing those facing economic challenges but not in extreme poverty.

To do this, I separately downloaded the historical data for the below indices in XLS format and formatted them into a single XLS file, with each of the below indicators formatted as a separate sheet.

Next, I loaded the XLS into my notebook:

# Load country indicator data
file_path = 'country_indicators.xlsx'

gdp_df = pd.read_excel(file_path, sheet_name='GDP')
gdp_pc_df = pd.read_excel(file_path, sheet_name='GDP_PC')
pop_df = pd.read_excel(file_path, sheet_name='Population')
phr_685_df = pd.read_excel(file_path, sheet_name='PHR_6.85')
phr_365_df = pd.read_excel(file_path, sheet_name='PHR_3.65')
phr_215_df = pd.read_excel(file_path, sheet_name='PHR_2.15')

I also did the dirty job of mapping the National Olympic Committee (NOC) codes of the medals dataset to the 3-letter ISO country codes of the economic indicators dataset.

NOCs are three-letter abbreviations that characterize each group of athletes that participate in the Olympic Games. Usually, the NOC is the same as the 3-letter ISO country code, but this is not always the case. Some NOCs represent territories that are not fully recognized as independent countries. For example, Puerto Rico, Hong Kong, and Guam have their own NOCs and compete separately in the Olympics, even though they are territories of the United States and China respectively. Even for the NOCs that correspond to an actual UN country, the NOC code and the country code are not always the same 🤷‍♀️. For instance, Germany’s NOC is ‘GER’, but the country code is ‘DEU’.

So, I did my best to match the NOCs with the respective country codes. This is the mapping I used – for the rest of the countries I considered the NOC to be identical to the 3-letter ISO country code.

noc_to_iso_mapping = {
    'GER': 'DEU',  # Germany
    'SUI': 'CHE',  # Switzerland
    'NED': 'NLD',  # Netherlands
    'TPE': 'TWN',  # Chinese Taipei (Taiwan)
    'POR': 'PRT',  # Portugal
    'MAS': 'MYS',  # Malaysia
    'BRN': 'BHR',  # Bahrain
    'BUR': 'BFA',  # Burkina Faso
    'GRE': 'GRC',  # Greece
    'URU': 'URY',  # Uruguay
    'GRN': 'GRD',  # Grenada
    'SLO': 'SVN'   # Slovenia
}

And finally, I merged everything into a single dataframe. To do this, I consider that the count of medals at each Olympic event should be associated with the country indicators of the previous year. For instance, the 2024 Paris Olympics are associated with country indicators for 2023. For the correlation analysis between Olympic medals and country indicators, I will only be focusing in Summer Olympics.

# List of dataframes and corresponding value names
dataframes = {
    'gdp_df': 'GDP',
    'gdp_pc_df': 'GDP per Capita',
    'pop_df': 'Population',
    'phr_685_df': 'Poverty Headcount Ratio at $6.85 a day',
    'phr_365_df': 'Poverty Headcount Ratio at $3.65 a day',
    'phr_215_df': 'Poverty Headcount Ratio at $2.15 a day'
}

# For each Olympic year, map with the country indicators of the previous year
years_mapping = {str(int(year) - 1): year for year in summer_years}

# Loop through each dataframe
for df_name, value_name in dataframes.items():
    df = globals()[df_name] 
    df = df[['Country Code'] + list(years_mapping.keys())].copy()  # Select relevant columns
    df.rename(columns=years_mapping, inplace=True)  

    # Melt
    df_melted = df.melt(id_vars=['Country Code'], var_name='Year', value_name=value_name)

    # Add NOC using the NOC/ Coundty Code mapping
    df_melted['NOC'] = df_melted['Country Code'].map(noc_to_iso_mapping).fillna(df_melted['Country Code'])

    globals()[df_name + '_melted'] = df_melted  

# Merge everything into a single dataframe
combined_data = combined_df.copy()
melted_dfs = ['gdp_df_melted', 'gdp_pc_df_melted', 'pop_df_melted', 'phr_685_df_melted', 'phr_365_df_melted', 'phr_215_df_melted']

for melted_df_name in melted_dfs:
    melted_df = globals()[melted_df_name] 
    value_name = dataframes[melted_df_name.replace('_melted', '')] 

    combined_data = pd.merge(
        combined_data,
        melted_df[['NOC', 'Year', value_name]], 
        left_on=['NOC', 'Year'],
        right_on=['NOC', 'Year'],
        how='left'
    )
combined_data = combined_data.dropna().reset_index(drop = True)

country_code_mapping = gdp_df[['Country Code', 'Country Name']].drop_duplicates().set_index('Country Code')['Country Name'].to_dict()
combined_data['Country Code'] = combined_data['NOC'].map(noc_to_iso_mapping).fillna(combined_data['NOC'])
combined_data['Country Name'] = combined_data['Country Code'].map(country_code_mapping)

Our dataframe should be looking like this:

Image by author
Image by author

Who’s winning all the Olympic medals?🏅

Now we have prepared the data, let’s take a look at the distribution of medals in this year’s Olympics – Paris 2024 Olympic Games. We can easily create a stacked bar Plotly chart by:

# Filter 2024 Paris Olympics 
df_2024_paris = combined_df[(combined_df['Year'] == '2024') & (combined_df['City'].str.lower() == 'paris')]

# Sort by total number of medals and only keep top 20 
top_20_countries = df_2024_paris.sort_values(by='Total', ascending=False).head(20)

# Melt
df_melted = top_20_countries.melt(id_vars=['NOC'], value_vars=['Gold', 'Silver', 'Bronze'], 
                                  var_name='Medal Type', value_name='Count')

# Custom fancy colors per medal type
custom_colors = {
    'Gold': 'GoldenRod',
    'Silver': 'Silver',
    'Bronze': 'Lightsalmon'
}

# Plotly stacked bar chart 
fig = px.bar(
    df_melted,
    x='NOC',
    y='Count',
    color='Medal Type',
    color_discrete_map=custom_colors,
    title='Top 20 Countries by Total Medal Counts for 2024 Olympics in Paris',
    labels={'Count': 'Medal Count', 'NOC': 'NOC'},
)
fig.show()
Image by author
Image by author

Next, I took a look at the countries that historically won the most medals. To effectively visualize this data, I plotted the top 5 countries historically winning the most medals in a bump chart. In particular, a bump chart displays the ranking of each variable at a specific point in time. Thus, it allows us to easily communicate which countries are historically dominating the Olympic medal podium. To create a bump chart in Plotly, we are going to use the following function:

def create_bump_chart_with_ranks(df, season, olympic_years, top_n=5):

    # Convert Year to numeric
    df['Year'] = pd.to_numeric(df['Year'])

    # Calculate total medals per NOC
    total_medals_df = df.groupby('NOC')['Total'].sum().reset_index()

    # Calculate ranks for top countries
    top_countries = total_medals_df.nlargest(top_n, 'Total')
    top_df = df[df['NOC'].isin(top_countries['NOC'])]
    top_df['Rank'] = top_df.groupby('Year')['Total'].rank(ascending=False, method='dense')
    top_df = top_df[top_df['Rank'] <= top_n]

    # Sort by year
    top_df = top_df.sort_values('Year')
    fig = go.Figure()

    # Add lines and markers for each country
    for country in top_df['NOC'].unique():
        country_df = top_df[top_df['NOC'] == country]
        fig.add_trace(go.Scatter(
            x=country_df['Year'],
            y=country_df['Rank'],
            mode='lines+markers+text',
            text=country_df['Rank'],
            textposition='top center',
            marker=dict(size=10, line=dict(width=2)),
            name=country
        ))

    fig.update_yaxes(autorange='reversed')

    # Adjust layout 
    fig.update_layout(
        title=f'Rank Changes Over Time for Top {top_n} Countries - {season} Olympics',
        xaxis_title='Year',
        yaxis_title='Rank (1 is highest)',
        yaxis=dict(tickmode='linear', tick0=1, dtick=1),
        xaxis=dict(
            type='linear',
            tickmode='array',
            tickvals=sorted(list(olympic_years)),  # Convert the set to a sorted list for tick values
            ticktext=[str(year) for year in sorted(list(olympic_years))]
        ),
        showlegend=True
    )
    fig.show()

Then, I plotted the Summer and Winter Olympics results separately:

summer_df = combined_df.loc[combined_df['Season'] == 'Summer'].copy() 
create_bump_chart_with_ranks(summer_df, "Summer", summer_years)
Image by author
Image by author

The bump chart reveals that the USA consistently holds the top rank in the Summer Olympics. China (CHN) and Russia (RUS) are also consistently strong players, frequently for the second and third positions, until Russia’s ban from the event in 2022.

winter_df = combined_df.loc[combined_df['Season'] == 'Winter'].copy() 
create_bump_chart_with_ranks(winter_df, "Winter", winter_years)
Image by author
Image by author

In the Winter Olympics, Germany (GER), USA, and Norway (NOR), seem to be the dominant countries.

Are all the winners rich? 💰

First of all, I took a look into the correlation between the total number of medals and the different country indicators.

corr_df = combined_data[['Total', 'GDP', 'GDP per Capita', 'Population',
                         'Poverty Headcount Ratio at $6.85 a day',
                         'Poverty Headcount Ratio at $3.65 a day',
                         'Poverty Headcount Ratio at $2.15 a day']]

# Calculate correlation with country indicators
correlations = corr_df.corr()['Total'][1:] 
print(correlations)
Image by author
Image by author

There is a strong positive correlation between GDP and the total number of medals, indicating that countries with higher GDP tend to win more medals, likely due to greater resources available for training, facilities, and athlete development. On top of this there is also a moderate positive correlation between population size and the total number of medals. One explanation for this would be that countries with larger populations may have a greater pool of potential athletes, which can contribute to winning more medals. Additionally, there are weak negative correlations with all three Poverty Headcount Ratios indicating that higher poverty rates may be associated with winning fewer medals.

Due to the high correlation of total number of medals with the country’s GDP I considered it interesting to also run a regression model, and see how it fits the data.

features = ['GDP']
target = 'Total'

X = combined_data[features]
y = combined_data[target]

# Split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Lasso Regression model
lasso_model = Lasso(alpha=0.1) 
lasso_model.fit(X_train, y_train)
y_pred = lasso_model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
rmse = math.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f'MSE: {mse}')
print(f'RMSE: {rmse}')
print(f'R2: {r2}')
Image by author
Image by author

R2 score of 0.71 indicates a moderately strong relationship between GDP and the number of medals won. The MSE of 205.82 and RMSE of 14.35 suggest that while the model captures the overall trend, there is some variability not explained by GDP alone. Nonetheless, GDP is undeniably a strong driver for the total number of medals won by each country,

We can also plot the regression model results in Plotly:

# Create scatter plot
fig = px.scatter(combined_data, x='GDP', y='Total', trendline='ols')

# Adjust layout
fig.update_layout(
    title='Relationship between Total Medals and GDP_total',
    xaxis_title='GDP_total',
    yaxis_title='Total Medals'
)
fig.data[1].update(line=dict(color='red', dash='dot'))
fig.show()
Image by author
Image by author

Finally, in an effort to understand why the Poverty Headcount Ratios don’t have stronger correlations with the total number of medals, I visualized them in a scatter plot. In particular, Poverty Ratio at $6.85 is visualized in x- axis of the chart, Poverty Ratio at $3.65 is illustrated by the bubble size, and poverty Ratio at $2.15 is shown by the color scale.

hover_template = (
    "Country Name: %{customdata[1]}<br>"
    "Total Medals: %{y}<br>"
    "Year: %{customdata[0]}<br>"
    "City: %{customdata[2]}<br>"
    "Poverty Ratio at $6.85/day: %{x}<br>"
    "Poverty Ratio at $3.65/day: %{marker.size}<br>"
    "Poverty Ratio at $2.15/day: %{marker.color}<extra></extra>"
)
custom_data = combined_data[['Year', 'Country Name', 'City']]

# Create a bubble chart 
fig = px.scatter(
    combined_data,
    x='Poverty Headcount Ratio at $6.85 a day',
    y='Total',
    size='Poverty Headcount Ratio at $3.65 a day',  
    color='Poverty Headcount Ratio at $2.15 a day', 
    title='Bubble Chart: Medals vs. Poverty Headcount Ratios',
    labels={
        'Poverty Headcount Ratio at $6.85 a day': 'Poverty Ratio at $6.85/day',
        'Poverty Headcount Ratio at $3.65 a day': 'Poverty Ratio at $3.65/day',
        'Poverty Headcount Ratio at $2.15 a day': 'Poverty Ratio at $2.15/day',
        'Total': 'Total Medals',
    },
    size_max=30,  
    custom_data=custom_data 
)
fig.update_traces(hovertemplate=hover_template)
fig.update_layout(
    width=1400,   
    height=600    
)
fig.show()
Image by author
Image by author

Interestingly, the plot reveals that China and Russia, which have historically excelled in the Olympics, also exhibited high poverty headcount ratios during a significant number of Olympic events. The plot also shows that countries with extreme poverty rates, such as Ethiopia or Uzbekistan, have occasionally managed to secure multiple medals.

On my mind

So, is Olympic success reserved for the wealthy? Oh, I think that for the most part it is. The correlation analysis clearly indicates that a country’s GDP is an undeniably major factor driving the performance of the country in the Olympic Games. Wealthier countries typically have greater resources to invest in sports infrastructure, training facilities, and athlete development programs. However, other factors also play a crucial role. For instance, countries like Russia or China, despite higher poverty levels, seem to prioritize state investment in sports programs, facilities, and athlete training as part of national pride and international prestige. On top of this, specific sports, like long-distance running for Ethiopia or wrestling for Uzbekistan, can provide opportunities for countries with very limited resources to excel due to cultural traditions, or natural talent pools. Nevertheless, GDP remains a significant driver that cannot be overlooked.


✨Thank you for reading!✨


Loved this post? Let’s be friends!

💌 Join me on Substack or LinkedIn

💼 Hire me on Upwork!

Buy me a coffee!

or take a look at my other data science tutorials:

From Data to Dashboard: Visualizing the Ancient Maritime Silk Road with Dash Leaflet and SeaRoute…

Was Michael Scott the World’s Best Boss?


Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

Write for TDS

Related Articles