Sports Analytics

Intro
A few days ago I published my first sports analytics post. Still hooked on the topic, here I am again, writing about football.
In that post – linked below – I used frequentist stats to demonstrate the randomness of goal events. But I took it further. The random model explained there – influenced by the Poisson distribution – is applicable in many other fields unrelated to football.
Today we’ll move one step forward and, even though it will be football-centered, the process and knowledge we’ll be going through will be relevant for any data scientist.
Football-wise, we’ll focus on defense and try to analyze Barça’s to see where it could have gone better, both on a team and individual level.
As defense is a broad term – it includes tackles, saves, blocks, and many other advanced stats – I’ll be more concrete and focus solely on shots and goals conceded.
In the 2015–16 La Liga season, Barça conceded the second-fewest goals (29), behind only Atlético (18). That’s not bad at all, but there’s still room for improvement.
The goal is not to provide solutions; that’s the coaching staff’s job. Our goal today as data scientists or sports analysts is to find the problems and hypothesize so that the staff can take this info and solve the problems on the pitch.
So here’s a brief summary of what we’re going to go through today:
- Background and Context.
- Get the data, transform, and prepare it.
- Analyze shots against and goals conceded by FCB.
- Go even deeper by checking shots and goals conceded on a player level.
As all the code will be available on my GitHub repo [1], linked in the resources section, I will be skipping some of it here to avoid having big code snippets distracting readers from the content itself.
Context
No data science problem can be solved if we don’t have the context. We need to deeply understand the data we’re working with. If not, we cannot draw useful conclusions.
The data needs a context to become information.
So let us go back in time. The 2015–16 La Liga campaign was an interesting one. F.C. Barcelona won with 91 points followed by Real Madrid just one point behind (90) and Atlético de Madrid with 88.
The last match day was going to determine it all. But all three won their respective games so the standings didn’t change. Barça won the title.
In that season, Barça went on to win La Copa del Rey and the UEFA Super Cup too but failed miserably in the Champions League and Supercopa de España. In the first one, they got eliminated by Atlético de Madrid in the quarter-finals, and in the second one, they lost by an aggregate 5–1 against Athletic Club.
Clearly, Barça’s defense had issues. In that La Liga campaign, they conceded 11 more goals than Atlético de Madrid (a 61% increase). It looks like the MSN attack (Messi-Suarez-Neymar) compensated, in most games, for the team’s defense.
But miracles don’t exist.
The 4 usual defensive starters were Alves, Piqué, Mascherano, and Alba. These were world-class defenders, but they obviously didn’t play all minutes. Substitutes were Mathieu, Roberto, Adriano, Bartra, Vermaelen… The level dropped quite a bit if we looked at the bench.
Can we blame the substitutes? Maybe not. Let me point out that both Alves and Alba were pretty offensive full-backs… Did that make their team concede more shots or goals?
Our task today is to analyze the defense and see if we can find any potential flaws.
Get the Data, Transform and Prepare It
No real-world raw data will ever be cleaned and prepared to be used by a data scientist. Most of the time – or code – will be used to treat the dataset and transform it to obtain the exact data we need for our project.
For today’s purposes, we’ll be using StatsBomb’s free open data [2], along with the statsbombpy [3] module to work with it.
You can just install it by running:
pip install statsbombpy
And we’re going to be using a few more modules that you might need to also install if you haven’t already. Then, import them:
import matplotlib
import matplotlib.pyplot as plt
from mplsoccer import VerticalPitch
import numpy as np
import pandas as pd
from statsbombpy import sb
We’re now ready to retrieve the data. As we need to inspect shots and goals against FCB in 2015–16, we first need to get all the matches from that competition and season:
# Fetch the competitions table once and pick the 2015/16 La Liga row
competitions = sb.competitions()
competition_row = competitions[
    (competitions['competition_name'] == 'La Liga')
    & (competitions['season_name'] == '2015/2016')
]
competition_id = pd.unique(competition_row['competition_id'])[0]
season_id = pd.unique(competition_row['season_id'])[0]
# All matches from that competition and season
matches = sb.matches(competition_id=competition_id, season_id=season_id)
Then, for each match, we can easily retrieve its events by using:
match_events = sb.events(match_id=match_id)
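Before looping over matches, the matches table has to be narrowed down to a single team. Here is a minimal sketch of that step, assuming the matches data frame exposes home_team and away_team as plain team-name columns and that the team is labelled 'Barcelona' (the same label used for the lineups lookup later in this post). The helper select_team_matches is hypothetical and not part of statsbombpy, and the example rows are made up:

```python
import pandas as pd


def select_team_matches(matches: pd.DataFrame, team: str) -> pd.DataFrame:
    """Keep only the matches where `team` played at home or away."""
    is_home = matches['home_team'] == team
    is_away = matches['away_team'] == team
    return matches[is_home | is_away].reset_index(drop=True)


# Tiny made-up example (not real StatsBomb data)
example_matches = pd.DataFrame({
    'match_id': [1, 2, 3],
    'home_team': ['Barcelona', 'Sevilla', 'Valencia'],
    'away_team': ['Getafe', 'Barcelona', 'Villarreal'],
})
team_matches = select_team_matches(example_matches, 'Barcelona')
print(team_matches['match_id'].tolist())
```

With the real data, each match_id in the filtered frame would then be passed to sb.events(match_id=match_id) and the resulting frames concatenated.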
So we extracted all events from all of Barça’s matches and created a data frame containing them: all_events. We also created two additional columns because they would be useful later in the analysis:
- One called minutes, which had the same value for all rows within the same match: its duration.
- The other one, called time, which was simply the concatenation of the values in the columns minute and second.
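The two extra columns can be sketched as follows. This is only an illustration under stated assumptions: the match duration is approximated by the latest event minute, and time uses a zero-padded 'MM:SS' format. The repo's actual implementation may differ, and add_time_columns is a hypothetical helper:

```python
import pandas as pd


def add_time_columns(match_events: pd.DataFrame) -> pd.DataFrame:
    """Add a match-level `minutes` column and a per-event `time` ('MM:SS') column."""
    out = match_events.copy()
    # Assumption: the match duration is approximated by the latest event minute
    out['minutes'] = out['minute'].max()
    # Zero-pad both parts so string comparison is chronological (for minutes below 100)
    out['time'] = (
        out['minute'].astype(str).str.zfill(2)
        + ':'
        + out['second'].astype(str).str.zfill(2)
    )
    return out


# Made-up events from a single match
example_events = pd.DataFrame({
    'match_id': [1, 1, 1],
    'minute': [0, 9, 93],
    'second': [5, 30, 12],
})
with_time = add_time_columns(example_events)
print(with_time[['minutes', 'time']])
```

Zero-padding matters because the later steps compare these values as strings: without it, '9:30' would sort after '45:00'.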
Finally, the all_events data frame got filtered to keep only shots made by the opponent team. The result:
shots_against_team.head()

StatsBomb’s data is amazing. Complete and accurate. What’s useful for us today is the location where the shot happened, outlined by the columns x and y. So we can use that to visualize them.
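The filtering step that produces the shots data frame could look roughly like this. It is a sketch, assuming the events frame has type and team columns holding plain names and a location column holding [x, y] pairs. The helper shots_against is hypothetical and the rows are made up:

```python
import pandas as pd


def shots_against(all_events: pd.DataFrame, team: str) -> pd.DataFrame:
    """Return the shots taken by opponents of `team`, with separate x/y columns."""
    is_shot = all_events['type'] == 'Shot'
    by_opponent = all_events['team'] != team
    shots = all_events[is_shot & by_opponent].copy()
    # Assumption: `location` holds [x, y] pairs in pitch coordinates
    shots['x'] = shots['location'].apply(lambda loc: loc[0])
    shots['y'] = shots['location'].apply(lambda loc: loc[1])
    return shots.reset_index(drop=True)


# Made-up events (not real StatsBomb data)
example_events = pd.DataFrame({
    'type': ['Pass', 'Shot', 'Shot', 'Shot'],
    'team': ['Barcelona', 'Barcelona', 'Getafe', 'Getafe'],
    'location': [[60.0, 40.0], [110.0, 38.0], [105.0, 30.0], [112.0, 44.0]],
    'shot_outcome': [None, 'Goal', 'Saved', 'Goal'],
})
example_shots = shots_against(example_events, 'Barcelona')
print(example_shots[['team', 'x', 'y', 'shot_outcome']])
```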
Analyze shots and goals conceded by FCB
Once we have all the data we need, we can start our analysis. As usual, plotting is the first thing I do because it’s the best way to understand the data we’re working with.
We’ll be using the VerticalPitch class from the mplsoccer[4] module to show where shots and goals come from:
# Set up pitch (layout)
pitch = VerticalPitch(line_zorder=2, line_color='black', half=True)
fig, axs = pitch.grid(nrows=1, ncols=1, axis=False, endnote_height=0.05)
# Plot each shot
for row in shots_against_team.itertuples():
    if row.shot_outcome == 'Goal':
        # If it was a goal, we want to see it clearly
        alpha = 1
    else:
        # Increase transparency if it wasn't a goal
        alpha = 0.2
    pitch.scatter(
        row.x,
        row.y,
        alpha=alpha,
        s=100,
        color="red",
        ax=axs['pitch'],
        edgecolors="black"
    )
This simple code allows us to plot half of a football pitch and place the shots in red, with varying transparency depending on whether it was a goal or not. Additionally, I added two extra dots for the average shot and goal positions (green and blue, respectively).
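The two average positions are just column means. A minimal sketch on made-up coordinates, assuming the same x, y and shot_outcome columns as above:

```python
import pandas as pd

# Made-up shots (not real StatsBomb data)
example_shots = pd.DataFrame({
    'x': [100.0, 110.0, 114.0, 96.0],
    'y': [30.0, 42.0, 38.0, 50.0],
    'shot_outcome': ['Saved', 'Goal', 'Goal', 'Blocked'],
})

# Average position over every shot
avg_shot = example_shots[['x', 'y']].mean()

# Average position over goals only
example_goals = example_shots[example_shots['shot_outcome'] == 'Goal']
avg_goal = example_goals[['x', 'y']].mean()

print(avg_shot['x'], avg_shot['y'])
print(avg_goal['x'], avg_goal['y'])
```

Each pair can then be drawn with one more pitch.scatter call, using a different color from the red shot markers.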

Both averages are quite centered (shifted slightly towards the left), and the average goal position is closer to the goal than the average shot position. It will come as no surprise that the closer the shot, the easier it is to score.
Moving on to the shots, there’s just one thing we cannot avoid looking at: when outside the penalty area, shots seem to come more from the right side (though it’s pretty even). But inside the area, shots are clearly skewed to the left.
That’s Alves and Piqué’s side.
If we focus on goals, the ones from the left side are more spread while the ones on the right half of the plot look more centered or clustered.
The code snippet below creates a shot heatmap. Applying the same steps to goals gives a second heatmap, and together they better illustrate how shots and goals are distributed across different areas of the pitch.
pitch = VerticalPitch(line_zorder=2, line_color='black', half=True)
fig, axs = pitch.grid(nrows=1, ncols=2, axis=False, endnote_height=0.05)
# Count shots per heatmap bin
shot_bin_statistic = pitch.bin_statistic(
    shots_against_team.x,
    shots_against_team.y,
    statistic='count',
    bins=(6, 5),
    normalize=False
)
# Normalize by number of games
shot_bin_statistic["statistic"] = shot_bin_statistic["statistic"] / len(team_matches)
# Draw the heatmap on the first pitch
pcm = pitch.heatmap(shot_bin_statistic, cmap='Reds', edgecolor='grey', ax=axs['pitch'][0])
# Add a colorbar as the plot's legend
ax_cbar = fig.add_axes((-0.05, 0.093, 0.03, 0.786))
cbar = plt.colorbar(pcm, cax=ax_cbar)
axs['pitch'][0].set_title('Shots conceded heatmap')
fig.suptitle(f"Shots and Goals Against {team} in 2015/16 La Liga season", fontsize=30)
And if we plot both heatmaps, we get:

Now, obviously, most shots and goals come from the center and the area closest to the goal. Once the ball is there, only shooting makes sense.
The shot heatmap is almost perfectly symmetric because we split the pitch width into only 5 bins (columns). If we had chosen more, we’d probably see the left skew better.
But we can appreciate it in the goal heatmap. Barça conceded more goals from the Alves-Piqué’s side than from Mascherano-Alba’s.
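The effect of the bin count on apparent symmetry can be illustrated without any plotting. The sketch below uses made-up y coordinates on StatsBomb's 0 to 80 pitch width and a hypothetical helper, width_profile. It shows a distribution that looks symmetric with 5 strips and lopsided with 10:

```python
import numpy as np

PITCH_WIDTH = 80  # StatsBomb pitch coordinates: y runs from 0 to 80


def width_profile(y_values, n_bins):
    """Count events in `n_bins` equal strips across the pitch width."""
    counts, _ = np.histogram(y_values, bins=n_bins, range=(0, PITCH_WIDTH))
    return counts


# Made-up shot y coordinates, slightly more frequent just below the center line (y = 40)
example_y = [12, 25, 33, 35, 38, 39, 41, 44, 52, 70]

coarse = width_profile(example_y, 5)
fine = width_profile(example_y, 10)
print(coarse.tolist())  # [1, 1, 6, 1, 1]
print(fine.tolist())    # [0, 1, 0, 1, 4, 2, 1, 0, 1, 0]
```

With an odd number of strips, the central strip swallows both sides of the center line, which is one way a skew can hide.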
When analyzing a team’s defense, several things matter. The ultimate goal is to see where they failed and how we can reduce the number of goals conceded. A team can do it in many ways, but reducing the number of shots conceded is clearly a way to mitigate the number of goals you receive.
However, we haven’t linked shots to goals yet today. And we’re going to do so by building a new KPI, today’s first one.
Key Performance Indicators, or KPIs, are extremely important in any data science or analysis project. Selecting the proper KPIs will let us assess and properly evaluate the strategies we’re using, while doing the opposite – selecting inaccurate ones – will result in useless and misleading analyses.
Our first KPI today will be the goal-per-shot ratio, as a way to inspect where rivals have more chances of scoring when shooting:
# Count goals per heatmap bin
goal_bin_statistic = pitch.bin_statistic(
    shots_against_team.loc[shots_against_team['shot_outcome'] == 'Goal'].x,
    shots_against_team.loc[shots_against_team['shot_outcome'] == 'Goal'].y,
    statistic='count',
    bins=(6, 5),
    normalize=False
)
# Count shots per heatmap bin
shot_bin_statistic = pitch.bin_statistic(
    shots_against_team.x,
    shots_against_team.y,
    statistic='count',
    bins=(6, 5),
    normalize=False
)
# Create goal_shot_ratio KPI by dividing goals/shots
goal_shot_ratio = goal_bin_statistic.copy()
goal_shot_ratio['statistic'] = np.divide(goal_bin_statistic['statistic'], shot_bin_statistic['statistic'])
goal_shot_ratio['statistic'] = np.nan_to_num(goal_shot_ratio['statistic'])
Below is this new statistic plotted in a heatmap:

We can clearly see the left skewness we previously mentioned. When the other team shot from Alves and Piqué’s side, within the penalty area, the chance of conceding a goal was almost twice the probability of conceding from Mascherano and Alba’s side.
That doesn’t mean Piqué was worse than Mascherano; I certainly don’t think so. We’re just saying that shooting from the left side led to more goals than shooting from the right side.
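A note on the goal-per-shot ratio: bins with zero shots produce 0/0, which NumPy reports with a runtime warning before nan_to_num cleans it up. A possible alternative, sketched here on made-up counts with a hypothetical helper name, is to divide only where shots exist:

```python
import numpy as np


def goal_shot_ratio(goal_counts, shot_counts):
    """Goals divided by shots per bin; bins without shots get a ratio of 0."""
    goals = np.asarray(goal_counts, dtype=float)
    shots = np.asarray(shot_counts, dtype=float)
    ratio = np.zeros_like(goals)
    # Only divide where at least one shot was recorded; other bins keep the 0 default
    np.divide(goals, shots, out=ratio, where=shots > 0)
    return ratio


# Made-up 2x2 grids of goal and shot counts
example_goals = [[1, 0], [2, 0]]
example_shots = [[4, 0], [5, 3]]
ratio = goal_shot_ratio(example_goals, example_shots)
print(ratio)
```

Keep in mind that bins with very few shots give noisy ratios, so a high value in a sparsely populated bin deserves caution.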
Going deeper – Analyzing players
I feel like we’ve been blaming Piqué and Alves and we haven’t even compared their individual numbers yet. For that reason, we’re going to go deeper here and analyze the players’ individual performances, again, in terms of shots and goals conceded.
What if that lack of symmetry was caused by bench players substituting for these world-class starters when they were tired or injured? What if, on an individual level, their performance in terms of shots and goals was the same?
That’s what we’re going to analyze now, but we first need to prepare the data. We need to know how many minutes a player played in each game, how many goals the team conceded when they were on the pitch…
We’ll use statsbombpy’s lineups to get that, combined with the previous events data frame we already built:
all_lineups = None
for match_id in pd.unique(all_events['match_id']):
    match_lineups = sb.lineups(match_id=match_id)['Barcelona']
    match_lineups['match_id'] = match_id
    match_lineups['match_duration'] = all_events[
        all_events['match_id'] == match_id
    ]['minutes'].unique()[0]
    # First 'from' timestamp, or NaN if the player has no recorded positions
    match_lineups['from'] = match_lineups['positions'].apply(
        lambda x: x[0]['from'] if x else np.nan
    )
    # Last 'to' timestamp; a missing 'to' is replaced with '90:00'
    match_lineups['to'] = match_lineups.apply(
        lambda x: x['positions'][-1]['to']
        if x['positions'] and x['positions'][-1]['to'] is not None
        else ('90:00' if x['positions'] else np.nan),
        axis=1
    )
    match_lineups['minutes_played'] = match_lineups.apply(
        lambda x: parse_positions(x['positions'], x['match_duration']),
        axis=1
    )
    if all_lineups is None:
        all_lineups = match_lineups.copy()
    else:
        all_lineups = pd.concat([all_lineups, match_lineups], join="inner")
all_lineups = all_lineups.reset_index(drop=True)
We’re just adding some new useful columns and using a custom function, parse_positions(), to compute the number of minutes each player was on the pitch during that match (accounting for added time as well).
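The custom function itself lives in the repo [1]. As a rough idea of what such a helper might do, here is a hypothetical stand-in (named parse_positions_sketch to keep it apart from the real one), assuming each position entry is a dict with 'from' and 'to' timestamps in 'MM:SS' format and that a missing 'to' means the player stayed on until the final whistle:

```python
def to_minutes(timestamp):
    """Convert an 'MM:SS' string to minutes as a float."""
    minutes, seconds = timestamp.split(':')[:2]
    return int(minutes) + float(seconds) / 60


def parse_positions_sketch(positions, match_duration):
    """Sum the minutes covered by every spell a player spent on the pitch."""
    total = 0.0
    for spell in positions:
        start = to_minutes(spell['from'])
        # Assumption: 'to' is None when the player was still on at the end
        end = match_duration if spell['to'] is None else to_minutes(spell['to'])
        total += max(end - start, 0.0)
    return total


# Made-up examples: a starter who is replaced, and the substitute who comes on
starter = [{'from': '00:00', 'to': '60:30'}]
substitute = [{'from': '60:30', 'to': None}]
print(parse_positions_sketch(starter, 95))
print(parse_positions_sketch(substitute, 95))
print(parse_positions_sketch([], 95))
```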

Now we can move on and count the number of shots and goals the team received when each player was on the pitch.
for match_id in pd.unique(all_lineups['match_id']):
    match_shots = shots_against_team[
        shots_against_team['match_id'] == match_id
    ]
    for player_tup in all_lineups[all_lineups['match_id'] == match_id].itertuples():
        # 'from' is a Python keyword, so itertuples() renames that column
        # to its positional name, '_10'
        shots_conceded = match_shots[
            (match_shots['time'] >= player_tup._10)
            & (match_shots['time'] <= player_tup.to)
        ]
        goals_conceded = len(
            shots_conceded[shots_conceded['shot_outcome'] == 'Goal']
        )
        shots_conceded = len(shots_conceded)
        all_lineups.at[player_tup.Index, 'shots_conceded'] = shots_conceded
        all_lineups.at[player_tup.Index, 'goals_conceded'] = goals_conceded
This is how the new columns look on the all_lineups data frame:

This is useful because it allows us to see how the team performed in each game when certain players were on the pitch. Let’s take it further.
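A side note on the _10 attribute used in the loop above: pandas builds the row tuples as namedtuples, and column names that are not valid Python identifiers (such as the keyword from) get replaced by their position. The sketch below reproduces this on a made-up three-column frame and shows one way around it, renaming the columns first. The names from_time and to_time are my own choice:

```python
import pandas as pd

# Made-up lineup row
lineup = pd.DataFrame({
    'player_name': ['Player A'],
    'from': ['00:00'],
    'to': ['63:10'],
})

# 'from' is the third field (after Index and player_name), so it becomes '_2'
row = next(lineup.itertuples())
print(row._fields)

# Renaming the column up front keeps attribute access readable
renamed = lineup.rename(columns={'from': 'from_time', 'to': 'to_time'})
renamed_row = next(renamed.itertuples())
print(renamed_row.from_time, renamed_row.to_time)
```

Renaming also makes the loop robust to column reordering, since the positional name changes whenever a column is added or removed before it.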
Apart from the KPI we created in the previous section, we’re now going to generate two new KPIs:
- Shots per minute
- Goals per minute
grouped = all_lineups.groupby('player_id')[
    ['minutes_played', 'shots_conceded', 'goals_conceded']
].sum()
grouped['shots_per_minute'] = (
    grouped['shots_conceded'] / grouped['minutes_played']
)
grouped['goals_per_minute'] = (
    grouped['goals_conceded'] / grouped['minutes_played']
)
If we remove those players whose average played minutes per game is below 10 minutes and sort the data frame by the shots_per_minute variable, here’s what we get:

This couldn’t be clearer. Five of the top six rows were bench defenders: Bartra, Adriano, Mathieu, Roberto, and Vermaelen. This is crazy.
No wonder Luis Enrique preferred Piqué, Mascherano, Alves, and Alba.
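The filter-and-sort step behind that table can be sketched like this. It assumes "average played minutes per game" means total minutes divided by the number of matches in which the player appears in the lineups frame, which is my reading rather than something the post states. The helper name and the data are made up:

```python
import pandas as pd


def rank_by_shots_per_minute(lineups: pd.DataFrame, min_avg_minutes: float = 10) -> pd.DataFrame:
    """Aggregate per player, drop low-minute players, sort by shots conceded per minute."""
    grouped = lineups.groupby('player_id').agg(
        minutes_played=('minutes_played', 'sum'),
        shots_conceded=('shots_conceded', 'sum'),
        goals_conceded=('goals_conceded', 'sum'),
        matches=('match_id', 'nunique'),
    )
    grouped['shots_per_minute'] = grouped['shots_conceded'] / grouped['minutes_played']
    # Assumption: the average is taken over matches where the player is listed
    avg_minutes = grouped['minutes_played'] / grouped['matches']
    kept = grouped[avg_minutes >= min_avg_minutes]
    return kept.sort_values('shots_per_minute', ascending=False)


# Made-up lineups for three players over two matches
example_lineups = pd.DataFrame({
    'player_id': [1, 1, 2, 2, 3, 3],
    'match_id': [10, 11, 10, 11, 10, 11],
    'minutes_played': [90, 90, 5, 7, 45, 60],
    'shots_conceded': [9, 9, 1, 2, 6, 8],
    'goals_conceded': [1, 0, 0, 0, 1, 1],
})
ranked = rank_by_shots_per_minute(example_lineups)
print(ranked[['minutes_played', 'shots_per_minute']])
```

Player 2 averages 6 minutes per match and is dropped, which keeps tiny samples from dominating the ranking.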

Translated to shots per 90 minutes, when Bartra played the team would receive an average of 11.31 shots per match. Compare that to the 8.51 shots per 90 minutes when Mascherano played.
When it comes to goals per 90 minutes, we still have Mathieu, Adriano, and Roberto among the six worst performers. Clearly not efficient.
We also see Piqué there, and this explains the left skewness we saw on the heatmaps.
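The per-90 figures are a simple rescaling of the per-minute KPIs. A small sketch with made-up numbers (per_90 is a hypothetical helper):

```python
def per_90(count, minutes_played):
    """Scale an event count to a rate per 90 minutes on the pitch."""
    if minutes_played <= 0:
        raise ValueError("minutes_played must be positive")
    return count * 90 / minutes_played


# Made-up example: 30 shots conceded over 270 minutes on the pitch
print(per_90(30, 270))
```

Expressing rates per 90 minutes makes players with very different playing time comparable, although small minute totals still give unstable values.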
But let’s put that into a team context. Let’s see the team ratios:
Barcelona conceded 356 shots. An average of 9.37 shots per match.
Barcelona conceded 26 goals. An average of 0.68 goals per match.
This is relevant:
- All players from the left plot except Vermaelen conceded more shots per match than the team’s average. In other words, these 5 players really underperformed defensively. Since most of them are defenders, that means the team’s defense struggled when they were on the pitch. Luckily, that increase in shots didn’t directly translate into goals.
- If we look at the right plot, all 6 are above the team’s average. The team conceded more goals when they were on the pitch. This time, half of them are bench defensive players.
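The team averages and the comparison against them are plain arithmetic. The sketch below reuses the totals quoted above and the two per-90 figures mentioned earlier for Bartra and Mascherano. It treats a per-90 rate and a per-match rate as roughly comparable, which ignores added time:

```python
MATCHES_PLAYED = 38  # a La Liga season has 38 match days

team_shots = 356
team_goals = 26

shots_per_match = team_shots / MATCHES_PLAYED
goals_per_match = team_goals / MATCHES_PLAYED
print(round(shots_per_match, 2), round(goals_per_match, 2))

# Shots conceded per 90 minutes with each player on the pitch (figures from the text)
player_shots_per_90 = {'Bartra': 11.31, 'Mascherano': 8.51}
above_team_average = {
    name: rate > shots_per_match for name, rate in player_shots_per_90.items()
}
print(above_team_average)
```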
Conclusions
All data science projects need their takeaways. It’s probably the most important part – drawing insights and sharing them with the stakeholders is the reason why the previous analysis and research are done.
In our football case, here’s how we could end this analysis and share them with Luis Enrique, that season’s coach:
As a team, we ended up with the second-best defense that season. Not bad, but pretty far from the best one (by an 11-goal difference). We conceded more goals than we’d have liked, and the best way to reduce that number is by reducing the number of shots we allow our rivals to take.
If we focus on those shots and goals, the dangerous ones (within the penalty area) seemed to be skewed toward Piqué’s side. We’re not blaming him, we’d need to see why this happens.
A more in-depth analysis proved that, if we should blame someone, it wouldn’t be the starter defensive players. When on the pitch, the rivals shot less. Whether that translates to fewer goals or not depends on several factors, but the correlation exists.
It’s in fact when players like Vermaelen, Mathieu, Bartra, Adriano, and/or Roberto played that the team received the biggest amount of shots and goals per minute played. So the focus should be, in my opinion, to find better defensive options to support the defensive core built by Alves, Piqué, Mascherano, and Alba, or make the existing ones improve their performances.
Piqué and Mascherano led the defense most of the time and that helped Barça win the trophies they won. But, even if Piqué was the best defender in terms of shots conceded, when rivals shot they had a higher chance of scoring compared to Mascherano. Causality or coincidence, we’ll never know (I’d choose the second one).
If you’re curious about what happened during 2016’s transfer window: Adriano, Bartra, and Vermaelen left the club. Mathieu stayed for one more season but his participation dropped quite a bit, playing only in 16 games throughout all the competitions: he was basically the last option.
Roberto was the only one that remained and kept on having a good amount of minutes.
That summer, Barça also tried to reinforce their squad by signing Umtiti and Digne. If they only knew how that would end…
So, in a way, today’s analysis isn’t new to FCB: they already performed it at the end of the 2015–16 season and saw the same problems we saw in their defense. Somehow, they weren’t able to sign enough quality players to improve their situation and didn’t win next season’s La Liga.
Thanks for reading the post!
I really hope you enjoyed it and found it insightful.
@polmarin
Resources
[1] FCB Defense Analysis Repo – GitHub





