
How to identify outliers in data with Python

An article exploring techniques for outlier detection in datasets. Learn how to use data visualization, z-scores, and clustering techniques…

Photo by Tim Mossholder / Unsplash

Nassim Taleb writes that "tail" events account for a large part of the success (or failure) of phenomena in the world.

Everybody knows that you need more prevention than treatment, but few reward acts of prevention. — N. Taleb, The Black Swan

A tail event is a rare event, the probability of which lies in one of the tails of the distribution, on the left or on the right.

https://www.researchgate.net/figure/A-normal-distribution-curve-with-its-two-tails-Note-that-an-observed-result-is-likely-to_fig2_50196301

According to Taleb, we live our lives focusing primarily on the most plausible events, those that are most likely to happen. By doing this, we are not preparing ourselves to deal with the rare events that might happen.

When rare events happen (especially the negative ones), they take us by surprise, and our usual responses have no effect.

Just think of our behavior when a rare event occurs, such as the bankruptcy of the FTX cryptocurrency exchange, or a powerful earthquake that devastates a region. For those directly involved, the typical reaction is panic.

Anomalies are present everywhere, and when we draw a distribution and its probability function we are actually obtaining useful information to protect ourselves or to implement strategies for these tail events, should they occur.

It is therefore necessary to learn how to identify these anomalies and, above all, to be ready to act when they are observed.

In this article, we will focus on the methods and techniques used to identify outliers (the mentioned anomalies) in data. In particular, we will explore data visualization techniques and the use of descriptive statistics and statistical testing.

The definition of outlier

An outlier is a value that deviates significantly from the other values in the dataset. This deviation can be numerical or even categorical.

For example, a numeric outlier is a value that is much larger or much smaller than most of the other values in the dataset.

A categorical outlier, on the other hand, occurs when labels such as "other" or "unknown" account for a disproportionately large share of the dataset compared with the rest of the labels.
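As a quick illustration, a frequency table makes this kind of categorical anomaly easy to spot. The labels and the 10% cutoff below are hypothetical, chosen only for the example:

```python
import pandas as pd

# Hypothetical label column where a placeholder label is suspiciously frequent
labels = pd.Series(["red"] * 48 + ["blue"] * 27 + ["unknown"] * 25)

# Relative frequency of each label
share = labels.value_counts(normalize=True)

# Flag placeholder labels whose share exceeds an arbitrary 10% cutoff
suspicious = share[share.index.isin(["other", "unknown"]) & (share > 0.10)]
print(suspicious)
```

Here "unknown" makes up a quarter of the column, which is usually worth investigating in the data collection process.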

Outliers can be caused by measurement errors, input errors, transcription errors or simply by data that does not follow the normal trend of the dataset.

In some cases, outliers can be indicative of broader problems in the dataset or the process that produced the data and can offer important insights to the people who developed the data collection process.

Techniques to help us identify outliers in a dataset

There are several techniques that we can use to identify outliers in our data. These are the ones we will touch on in this article:

  • data visualization: identifying anomalies by inspecting the distribution of the data with suitable charts
  • descriptive statistics, such as the interquartile range
  • z-scores
  • clustering techniques: identifying groups of similar data points and spotting any "isolated" or "unclassifiable" points

Each of these methods is valid for identifying outliers and should be chosen based on the nature of our data. Let's look at them one by one.

Data visualization

One of the most common techniques for finding anomalies is through exploratory data analysis and particularly with data visualization.

Using Python, you can use libraries like Matplotlib or Seaborn to visualize the data in such a way that you can easily spot any anomalies.

For example, you can create a histogram or boxplot to visualize the distribution of your data and spot any values that deviate significantly from the mean.
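A minimal sketch of this idea with Matplotlib follows; the sample is made up (values around 50 plus a few extreme readings) purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sample: values around 50 plus a few extreme readings
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(50, 5, 200), [95, 100, 5]])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(data, bins=30)    # extremes show up as isolated bars far from the bulk
ax1.set_title("Histogram")
ax2.boxplot(data)          # extremes appear as points beyond the whiskers
ax2.set_title("Boxplot")
plt.show()
```

The same plots can be produced with Seaborn's `histplot` and `boxplot` if you prefer its styling.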

Image by author.

The anatomy of the boxplot can be understood from this Kaggle post.

https://www.kaggle.com/discussions/general/219871

If you want to read more about how to perform exploratory data analysis (EDA), read this article 👇

Exploratory Data Analysis in Python – A Step-by-Step Process

Use of descriptive statistics

Another method of identifying anomalies is the use of descriptive statistics. For example, the interquartile range (IQR) can be used to identify values that fall far outside the central bulk of the data.

The interquartile range (IQR) is defined as the difference between the third quartile (Q3) and the first quartile (Q1) of the dataset. Outliers are then defined as values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where 1.5 is the coefficient typically used.

The previously discussed boxplot is just one method that uses such descriptive metrics to identify anomalies.

An example in Python for identifying outliers using interquartile range is as follows:

import numpy as np

def find_outliers_IQR(data, threshold=1.5):
    # Find first and third quartiles
    Q1, Q3 = np.percentile(data, [25, 75])
    # Compute IQR (interquartile range)
    IQR = Q3 - Q1
    # Compute lower and upper bound
    lower_bound = Q1 - (threshold * IQR)
    upper_bound = Q3 + (threshold * IQR)
    # Select outliers
    outliers = [x for x in data if x < lower_bound or x > upper_bound]
    return outliers

This function calculates the first and third quartiles of the dataset, then computes the IQR and the lower and upper bounds. Finally, it flags as outliers the values that fall outside those bounds.

This handy function can be used to identify outliers in a dataset and can be added to your toolkit of utility functions in almost any project.
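As a sanity check, the bounds can also be computed by hand on a tiny made-up sample, where Q1 = 10 and Q3 = 12 put the fences at 7 and 15:

```python
import numpy as np

# Tiny made-up sample with two obvious extremes
data = np.array([10, 12, 11, 13, 12, 95, 11, 10, -40])

Q1, Q3 = np.percentile(data, [25, 75])           # Q1 = 10.0, Q3 = 12.0
IQR = Q3 - Q1                                    # 2.0
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR    # 7.0 and 15.0
print(data[(data < lower) | (data > upper)])     # the values 95 and -40
```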

Use of z-scores

Another way to spot anomalies is through z-scores. Z-scores measure how much a value deviates from the mean in terms of standard deviations.

The formula for converting a value to a z-score is as follows:

z = (x − μ) / σ

where x is the original value, μ is the dataset mean, and σ is the dataset standard deviation. The z-score indicates how many standard deviations the original value is from the mean. A z-score greater than 3 (or less than −3) is usually considered an outlier.

This method is particularly useful when working with large datasets and when you want to identify anomalies in an objective and reproducible way.
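Computed directly with NumPy, the |z| > 3 rule looks like this (the sample below is synthetic, generated only for illustration):

```python
import numpy as np

# Hypothetical sample: 100 values around 50, plus one extreme reading
rng = np.random.default_rng(42)
data = np.append(rng.normal(50, 5, 100), 120.0)

z = (data - data.mean()) / data.std()   # z-score of every value
print(data[np.abs(z) > 3])              # only the extreme reading survives
```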

In Python, the conversion to z-scores can be done with scikit-learn:

import numpy as np
from sklearn.preprocessing import StandardScaler

def find_outliers_zscore(data, threshold=3):
    # Standardize data to zero mean and unit variance
    scaler = StandardScaler()
    standardized = scaler.fit_transform(np.asarray(data).reshape(-1, 1)).flatten()
    # Select values whose absolute z-score exceeds the threshold
    outliers = [data[i] for i, z in enumerate(standardized) if abs(z) > threshold]
    return outliers

Use of clustering techniques

Finally, clustering techniques can be used to identify any "isolated" or "unclassifiable" data. This can be useful when working with very large and complex datasets, where data visualization is not enough to spot anomalies.

In this case, one option is to use the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, a clustering algorithm that identifies groups of data based on their density and locates any points that don't belong to any cluster. These points are considered outliers.

The DBSCAN algorithm can also be implemented with Python's scikit-learn library.

Take this visualized dataset, for example:

Image by author.

Applying DBSCAN produces this visualization:

Image by author.

The code to create these charts is as follows:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

def generate_data_with_outliers(n_samples=100, noise=0.05, outlier_fraction=0.05, random_state=42):
    # Use a seeded generator so the whole dataset is reproducible
    rng = np.random.RandomState(random_state)

    # Create two random clusters
    X = np.concatenate([rng.normal(0.5, 0.1, size=(n_samples // 2, 2)),
                        rng.normal(1.5, 0.1, size=(n_samples // 2, 2))], axis=0)

    # Add uniformly scattered outliers
    n_outliers = int(outlier_fraction * n_samples)
    outliers = rng.rand(n_outliers, 2) * 3 - 1.5
    X = np.concatenate((X, outliers), axis=0)

    # Add noise to the data to resemble real-world data
    X = X + rng.randn(n_samples + n_outliers, 2) * noise

    return X

# Generate data
X = generate_data_with_outliers(outlier_fraction=0.2)

# Apply DBSCAN to cluster the data and find outliers
dbscan = DBSCAN(eps=0.2, min_samples=5)
dbscan.fit(X)

# Select outliers
outlier_indices = np.where(dbscan.labels_ == -1)[0]

# Visualize
plt.scatter(X[:, 0], X[:, 1], c=dbscan.labels_, cmap="viridis")
plt.scatter(X[outlier_indices, 0], X[outlier_indices, 1], c="red", label="Outliers", marker="x")
plt.xticks([])
plt.yticks([])
plt.legend()
plt.show()

This code creates a DBSCAN object with the parameters eps and min_samples and fits it to the data. It then identifies as outliers the points that don't belong to any cluster, i.e. those labeled -1.

This is just one of many clustering techniques that can be used to identify anomalies. For example, a deep-learning-based approach relies on autoencoders, neural networks that exploit a compressed representation of the data to identify distinctive features in the input.

Conclusion

In this article we have seen several techniques that can be used to identify outliers in data.

We talked about data visualization, the use of descriptive statistics and z-scores, and clustering techniques.

Each of these techniques is valid and should be chosen based on the type of data you are analyzing. The important thing is to remember that identifying outliers can provide important information to improve data collection processes and to make better decisions based on the results obtained.



Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.
