When you work in a certain field for long enough, there are some classes, concepts, lessons, and teachers that you will always remember.
For example, my mom is a teacher, and she remembers the substitute teacher who made her fall in love with philosophy for the first time. My Tae Kwon Do master will always remember his first Tae Kwon Do class as a kid and the excitement that mounted inside him.
I am a Machine Learning Engineer. Professionally speaking, Machine Learning is the thing I love the most, and it's probably the subject I know best.
A class I will always remember is the one where my first Machine Learning professor, during my bachelor's degree, described the difference between classification and regression. An example of a classification task is identifying whether an email is spam or not given its text. An example of a regression task is predicting the price of a house based on its features (e.g. size, location, etc…).
We define a set of features as a matrix (table) X with k columns and n rows. In both classification and regression tasks, the output is a vector y with n entries (the same as the number of rows of X). The difference is that in classification tasks the entries of y are integers: referring to the previous example, y_1 = 0 means that the first email is not spam and y_1 = 1 means that it is. In regression tasks the entries of y are real numbers: referring to the house price example, y_1 = 123780 means that the predicted price of house number 1 is 123780.
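To make the shapes concrete, here is a tiny sketch in Python (the numbers are invented for illustration):

```python
import numpy as np

# X: n = 4 rows (samples), k = 2 columns (features), e.g. house size and rooms
X = np.array([[120.0, 3.0],
              [ 85.0, 2.0],
              [200.0, 5.0],
              [ 60.0, 1.0]])

# Classification target: integer labels (e.g. 1 = spam, 0 = not spam)
y_class = np.array([0, 1, 0, 1])

# Regression target: real numbers (e.g. house prices)
y_reg = np.array([123780.0, 98500.0, 310000.0, 75200.0])

print(X.shape)      # (4, 2) -> n rows, k columns
print(y_reg.shape)  # (4,)   -> one entry per row of X
```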
Now, regression tasks can be approached in multiple ways. Actually, in A LOT of ways… maybe too many to handle. With so many methods for a single problem, choosing the right one can be very hard. And with the explosion of AI and clickbait titles, it has become even harder, as many articles and papers claim to have the ultimate solution (buy my course to get the code for sale!!!!) for every single regression problem.
The truth is that every dataset calls for a specific algorithm, depending on the properties of the data and the requirements we want to meet.
This blog post aims to be a user-friendly guide to choosing the best regression method based on:
- The linearity (or polynomial linearity) of the dataset
- The complexity of the dataset
- The dimensionality of the dataset (number of columns)
- The need for a probabilistic output
For the sake of this study, we will only consider traditional machine learning methods (no Neural Networks), as we will mainly focus on small synthetic datasets.
Are you ready to rock’n’roll? 🎸 Let’s get started.
1. Regression Taxonomy
This is a "rule of thumb" taxonomy we can use with regression tasks:

Or, in words:
Is the relationship between X and y linear or polynomial?
- Yes:
  - Is k small? (where k is the number of predictors)
    - Yes: use Linear/Polynomial Regression.
    - No: use the Principal Component Regressor.
- No:
  - Is a probabilistic output needed?
    - Yes: use Gaussian Process Regression (GPR).
    - No: use the Support Vector Regressor (SVR).
In this article, I will describe each of these methods, first in words, then in Python.
2. Linear/Polynomial Regression
Let's start with Linear/Polynomial Regression.

2.1 Explanation
This is arguably the simplest case of regression, and it occurs when the input matrix X is linearly related to our target (output) Y.
In other words, if we have our X matrix and we want to find Y, our estimate will be:

Ŷ = XW + w_0
where W is the matrix of weights and w_0 is the so-called intercept.
There is a lot of debate on whether to consider this method a Machine Learning one or simply a statistical solution. This is because the optimal weight matrix W (and the intercept w_0) are found through a method known as Ordinary Least Squares (OLS). This method gives us a very simple closed-form equation for the optimal value of W:

W = (XᵀX)⁻¹ XᵀY
So the optimal set of weights is not found iteratively or numerically (as in a standard machine learning method). For this reason, this approach can be considered a purely statistical, very simple one. Despite its simplicity, the method works extremely well whenever the input and the output are related by a linear or polynomial dependency.
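Here is a minimal NumPy sketch of that closed form; the toy data and the trick of folding w_0 into W through a column of ones are choices made for illustration:

```python
import numpy as np

# Toy linear data: y = 2x + 1 plus noise (coefficients invented for illustration)
rng = np.random.default_rng(0)
X = np.linspace(1, 20, 100).reshape(-1, 1)
y = 2 * X[:, 0] + 1 + rng.normal(0, 0.5, 100)

# Prepend a column of ones so the intercept w_0 is learned as a weight
X_b = np.hstack([np.ones((X.shape[0], 1)), X])

# OLS closed form: W = (X^T X)^(-1) X^T Y
W = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(W)  # approximately [1.0, 2.0]: intercept and slope
```

(In practice, np.linalg.lstsq is the numerically safer way to solve the same system.)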
In the case of a polynomial dependency, we can apply the same exact equation that you see above, but we need to convert X into a matrix of polynomial features. This means creating additional columns in X that represent the powers of the original features (e.g., X²,X³, etc.) up to the desired degree of the polynomial.
2.2 Code
There are a bunch of notebooks that explain linear regression in detail, and implementing Linear Regression from scratch is a very good exercise.
That being said, the code is extremely simple for both linear and polynomial regression.
For linear regression, the steps are the following (a minimal sketch is shown after the list):
1. Simulate the X array (from 1 to 20, 100 points)
2. Simulate the corresponding Y array
3. Call the LinearRegression object from sklearn
4. Plot and print the results
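Here is one way those steps might look; the slope, intercept, and noise level below are invented for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# 1. Simulate the X array (from 1 to 20, 100 points)
X = np.linspace(1, 20, 100).reshape(-1, 1)

# 2. Simulate the corresponding Y array
rng = np.random.default_rng(42)
y = 3 * X[:, 0] + 5 + rng.normal(0, 2, 100)

# 3. Fit the LinearRegression object from sklearn
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

# 4. Plot and print the results
print(f"slope = {model.coef_[0]:.2f}, intercept = {model.intercept_:.2f}")
plt.scatter(X, y, s=10, label="data")
plt.plot(X, y_pred, color="red", label="linear fit")
plt.legend()
plt.show()
```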
For polynomial regression, the only difference is an extra step between steps 1 and 2, where the matrix X is transformed into a matrix X_poly = [X, X², X³, …, X^k].
As we already said, what we are essentially doing is training a linear model on polynomial features. Here is an example with a 3rd-degree polynomial:
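A possible sketch of the degree-3 case (the cubic coefficients and noise are made up for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Simulate cubic data
X = np.linspace(1, 20, 100).reshape(-1, 1)
rng = np.random.default_rng(42)
y = 0.5 * X[:, 0]**3 - 4 * X[:, 0]**2 + X[:, 0] + rng.normal(0, 30, 100)

# Extra step: transform X into [X, X^2, X^3]
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)

# Train a linear model on the polynomial features
model = LinearRegression()
model.fit(X_poly, y)

plt.scatter(X, y, s=10, label="data")
plt.plot(X, model.predict(X_poly), color="red", label="degree-3 fit")
plt.legend()
plt.show()
```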
And this is the plot:
3. Principal Component Regressor

3.1 Explanation
Now, there are cases where your dataset has a large number of columns (dimensions), which we call k.
In this case (or k-ase? lol), it is possible that your target Y is related to a Principal Component Analysis (PCA)* component of your dataset.
*Principal Component Analysis (PCA) is a widely used dimensionality-reduction technique that identifies the directions (principal components) in which the data varies the most. It transforms the original variables into a new set of orthogonal, uncorrelated variables (the principal components), effectively reducing the dimensionality while retaining most of the variance in the data. I talk about PCA hands-on in this article.
So for example, let’s take a case where k=2. I know that it’s a little weird because this method is usually used for large k, but I’m just doing this because visualizing a 2D plot is easier than visualizing a 1000D plot 🙂
3.2 Code
So in this case with 2 dimensions, let's say the first dimension is linearly correlated with the second, like this:
What we are going to do at this point is to fit the PCA on this 2D dataset. We can do this in a few lines:
- Calling the PCA from sklearn
- Fitting it on the dataset X
- Transforming the dataset and storing the result in another variable X_PCA
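A minimal sketch of these steps, assuming simulated correlated data (the coefficients are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Two linearly correlated dimensions (coefficients invented for illustration)
rng = np.random.default_rng(0)
x1 = np.linspace(0, 10, 200)
x2 = 2 * x1 + rng.normal(0, 1, 200)
X = np.column_stack([x1, x2])

# Fit PCA on the 2D dataset and transform it
pca = PCA(n_components=2)
X_PCA = pca.fit_transform(X)

plt.scatter(X_PCA[:, 0], X_PCA[:, 1], s=10)
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.show()
```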
That's the plot, and as we can see, these two variables are now uncorrelated (which is exactly what a well-done PCA should produce):
Now, in this toy example, let's consider the case where our target Y is correlated with PCA Component 2. In that case, if we just try to run our linear regression on the raw features, we are cooked (as the youngsters say), because this is what it looks like:
No chance of getting any linear regression out of this bad boy. Nonetheless, if we plot the PCA Components vs Y we immediately see something that we are going to like a lot:
There is a clear relationship between PCA Component 2 and our target Y. This means we can easily do the linear regression not on k=2 variables but on a single variable (k'=1).
This is how we do it:
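A possible sketch, regenerating the same toy data and fitting on PCA Component 2 (the linear relationship between Y and the component is an assumption of this toy example):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Same correlated 2D dataset as above
rng = np.random.default_rng(0)
x1 = np.linspace(0, 10, 200)
x2 = 2 * x1 + rng.normal(0, 1, 200)
X = np.column_stack([x1, x2])
X_PCA = PCA(n_components=2).fit_transform(X)

# Toy target correlated with PCA Component 2 (relationship invented for the example)
y = 5 * X_PCA[:, 1] + rng.normal(0, 0.5, 200)

# Linear regression on a single variable: k' = 1
model = LinearRegression().fit(X_PCA[:, [1]], y)

order = np.argsort(X_PCA[:, 1])
plt.scatter(X_PCA[:, 1], y, s=10, label="data")
plt.plot(X_PCA[order, 1], model.predict(X_PCA[:, [1]])[order], color="red", label="fit")
plt.xlabel("PCA Component 2")
plt.legend()
plt.show()
```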
And this is the plot. Smoother than ever 🙂
4. Gaussian Process Regressor

4.1 Explanation
My wife always says "better safe than sorry": sometimes you don't just want the prediction of the target Y at a given X, you also want the uncertainty of that prediction. In other words, we want boundaries within which we can confidently locate our prediction.
A way to do that is Gaussian Process Regression (GPR). A "Gaussian process" is a collection of random variables, any finite number of which follow a joint Gaussian distribution.
How do we use these joint Gaussian distributions? The predictive distribution for a new point x_test is obtained from the (n+1)-dimensional joint Gaussian distribution over the outputs of the n training cases and the test case, by conditioning on the observed targets in the training set.
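In formulas, writing K for the n×n kernel matrix on the training inputs, k_* for the kernel vector between x_test and the training inputs, and σ² for the noise variance, this conditioning gives the standard predictive mean and variance:

μ_* = k_*ᵀ (K + σ²I)⁻¹ y
σ_*² = k(x_test, x_test) − k_*ᵀ (K + σ²I)⁻¹ k_*

The mean μ_* is the prediction, and σ_* gives the uncertainty band around it.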
I talked about Gaussian Process Regression in this blog post.
4.2 Code
Everything can be done in a few lines of code:
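A minimal sketch with scikit-learn; the RBF kernel and the toy sine data are assumptions made for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Nonlinear toy data (the sine signal is invented for illustration)
rng = np.random.default_rng(1)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 50)

# RBF kernel plus a noise term
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.01)
gpr = GaussianProcessRegressor(kernel=kernel, random_state=0)
gpr.fit(X, y)

# Predict the mean and the standard deviation (the uncertainty)
X_test = np.linspace(0, 10, 200).reshape(-1, 1)
y_mean, y_std = gpr.predict(X_test, return_std=True)

plt.scatter(X, y, s=10, label="data")
plt.plot(X_test, y_mean, color="red", label="GPR mean")
plt.fill_between(X_test[:, 0], y_mean - 2 * y_std, y_mean + 2 * y_std,
                 alpha=0.2, label="±2 std")
plt.legend()
plt.show()
```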
Now, as you can see, the input is not linear or polynomial, but it can be modeled using Gaussian Process Regression, so this method can be used for nonlinear predictions as well.
GPR is particularly known for the difficulty of choosing the best kernel, which is, in statistical terms, the covariance function we use to compute our predictions. A guide on how to choose the best kernel can be found here.
5. Support Vector Regressor (SVR)

5.1 Explanation
Support Vector Regression (SVR) is widely used in cases where the relationship between input and output is nonlinear and complex.
A key property of SVR, and of Support Vector Machines (SVMs) in general, is that they use the concept of margin as a confidence level for the predictions. Let me explain it further.
Let's consider a classification algorithm where you use a line to distinguish elements of class A from elements of class B. Imagine that I give you this line: everything to its right is classified as "A" and everything to its left is classified as "B". Now, if a point is very close to the line (on either side), it would be very risky to classify it, because a minor shift to the left or the right would completely change the predicted class. In a few words, you don't want your decision margin to be small.
The idea of SVR is to maximize that margin, so that you are as confident as possible in the value you are predicting. SVR can efficiently handle high-dimensional data and can be customized with different kernel functions (such as linear, polynomial, and radial basis function kernels) to capture complex nonlinear relationships within the data.
5.2 Code
The code for SVR, applied to a very complex relationship between X and Y, is the following:
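A minimal sketch, assuming an RBF kernel and a made-up oscillating signal as the "complex" relationship:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVR

# A complex nonlinear relationship (this function is invented for illustration)
rng = np.random.default_rng(2)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(X[:, 0]) * np.exp(0.2 * X[:, 0]) + rng.normal(0, 0.3, 200)

# RBF-kernel SVR; C and epsilon control regularization and the margin width
svr = SVR(kernel="rbf", C=100, epsilon=0.1)
svr.fit(X, y)

plt.scatter(X, y, s=10, label="data")
plt.plot(X, svr.predict(X), color="red", label="SVR fit")
plt.legend()
plt.show()
```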

And as we can see, SVR does a pretty good job of capturing this behavior:
6. Other methods
I want to come clean: this taxonomy is short and sweet. Being short and sweet, it cannot be fully comprehensive of all the methodologies out there (that would take a 15+ page paper).
I want to give you three more methods to do regression tasks though, just so you know that there is more out there:
- XGBoost Regressor. This is a very popular one and it uses a boosting framework where models are trained sequentially. Each new model attempts to correct the errors made by the previous ones.
- Decision Trees and Random Forest. Decision trees split data into subsets based on feature values, forming a tree structure for predictions. Random Forests improve accuracy by combining the predictions of many randomized trees, which also helps reduce overfitting (a minimal sketch follows this list).
- Neural Networks. This one is super well known and it's in the news every day 🙂 Neural networks consist of interconnected layers of nodes that can learn complex patterns, making them versatile for modeling non-linear relationships in regression tasks.
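As a taste, here is a minimal Random Forest sketch on toy data (the sine signal is invented for illustration; XGBoost's XGBRegressor follows essentially the same fit/predict pattern):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy nonlinear data (invented for illustration)
rng = np.random.default_rng(3)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 200)

# An ensemble of 100 trees; averaging their predictions reduces overfitting
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X, y)
print(rf.predict([[5.0]]))  # prediction near sin(5) ≈ -0.96
```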
7. Conclusions
In this blog post, we talked about regression methods. In particular, we did this:
- We described the difference between regression and classification.
- We identified the things to consider to determine the best regression algorithm.
- We built a taxonomy based on problem complexity, the number of columns (features) of the input, and the need for a probabilistic output.
- We described the Linear/Polynomial Regression method, the Principal Component Regressor, Gaussian Process Regression, and Support Vector Regression, with examples and coding exercises.
- We briefly described other regression methods for context.
8. About me!
Thank you again for your time. It means a lot ❤
My name is Piero Paialunga and I’m this guy here:

Image made by author
I am a Ph.D. candidate at the University of Cincinnati Aerospace Engineering Department and a Machine Learning Engineer for Gen Nine. I talk about AI and Machine Learning in my blog posts and on LinkedIn. If you liked the article and want to know more about machine learning and follow my studies, you can:
A. Follow me on LinkedIn, where I publish all my stories.
B. Subscribe to my newsletter. It will keep you updated about new stories and give you the chance to text me with any corrections or doubts you may have.
C. Become a referred member, so you won't have any "maximum number of stories for the month" and you can read whatever I (and thousands of other top Machine Learning and Data Science writers) write about the newest technology available.
D. Want to work with me? Check my rates and projects on Upwork!
If you want to ask me questions or start a collaboration, leave a message here or on LinkedIn: