Dummy Regressor, Explained: A Visual Guide with Code Examples for Beginners

Naively choosing the best number for all of your predictions

REGRESSION ALGORITHM

My students often tell me they want to try the most sophisticated model available for their machine learning task, and I sometimes jokingly ask, "Have you tried the best ever model first?" Especially in regression (where we don’t have that "100% accuracy" goal), some machine learning models seem to achieve a low error score, but when you compare them with the dummy model, they are actually… not that great.

So, here’s the dummy regressor. Just like classification, regression has its own baseline model: the first model you should try, to get a rough idea of how much better your machine learning model could be.

All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.

Definition

A dummy regressor is a simple machine learning model that predicts numerical values using basic rules, without actually learning from the input data. Like its classification counterpart, it serves as a baseline for comparing the performance of more complex regression models. The dummy regressor helps us understand if our models are actually learning useful patterns or just making naive predictions.

Dummy Regressor is the simplest machine learning model imaginable.

📊 Dataset & Libraries

Throughout this article, we’ll use this simple artificial golf dataset as an example. The task is to predict the number of golfers visiting our golf course. The dataset includes features like outlook, temperature, humidity, and wind, with the target variable being the number of golfers.

Columns: 'Outlook', 'Temperature' (in Fahrenheit), 'Humidity' (in %), 'Wind' (Yes/No) and 'Number of Players' (numerical, target feature)
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Create dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
    'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
    'Num_Players': [52,39,43,37,28,19,43,47,56,33,49,23,42,13,33,29,25,51,41,14,34,29,49,36,57,21,23,41]
}
df = pd.DataFrame(dataset_dict)

# One-hot encode 'Outlook' column
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)

# Convert 'Wind' column to binary
df['Wind'] = df['Wind'].astype(int)

# Split data into features and target, then into training and test sets
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

Evaluating Regression Result

Before getting into the dummy regressor itself, let’s recap how to evaluate a regression result. In classification, checking the accuracy of the model is intuitive (just take the ratio of matching values); in regression, it is a bit different.

RMSE (root mean squared error) is like a score for regression models. It tells us how far off our predictions are from the actual values. Just as we want high accuracy in classification to get more right answers, we want a low RMSE in regression to be closer to the true values.

People like using RMSE because it is expressed in the same units as the value we’re trying to predict.

An RMSE of 3 can be read, roughly, as the predictions being off from the actual values by about 3 on average, with large errors counting more than small ones.
from sklearn.metrics import mean_squared_error

y_true = np.array([10, 15, 20, 15, 10]) # True values
y_pred = np.array([15, 11, 18, 14, 10]) # Predicted values

# Calculate RMSE as the square root of scikit-learn's MSE
# (the older `squared=False` shortcut is deprecated since scikit-learn 1.4)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"RMSE = {rmse:.2f}")
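To make the metric concrete, the same number can be reproduced by hand. The following is a minimal NumPy sketch of the RMSE formula (square the errors, average them, take the square root), reusing the toy values from the example above. The helper name `rmse_by_hand` is only illustrative and is not part of any library.

```python
import numpy as np

def rmse_by_hand(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

y_true = np.array([10, 15, 20, 15, 10])  # True values
y_pred = np.array([15, 11, 18, 14, 10])  # Predicted values

# Errors are -5, 4, 2, 1, 0 -> squared: 25, 16, 4, 1, 0 -> mean: 9.2
print(f"RMSE = {rmse_by_hand(y_true, y_pred):.2f}")  # RMSE = 3.03
```

Because the errors are squared before averaging, the single miss of 5 contributes more than half of the total, which is why RMSE penalizes large errors more heavily than small ones.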

With that in mind, let’s get into the algorithm.

Main Mechanism

Dummy Regressor makes predictions based on simple rules, such as always returning the mean or median of the target values in the training data.

For our golf dataset, a dummy regressor might always predict "40.5" for the number of players, as that is the median of the training labels.
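The whole mechanism fits in a few lines. Below is a rough from-scratch sketch, assuming a hypothetical class named `MedianBaseline` (this is not scikit-learn's implementation): `fit` stores one number computed from the training labels, and `predict` repeats it for every row without ever looking at the features.

```python
import numpy as np

class MedianBaseline:
    """Toy stand-in for a dummy regressor (illustrative, not scikit-learn's code)."""

    def fit(self, X, y):
        # X is accepted for API familiarity but never looked at
        self.constant_ = float(np.median(y))
        return self

    def predict(self, X):
        # One identical prediction per input row
        return np.full(len(X), self.constant_)

# First 14 'Num_Players' values, i.e. the training half of the golf dataset
y_train = [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13]
X_train = np.zeros((len(y_train), 1))  # placeholder features; their values do not matter

model = MedianBaseline().fit(X_train, y_train)
print(model.predict(np.zeros((3, 1))))  # [40.5 40.5 40.5]
```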

Training Steps

It’s a bit of a stretch to say that a dummy regressor has any training process, but here’s a general outline:

1. Select Strategy

Choose one of the following strategies:

  • Mean: Always predicts the mean of the training target values.
  • Median: Always predicts the median of the training target values.
  • Constant: Always predicts a constant value provided by the user.
Depending on the strategy, the dummy regressor makes a different numerical prediction.
from sklearn.dummy import DummyRegressor

# Choose a strategy for your DummyRegressor ('mean', 'median', 'constant')
strategy = 'median'

2. Calculate the Metric

Calculate either the mean or the median of the training target values, depending on your strategy.

The algorithm simply calculates the median of the training data; in this case we get 40.5.
# Initialize the DummyRegressor
dummy_reg = DummyRegressor(strategy=strategy)

# "Train" the DummyRegressor (although no real training happens)
dummy_reg.fit(X_train, y_train)
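After fitting, scikit-learn's `DummyRegressor` stores the computed value in its `constant_` attribute, which is a quick way to confirm what the model will predict. Here is a small self-contained sketch, with the training labels copied from the dataset above and a placeholder feature matrix used as an assumption (the features play no role in the result):

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Training half of 'Num_Players' from the golf dataset
y_train = np.array([52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13])
X_placeholder = np.zeros((len(y_train), 1))  # features are ignored by DummyRegressor

dummy_reg = DummyRegressor(strategy="median").fit(X_placeholder, y_train)

# constant_ holds the value that predict() will repeat for every row
learned_value = float(np.ravel(dummy_reg.constant_)[0])
print(learned_value)  # 40.5
```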

3. Apply Strategy to Test Data

Use the chosen strategy to generate a list of predicted numerical labels for your test data.

If we choose the "median" strategy, the calculated median (40.5) will simply be the prediction for everything.
# Use the DummyRegressor to make predictions
y_pred = dummy_reg.predict(X_test)
print("Label     :",list(y_test))
print("Prediction:",list(y_pred))

Evaluate the Model

The dummy regressor with this strategy gives an RMSE of 13.28, which serves as the baseline for future models.
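# Evaluate the Dummy Regressor's error
from sklearn.metrics import mean_squared_error

# RMSE is the square root of the MSE
# (the older `squared=False` shortcut is deprecated since scikit-learn 1.4)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Dummy Regression Error: {rmse:.2f}")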

Key Parameters

The dummy regressor has one main parameter, plus a companion that only matters for one strategy:

  1. Strategy: This determines how the regressor makes predictions. Common options include:

    • mean: Provides an average baseline, commonly used for general scenarios.
    • median: More robust against outliers, good for skewed target distributions.
    • constant: Useful when domain knowledge suggests a specific constant prediction.
  2. Constant: When using the ‘constant’ strategy, this parameter specifies the value to always predict.
Regardless of the strategy used, the results are all similarly poor, but our next regression model should certainly reach an RMSE lower than about 12.
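To see how the strategies compare on the golf data, the following sketch fits one dummy regressor per strategy and reports the test RMSE of each. It is self-contained, so the target values are copied from the dataset and split in half in the same order as before; the placeholder feature matrices and the constant value 40 are arbitrary assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

num_players = np.array([52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13,
                        33, 29, 25, 51, 41, 14, 34, 29, 49, 36, 57, 21, 23, 41])
y_train, y_test = num_players[:14], num_players[14:]  # same unshuffled 50/50 split
X_train = np.zeros((14, 1))  # placeholder features (ignored by the model)
X_test = np.zeros((14, 1))

candidates = {
    "mean": DummyRegressor(strategy="mean"),
    "median": DummyRegressor(strategy="median"),
    "constant=40": DummyRegressor(strategy="constant", constant=40),  # arbitrary guess
}

results = {}
for name, model in candidates.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = float(np.sqrt(mean_squared_error(y_test, y_pred)))
    print(f"{name:<12} RMSE = {results[name]:.2f}")
```

The median strategy reproduces the 13.28 reported earlier, and the mean strategy lands slightly lower on this particular split. Whichever baseline scores best is the number a real model has to beat.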

Pros and Cons

As a lazy predictor, the dummy regressor certainly has its strengths and limitations.

Pros:

  1. Easy Benchmark: Quickly shows the minimum performance other models should beat.
  2. Fast: Takes no time to set up and run.

Cons:

  1. Doesn’t Learn: Just uses simple rules, so it’s often outperformed by real models.
  2. Ignores Features: Doesn’t consider any input data when making predictions.

Final Remarks

Using a dummy regressor should be the first step whenever we have a regression task. It provides a standard baseline, so that we can be sure a more complex model actually gives better results than a naive constant prediction. As you learn more advanced techniques, never forget to compare your models against these simple baselines; these naive predictions might be what you need first!
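As a rough illustration of that comparison habit, the sketch below scores a baseline and a real model side by side. Everything about the setup is an assumption made for the example: the data is synthetic (a noisy linear relationship), `LinearRegression` stands in for "a more complex model", and the helper `rmse_of` is hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data (an assumption for illustration): y depends linearly on x plus noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 1.0, size=200)
X_train, X_test, y_train, y_test = X[:100], X[100:], y[:100], y[100:]

def rmse_of(model):
    """Fit on the training half and return RMSE on the test half."""
    y_pred = model.fit(X_train, y_train).predict(X_test)
    return float(np.sqrt(mean_squared_error(y_test, y_pred)))

baseline_rmse = rmse_of(DummyRegressor(strategy="mean"))
model_rmse = rmse_of(LinearRegression())

print(f"Baseline RMSE: {baseline_rmse:.2f}")
print(f"Model RMSE   : {model_rmse:.2f}")
print(f"Improvement over baseline: {1 - model_rmse / baseline_rmse:.0%}")
```

Reporting the improvement relative to the baseline, rather than the raw RMSE alone, makes it harder to be impressed by a score that a constant prediction could nearly match.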

🌟 Dummy Regressor Code Summarized

# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.dummy import DummyRegressor

# Create dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
    'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
    'Num_Players': [52,39,43,37,28,19,43,47,56,33,49,23,42,13,33,29,25,51,41,14,34,29,49,36,57,21,23,41]
}
df = pd.DataFrame(dataset_dict)

# One-hot encode 'Outlook' column
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)

# Convert 'Wind' column to binary
df['Wind'] = df['Wind'].astype(int)

# Split data into features and target, then into training and test sets
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Initialize and train the model
dummy_reg = DummyRegressor(strategy='median')
dummy_reg.fit(X_train, y_train)

# Make predictions
y_pred = dummy_reg.predict(X_test)

# Calculate and print RMSE
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")

Further Reading

For a detailed explanation of the DummyRegressor and its implementation in scikit-learn, readers can refer to the official documentation, which provides comprehensive information on its usage and parameters.

Technical Environment

This article uses Python 3.7 and scikit-learn 1.5. While the concepts discussed are generally applicable, specific code implementations may vary slightly with different versions.

About the Illustrations

Unless otherwise noted, all images are created by the author, incorporating licensed design elements from Canva Pro.


