Publish AI, ML & data-science insights to a global community of data professionals.

Decision Tree Regressor, Explained: A Visual Guide with Code Examples

Trimming branches smartly with Cost-Complexity Pruning

REGRESSION ALGORITHM

Decision Tree Classifier, Explained: A Visual Guide with Code Examples for Beginners

Decision Trees aren’t limited to categorizing data – they’re equally good at predicting numerical values! Classification trees often steal the spotlight, but Decision Tree Regressors (or Regression Trees) are powerful and versatile tools in the world of continuous variable prediction.

While we’ll discuss the mechanics of regression tree construction (which are mostly similar to the classification tree), here, we’ll also advance beyond the pre-pruning methods like "minimal sample leaf" and "max tree depth" introduced in the classifier article. We’ll explore the most common post-pruning method which is cost complexity pruning, that introduces a complexity parameter to the decision tree’s cost function.

All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.
All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.

Definition

A Decision Tree for regression is a model that predicts numerical values using a tree-like structure. It splits data based on key features, starting from a root question and branching out. Each node asks about a feature, dividing data further until reaching leaf nodes with final predictions. To get a result, you follow the path matching your data’s features from root to leaf.

Decision Trees for regression predict numerical outcomes by following a series of data-driven questions, narrowing down to a final value.
Decision Trees for regression predict numerical outcomes by following a series of data-driven questions, narrowing down to a final value.

📊 Dataset Used

To demonstrate our concepts, we’ll work with our standard dataset. This dataset is used to predict the number of golfers visiting on a given day and includes variables like weather outlook, temperature, humidity, and wind conditions.

Columns: 'Outlook' (one-hot encoded to sunny, overcast, rain), 'Temperature' (in Fahrenheit), 'Humidity' (in %), 'Wind' (Yes/No) and 'Number of Players' (numerical, target feature)
Columns: ‘Outlook’ (one-hot encoded to sunny, overcast, rain), ‘Temperature’ (in Fahrenheit), ‘Humidity’ (in %), ‘Wind’ (Yes/No) and ‘Number of Players’ (numerical, target feature)
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Create dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
    'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
    'Num_Players': [52, 39, 43, 37, 28, 19, 43, 47, 56, 33, 49, 23, 42, 13, 33, 29, 25, 51, 41, 14, 34, 29, 49, 36, 57, 21, 23, 41]
}

df = pd.DataFrame(dataset_dict)

# One-hot encode 'Outlook' column
df = pd.get_dummies(df, columns=['Outlook'],prefix='',prefix_sep='')

# Convert 'Wind' column to binary
df['Wind'] = df['Wind'].astype(int)

# Rearrange columns
column_order = ['sunny', 'overcast', 'rain', 'Temperature', 'Humidity', 'Wind', 'Num_Players']
df = df[column_order]

# Split features and target
X, y = df.drop('Num_Players', axis=1), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

Main Mechanism

The Decision Tree for regression operates by recursively dividing the data based on features that best reduce prediction error. Here’s the general process:

  1. Begin with the entire dataset at the root node.
  2. Choose the feature that minimizes a specific error metric (such as mean squared error or variance) to split the data.
  3. Create child nodes based on the split, where each child represents a subset of the data aligned with the corresponding feature values.
  4. Repeat steps 2–3 for each child node, continuing to split the data until a stopping condition is reached.
  5. Assign a final predicted value to each leaf node, typically the average of the target values in that node.

Training Steps

We will explore the regression part in the decision tree algorithm CART (Classification and Regression Trees). It builds binary trees and typically follows these steps:

1.Begin with all training samples in the root node.

2.For each feature in the dataset: a. Sort the feature values in ascending order. b. Consider all midpoints between adjacent values as potential split points.

In total, there are 23 split points to check.
In total, there are 23 split points to check.
  1. For each potential split point: a. Calculate the mean squared error (MSE) of the current node. b. Compute the weighted average of errors for the resulting split.
As an example, here, we calculated the weighted average of MSE for split point "Temperature" with value 73.5
As an example, here, we calculated the weighted average of MSE for split point "Temperature" with value 73.5
def calculate_split_mse(X_train, y_train, feature_name, split_point):
    # Create DataFrame and sort by feature
    analysis_df = pd.DataFrame({
        'feature': X_train[feature_name],
        'y_actual': y_train
    }).sort_values('feature')

    # Split data and calculate means
    left_mask = analysis_df['feature'] <= split_point
    left_mean = analysis_df[left_mask]['y_actual'].mean()
    right_mean = analysis_df[~left_mask]['y_actual'].mean()

    # Calculate squared differences
    analysis_df['squared_diff'] = np.where(
        left_mask,
        (analysis_df['y_actual'] - left_mean) ** 2,
        (analysis_df['y_actual'] - right_mean) ** 2
    )

    # Calculate MSEs and counts
    left_mse = analysis_df[left_mask]['squared_diff'].mean()
    right_mse = analysis_df[~left_mask]['squared_diff'].mean()
    n_left = sum(left_mask)
    n_right = len(analysis_df) - n_left

    # Calculate weighted average MSE
    weighted_mse = (n_left * left_mse + n_right * right_mse) / len(analysis_df)

    # Print results
    print(analysis_df)
    print(f"nResults for split at {split_point} on feature '{feature_name}':")
    print(f"Left child MSE (n={n_left}, mean={left_mean:.2f}): {left_mse:.2f}")
    print(f"Right child MSE (n={n_right}, mean={right_mean:.2f}): {right_mse:.2f}")
    print(f"Weighted average MSE: {weighted_mse:.2f}")

# Example usage:
calculate_split_mse(X_train, y_train, 'Temperature', 73.5)
  1. After evaluating all features and split points, select the one with lowest weighted average of MSE.
def evaluate_all_splits(X_train, y_train):
    """Evaluate all possible split points using midpoints for all features"""
    results = []

    for feature in X_train.columns:
        data = pd.DataFrame({'feature': X_train[feature], 'y_actual': y_train})
        splits = [(a + b)/2 for a, b in zip(sorted(data['feature'].unique())[:-1], 
                                          sorted(data['feature'].unique())[1:])]

        for split in splits:
            left_mask = data['feature'] <= split
            n_left = sum(left_mask)

            if not (0 < n_left < len(data)): continue

            left_mean = data[left_mask]['y_actual'].mean()
            right_mean = data[~left_mask]['y_actual'].mean()

            left_mse = ((data[left_mask]['y_actual'] - left_mean) ** 2).mean()
            right_mse = ((data[~left_mask]['y_actual'] - right_mean) ** 2).mean()

            weighted_mse = (n_left * left_mse + (len(data) - n_left) * right_mse) / len(data)

            results.append({'Feature': feature, 'Split_Point': split, 'Weighted_MSE': weighted_mse})

    return pd.DataFrame(results).round(2)

# Example usage:
results = evaluate_all_splits(X_train, y_train)
print(results)
  1. Create two child nodes based on the chosen feature and split point:
    • Left child: samples with feature value <= split point
    • Right child: samples with feature value > split point
  1. Recursively repeat steps 2–5 for each child node. (Continue until a stopping criterion is met.)
  1. At each leaf node, assign the average target value of the samples in that node as the prediction.
from sklearn.tree import DecisionTreeRegressor, plot_tree
import matplotlib.pyplot as plt

# Train the model
regr = DecisionTreeRegressor(random_state=42)
regr.fit(X_train, y_train)

# Visualize the decision tree
plt.figure(figsize=(26,8))
plot_tree(regr, feature_names=X.columns, filled=True, rounded=True, impurity=False, fontsize=16, precision=2)
plt.tight_layout()
plt.show()
In this scikit-learn output, the samples and values are shown for the leaf nodes and interim nodes.
In this scikit-learn output, the samples and values are shown for the leaf nodes and interim nodes.

Regression/Prediction Step

Here’s how a regression tree makes predictions for new data:

  1. Start at the top (root) of the tree.
  2. At each decision point (node):
    • Look at the feature and split value.
    • If the data point’s feature value is smaller or equal, go left.
    • If it’s larger, go right.
  3. Keep moving down the tree until you reach the end (a leaf).
  4. The prediction is the average value stored in that leaf.

Evaluation Step

This value of RMSE is so much better than the result of the dummy regressor.
This value of RMSE is so much better than the result of the dummy regressor.

Pre-pruning vs Post-pruning

After building the tree, the only thing we need to worry about is the method to make the tree smaller to prevent overfitting. In general, the method of pruning can be categorized as:

Pre-pruning

Pre-pruning, also known as early stopping, involves halting the growth of a decision tree during the training process based on certain predefined criteria. This approach aims to prevent the tree from becoming too complex and overfitting the training data. Common pre-pruning techniques include:

  1. Maximum depth: Limiting how deep the tree can grow.
  2. Minimum samples for split: Requiring a minimum number of samples to justify splitting a node.
  3. Minimum samples per leaf: Ensuring each leaf node has at least a certain number of samples.
  4. Maximum number of leaf nodes: Restricting the total number of leaf nodes in the tree.
  5. Minimum impurity decrease: Only allowing splits that decrease impurity by a specified amount.

These methods stop the tree’s growth when the specified conditions are met, effectively "pruning" the tree during its construction phase. (We have discussed these methods before, which is exactly the same in regression case.)

Post-pruning

Post-pruning, on the other hand, allows the decision tree to grow to its full extent and then prunes it back to reduce complexity. This approach first builds a complete tree and then removes or collapses branches that don’t significantly contribute to the model’s performance. One common post-pruning technique is called Cost-Complexity Pruning.

Cost Complexity Pruning

Step 1: Calculate the Impurity for Each Node

For each interim node, calculate the impurity (MSE for regression case). We then sorted this value from the lowest to highest.

# Visualize the decision tree
plt.figure(figsize=(26,8))
plot_tree(regr, feature_names=X.columns, filled=True, rounded=True, impurity=True, fontsize=16, precision=2)
plt.tight_layout()
plt.show()
In this scikit learn output, the impurity are shown as "squared_error" for each nodes.
In this scikit learn output, the impurity are shown as "squared_error" for each nodes.
Let's give name to these interim nodes (from A-J). We then sort it based on their MSE, from lowest to highest
Let’s give name to these interim nodes (from A-J). We then sort it based on their MSE, from lowest to highest

Step 2: Create Subtrees by Trimming The Weakest Link

The goal is to gradually turn the interim nodes into leaves starting from the node with the lowest MSE (= weakest link). We can create a path of pruning based on that.

Let's name them "Subtree i" based on how many times (i) it is being pruned. Starting from the original tree, the tree will be pruned on the node with lowest MSE (starting from node J, M (already got cut by J), L, K, and so on)
Let’s name them "Subtree i" based on how many times (i) it is being pruned. Starting from the original tree, the tree will be pruned on the node with lowest MSE (starting from node J, M (already got cut by J), L, K, and so on)

Step 3: Calculate Total Leaf Impurities for Each Subtree

For each subtree T, total leaf impurities (R(T)) can be calculated as:

R(T) = (1/N) Σ I(L) * _n__L

where: · L ranges over all leaf nodes · _nL is the number of samples in leaf L· N is the total number of samples in the tree · I(L) is the impurity (MSE) of leaf L

The more we prune, the higher the total leaf impurities.
The more we prune, the higher the total leaf impurities.

Step 4: Compute the Cost Function

To control when to stop turning the interim nodes into leaves, we check the cost complexity first for each subtree T using the following formula:

Cost(T) = R(T) + α * |T|

where: · R(T) is the total leaf impurities · |T| is the number of leaf nodes in the subtree__· α is the complexity parameter

Step 5: Select the Alpha

The value of alpha control which subtree we will end up with. The subtree with the lowest cost will be the final tree.

When α is small, we care more about accuracy (bigger trees). When α is large, we care more about simplicity (smaller trees)
When α is small, we care more about accuracy (bigger trees). When α is large, we care more about simplicity (smaller trees)

While we can freely set the α, in scikit-learn, you can also get the smallest value of α to obtain a particular subtree. This is called effective _α._

This effective α can also be computed.
This effective α can also be computed.
# Compute the cost-complexity pruning path
tree = DecisionTreeRegressor(random_state=42)
effective_alphas = tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas
impurities = tree.cost_complexity_pruning_path(X_train, y_train).impurities

# Function to count leaf nodes
count_leaves = lambda tree: sum(tree.tree_.children_left[i] == tree.tree_.children_right[i] == -1 for i in range(tree.tree_.node_count))

# Train trees and count leaves for each complexity parameter
leaf_counts = [count_leaves(DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X_train_scaled, y_train)) for alpha in effective_alphas]

# Create DataFrame with analysis results
pruning_analysis = pd.DataFrame({
    'total_leaf_impurities': impurities,
    'leaf_count': leaf_counts,
    'cost_function': [f"{imp:.3f} + {leaves}α" for imp, leaves in zip(impurities, leaf_counts)],
    'effective_α': effective_alphas
})

print(pruning_analysis)

Final Remarks

Pre-pruning methods are generally faster and more memory-efficient, as they prevent the tree from growing too large in the first place.

Post-pruning can potentially create more optimal trees, as it considers the entire tree structure before making pruning decisions. However, it can be more computationally expensive.

Both approaches aim to find a balance between model complexity and performance, with the goal of creating a model that generalizes well to unseen data. The choice between pre-pruning and post-pruning (or a combination of both) often depends on the specific dataset, the problem at hand, and of course, computational resources available.

In practice, it’s common to use a combination of these methods, like applying some pre-pruning criteria to prevent excessively large trees, and then using post-pruning for fine-tuning the model’s complexity.

🌟 Decision Tree Regressor (with Cost Complexity Pruning) Code Summarized

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler

# Create dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rain', 'rain', 'rain', 'overcast', 'sunny', 'sunny', 'rain', 'sunny', 'overcast', 'overcast', 'rain', 'sunny', 'overcast', 'rain', 'sunny', 'sunny', 'rain', 'overcast', 'rain', 'sunny', 'overcast', 'sunny', 'overcast', 'rain', 'overcast'],
    'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
    'Num_Players': [52,39,43,37,28,19,43,47,56,33,49,23,42,13,33,29,25,51,41,14,34,29,49,36,57,21,23,41]
}

df = pd.DataFrame(dataset_dict)

# One-hot encode 'Outlook' column
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)

# Convert 'Wind' column to binary
df['Wind'] = df['Wind'].astype(int)

# Split data into features and target, then into training and test sets
X, y = df.drop(columns='Num_Players'), df['Num_Players']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Initialize Decision Tree Regressor
tree = DecisionTreeRegressor(random_state=42)

# Get the cost complexity path, impurities, and effective alpha
path = tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
print(ccp_alphas)
print(impurities)

# Train the final tree with the chosen alpha
final_tree = DecisionTreeRegressor(random_state=42, ccp_alpha=0.1)
final_tree.fit(X_train_scaled, y_train)

# Make predictions
y_pred = final_tree.predict(X_test)

# Calculate and print RMSE
rmse = root_mean_squared_error(y_test, y_pred)
print(f"RMSE: {rmse:.4f}")

Further Reading

For a detailed explanation of the Decision Tree Regressor, Cost Complexity Pruning, and its implementation in scikit-learn, readers can refer to their official documentation. It provides comprehensive information on their usage and parameters.

Technical Environment

This article uses Python 3.7 and scikit-learn 1.5. While the concepts discussed are generally applicable, specific code implementations may vary slightly with different versions

About the Illustrations

Unless otherwise noted, all images are created by the author, incorporating licensed design elements from Canva Pro.

𝙎𝙚𝙚 𝙢𝙤𝙧𝙚 𝙍𝙚𝙜𝙧𝙚𝙨𝙨𝙞𝙤𝙣 𝘼𝙡𝙜𝙤𝙧𝙞𝙩𝙝𝙢𝙨 𝙝𝙚𝙧𝙚:

Regression Algorithms

𝙔𝙤𝙪 𝙢𝙞𝙜𝙝𝙩 𝙖𝙡𝙨𝙤 𝙡𝙞𝙠𝙚:

Classification Algorithms


Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

Write for TDS

Related Articles