{"id":606509,"date":"2025-07-07T14:49:04","date_gmt":"2025-07-07T19:49:04","guid":{"rendered":"https:\/\/towardsdatascience.com\/?p=606509"},"modified":"2025-07-07T14:49:27","modified_gmt":"2025-07-07T19:49:27","slug":"build-algorithm-agnostic-ml-pipelines-in-a-breeze","status":"publish","type":"post","link":"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/","title":{"rendered":"Build Algorithm-Agnostic ML Pipelines in a\u00a0Breeze"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\"><mdspan datatext=\"el1751561045478\" class=\"mdspan-comment\">This is my 3rd article<\/mdspan> on the topic of algorithm-agnostic model building. You can find the previous two articles published on TDS below.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/towardsdatascience.com\/algorithm-agnostic-model-building-with-mlflow-b106a5a29535\/\">Algorithm-Agnostic Model Building with MLflow<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/medium.com\/data-science\/explainable-generic-ml-pipeline-with-mlflow-2494ca1b3f96\">Explainable Generic ML Pipeline with MLflow<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">After writing these two articles, I continued to develop the framework, and it gradually evolved into something much larger than I originally envisioned. Rather than squeezing everything into another article, I decided to package it as an open-source Python library called MLarena to share with fellow data and ML practitioners. MLarena is an algorithm-agnostic machine learning toolkit that supports model training, diagnostics, and optimization.<\/p>\n\n\n\n<p class=\"has-text-align-center wp-block-paragraph\">\ud83d\udd17You can find the full codebase on GitHub: <a href=\"https:\/\/github.com\/MenaWANG\/mlarena\">MLarena repo<\/a> \ud83e\uddf0<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">At its core, MLarena is implemented as a custom <code>mlflow.pyfunc<\/code> model. 
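<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For readers unfamiliar with the pattern, here is a minimal sketch of what an algorithm-agnostic, <code>pyfunc<\/code>-style wrapper looks like. The <code>PyfuncStylePipeline<\/code> class below is a hypothetical illustration, not MLarena&#8217;s actual implementation: any preprocessor and any scikit-learn-style estimator sit behind a single <code>predict<\/code> entry point.<\/p>\n\n\n\n

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Hypothetical sketch (not MLarena's code): bundle preprocessing and any
# sklearn-style model behind one predict() entry point, mirroring the
# mlflow.pyfunc PythonModel interface.
class PyfuncStylePipeline:
    def __init__(self, model, preprocessor):
        self.model = model
        self.preprocessor = preprocessor

    def fit(self, X, y):
        self.model.fit(self.preprocessor.fit_transform(X), y)
        return self

    def predict(self, model_input):
        # Callers never need to know which algorithm sits underneath
        return self.model.predict(self.preprocessor.transform(model_input))

X, y = make_classification(n_samples=200, random_state=42)
# Swapping algorithms requires no change to the surrounding workflow
for algo in (RandomForestClassifier(random_state=42), LogisticRegression(max_iter=1000)):
    preds = PyfuncStylePipeline(algo, SimpleImputer()).fit(X, y).predict(X)
```

\n\n\n\n<p class=\"wp-block-paragraph\">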
This makes it fully compatible with the MLflow ecosystem: you get robust experiment tracking, model versioning, and seamless deployment regardless of which underlying ML library you use, plus smooth migration between algorithms when necessary. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">MLarena also seeks to strike a balance between automation and expert insight in model development. Many tools either abstract away too much, making it hard to understand what&#8217;s happening under the hood, or require so much boilerplate that they slow down iteration. MLarena aims to bridge that gap: it automates routine machine learning tasks using best practices, while also providing tools for expert users to diagnose, interpret, and optimize their models more effectively.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the sections that follow, we\u2019ll look at how these ideas are reflected in the toolkit\u2019s design and walk through practical examples of how it can support real-world machine learning workflows.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">1. A Lightweight Abstraction for Training and Evaluation<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">One of the recurring pain points in ML workflows is the amount of boilerplate code required just to get a working pipeline, especially when switching between algorithms or frameworks. 
MLarena introduces a lightweight abstraction that standardizes this process while remaining compatible with scikit-learn-style estimators.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here\u2019s a simple example of how the core <code>MLPipeline<\/code> object works:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from mlarena import MLPipeline, PreProcessor\n\n# Define the pipeline\nmlpipeline_rf = MLPipeline(\n    model = RandomForestClassifier(), # works with any sklearn style algorithm\n    preprocessor = PreProcessor() \n)\n# Fit the pipeline\nmlpipeline_rf.fit(X_train,y_train)\n# Predict on new data and evaluate\nresults = mlpipeline_rf.evaluate(X_test, y_test)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This interface wraps together common preprocessing steps, model training, and evaluation. Internally, it auto-detects the task type (classification or regression), applies appropriate metrics, and generates a diagnostic report\u2014all without sacrificing flexibility in how models or preprocessors are defined (more on customization options later).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Rather than abstracting everything away, MLarena focuses on surfacing meaningful defaults and insights. The <code>evaluate<\/code> method doesn\u2019t just return scores, it produces a full report tailored to the task.  <\/p>\n\n\n\n<h3 class=\"wp-block-heading has-text-primary-color has-text-color has-link-color wp-elements-16fe2e600838f5d49e6fc31d587ee082\">1.1 Diagnostic Reporting<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">For classification tasks, the evaluation report includes key metrics such as AUC, MCC, precision, recall, F1, and F-beta (when <code>beta<\/code> is specified). The visual outputs feature a ROC-AUC curve (bottom left), a confusion matrix (bottom right), and a precision\u2013recall\u2013threshold plot at the top. 
In this top plot, precision (blue), recall (red), and F-beta (green, with \u03b2 = 1 by default) are shown across different classification thresholds, with a vertical dotted line indicating the current threshold to highlight the trade-off. These visualizations are useful not only for technical diagnostics, but also for supporting discussions around threshold selection with domain experts (more on threshold optimization later).<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-markdown\">=== Classification Model Evaluation ===\n\n1. Evaluation Parameters\n----------------------------------------\n\u2022 Threshold:   0.500    (Classification cutoff)\n\u2022 Beta:        1.000    (F-beta weight parameter)\n\n2. Core Performance Metrics\n----------------------------------------\n\u2022 Accuracy:    0.805    (Overall correct predictions)\n\u2022 AUC:         0.876    (Ranking quality)\n\u2022 Log Loss:    0.464    (Confidence-weighted error)\n\u2022 Precision:   0.838    (True positives \/ Predicted positives)\n\u2022 Recall:      0.703    (True positives \/ Actual positives)\n\u2022 F1 Score:    0.765    (Harmonic mean of Precision &amp; Recall)\n\u2022 MCC:         0.608    (Matthews Correlation Coefficient)\n\n3. Prediction Distribution\n----------------------------------------\n\u2022 Pos Rate:    0.378    (Fraction of positive predictions)\n\u2022 Base Rate:   0.450    (Actual positive class rate)<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image alignwide\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/1SjVITh4QutJOM4MMfC6mPQ.png\" alt=\"Three evaluation plots for classification models. 
\u2022 Metrics vs Threshold (Precision, Recall, F\u03b2, with vertical threshold line)  \u2022 ROC Curve  \u2022 Confusion Matrix (with colored overlays)\" class=\"wp-image-607266\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">For regression models, MLarena automatically adapts its evaluation metrics and visualisations:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-markdown\">=== Regression Model Evaluation ===\n\n1. Error Metrics\n----------------------------------------\n\u2022 RMSE:         0.460      (Root Mean Squared Error)\n\u2022 MAE:          0.305      (Mean Absolute Error)\n\u2022 Median AE:    0.200      (Median Absolute Error)\n\u2022 NRMSE Mean:   22.4%      (RMSE\/mean)\n\u2022 NRMSE Std:    40.2%      (RMSE\/std)\n\u2022 NRMSE IQR:    32.0%      (RMSE\/IQR)\n\u2022 MAPE:         17.7%      (Mean Abs % Error, excl. zeros)\n\u2022 SMAPE:        15.9%      (Symmetric Mean Abs % Error)\n\n2. Goodness of Fit\n----------------------------------------\n\u2022 R\u00b2:           0.839      (Coefficient of Determination)\n\u2022 Adj. R\u00b2:      0.838      (Adjusted for # of features)\n\n3. Improvement over Baseline\n----------------------------------------\n\u2022 vs Mean:      59.8%      (RMSE improvement)\n\u2022 vs Median:    60.9%      (RMSE improvement)<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image alignwide size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/06\/image-162-1024x405.png\" alt=\"The evaluation plot for regression models. \u2022 Residual analysis (residuals vs predicted, with 95% prediction interval)  \u2022 Prediction error plot (actual vs predicted, with perfect prediction line and error bands)\" class=\"wp-image-607049\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">One danger in the rapid iteration of ML projects is that some underlying issues may go unnoticed. 
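<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For instance, a low sample-to-feature ratio is easy to overlook when iterating quickly. A hand-rolled version of such a check (illustrative only, not MLarena&#8217;s internal code) might look like:<\/p>\n\n\n\n

```python
# Illustrative sketch (not MLarena's internal code): flag a low
# sample-to-feature ratio, where n/k < 10 is a common rule of thumb
# for elevated overfitting risk.
def check_sample_feature_ratio(n_samples: int, n_features: int, min_ratio: float = 10.0) -> str:
    ratio = n_samples / n_features
    if ratio < min_ratio:
        return f"WARNING: n/k = {ratio:.1f} < {min_ratio:.0f}: high overfitting risk"
    return f"n/k = {ratio:.1f}: OK"

print(check_sample_feature_ratio(150, 30))   # 150/30 = 5.0 triggers the warning
print(check_sample_feature_ratio(5000, 25))  # 5000/25 = 200.0 passes
```

\n\n\n\n<p class=\"wp-block-paragraph\">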
Therefore, in addition to the above metrics and plots, a <strong><em>Model Evaluation Diagnostics<\/em><\/strong> section will appear in the report when potential red flags are detected:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Regression Diagnostics<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u26a0\ufe0f&nbsp;Sample-to-feature ratio warnings: Alerts when n\/k &lt; 10, indicating high overfitting risk<br>\u2139\ufe0f MAPE transparency: Reports how many observations were excluded from MAPE due to zero target values<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Classification Diagnostics<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u26a0\ufe0f&nbsp;Data leakage detection: Flags near-perfect AUC (&gt;99%) that often indicates leakage<br>\u26a0\ufe0f&nbsp;Overfitting alerts: Same n\/k ratio warnings as regression<br>\u2139\ufe0f Class imbalance awareness: Flags severely imbalanced class distributions<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Below is an overview of MLarena\u2019s evaluation reports for both classification and regression tasks:<\/p>\n\n\n\n<figure class=\"wp-block-image alignwide\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/1Y2Wx2w3quRaaHb6EzAavUA.png\" alt=\"A table that lists the metrics and plots demonstrated above for regression and classification models side by side.\" class=\"wp-image-607272\"\/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading has-text-primary-color has-text-color has-link-color wp-elements-4bd4fe1bee263f97969efde2c6e0feae\">1.2 Explainability as a Built-In Layer<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Explainability in machine learning projects is crucial for multiple reasons:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>Model Selection<em><br><\/em><\/strong>Explainability helps us choose the best model by letting us evaluate the soundness of its reasoning. 
Even if two models show similar performance metrics, examining the features they rely on with domain experts can reveal which model&#8217;s logic aligns better with real-world understanding.<\/li>\n\n\n\n<li class=\"wp-block-list-item\"><strong>Troubleshooting<\/strong><br>Analyzing a model&#8217;s reasoning is a powerful troubleshooting strategy for improvement. For instance, by investigating why a classification model confidently made a mistake, we can pinpoint the contributing features and correct its reasoning.<\/li>\n\n\n\n<li class=\"wp-block-list-item\"><strong>Model Monitoring<\/strong><br>Beyond typical performance and data drift checks, monitoring model reasoning is highly informative. Getting alerted to significant shifts in the key features driving a production model&#8217;s decisions helps maintain its reliability and relevance.<\/li>\n\n\n\n<li class=\"wp-block-list-item\"><strong>Model Implementation<\/strong><br>Providing model reasoning alongside predictions can be incredibly valuable to end-users. 
For example, a customer service agent could use a churn score along with the specific customer features that lead to that score to better retain a customer.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">To support model interpretability, the <code>explain_model<\/code> method gives you <strong><em>global explanations<\/em><\/strong>, revealing which features have the most significant impact on your model\u2019s predictions.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">mlpipeline.explain_model(X_test)<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image alignwide\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/12577f7R6kuIeJYhxkF__vw.png\" alt=\"shap plot for global feature importance\" class=\"wp-image-607265\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The <code>explain_case<\/code> method provides<strong><em> local explanations<\/em><\/strong> for individual cases, helping us understand how each feature contributes to each specific prediction.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">mlpipeline.explain_case(5)<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image alignwide\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/1qtIo5So1VtOxquULi2ZoAA.png\" alt=\"shap plot for local feature importance, i.e., feature contributions to each prediction\" class=\"wp-image-607263\"\/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading has-text-primary-color has-text-color has-link-color wp-elements-f640b364fabea99f02f3de3da38779db\">1.3 Reproducibility and Deployment Without Extra Overhead<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">One persistent challenge in machine learning projects is ensuring that models are reproducible and production-ready\u2014not just as code, but as complete artifacts that include preprocessing, model logic, and metadata. 
Often, the path from a working notebook to a deployable model involves manually wiring together multiple components and remembering to track all relevant configurations.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To reduce this friction, <code>MLPipeline<\/code> is implemented as a custom <code>mlflow.pyfunc<\/code> model. This design choice allows the entire pipeline (including the preprocessing steps and trained model) to be packaged as a single, portable artifact.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">When evaluating a pipeline, you can enable MLflow logging by setting <code>log_model=True<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">results = mlpipeline.evaluate(\n    X_test, y_test, \n    log_model=True # to log the pipeline with mlflow\n)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Behind the scenes, this triggers a series of MLflow operations:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Starts and manages an MLflow run<\/li>\n\n\n\n<li class=\"wp-block-list-item\">Logs model hyperparameters and evaluation metrics<\/li>\n\n\n\n<li class=\"wp-block-list-item\">Saves the complete pipeline object as a versioned artifact<\/li>\n\n\n\n<li class=\"wp-block-list-item\">Automatically infers the model signature to reduce deployment errors<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">This helps teams maintain experiment traceability and move from experimentation to deployment more smoothly, without duplicating tracking or serialization code. The resulting artifact is compatible with the MLflow Model Registry and can be deployed through any of MLflow\u2019s supported backends.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">2. Tuning Models with Efficiency and Stability in Mind<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Hyperparameter tuning is one of the most resource-intensive parts of building machine learning models. 
While search techniques like grid or random search are common, they can be computationally expensive and often inefficient, especially when applied to large or complex search spaces.&nbsp;Another big concern in hyperparameter optimization is that it could produce unstable models that perform well in development but degrade in production.<\/p>\n\n\n\n<figure class=\"wp-block-image alignwide\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/1WfMiVJORBdT19mvPOVEmYQ.png\" alt=\"A table that compares grid search, random search and Bayesian optimization for hyperparameter tuning.\" class=\"wp-image-607273\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">To address these issues, MLarena includes a <code>tune<\/code> method that simplifies the process of hyperparameter optimization while encouraging robustness and transparency. It builds on Bayesian optimization\u2014an efficient search strategy that adapts based on previous results\u2014and adds guardrails to avoid common pitfalls like overfitting or incomplete search space coverage.<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-text-primary-color has-text-color has-link-color wp-elements-85d1533f3ee49df5351f0e841e38cda4\">2.1 Hyperparameter Optimization with Built-In Early Stopping and Variance Control<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Here\u2019s an example of how to run tuning using LightGBM and a custom search space:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from mlarena import MLPipeline, PreProcessor\nimport lightgbm as lgb\n\nlgb_param_ranges = {\n    &#039;learning_rate&#039;: (0.01, 0.1),  \n    &#039;n_estimators&#039;: (100, 1000),   \n    &#039;num_leaves&#039;: (20, 100),\n    &#039;max_depth&#039;: (5, 15),\n    &#039;colsample_bytree&#039;: (0.6, 1.0),\n    &#039;subsample&#039;: (0.6, 0.9)\n}\n\n# setting up with default settings, see customization below \nbest_pipeline = MLPipeline.tune(\n    
X_train, \n    y_train,\n    algorithm=lgb.LGBMClassifier, # works with any sklearn style algorithm\n    preprocessor=PreProcessor(),\n    param_ranges=lgb_param_ranges \n    )<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">To avoid unnecessary computation, the tuning process includes support for early stopping: you can set a maximum number of evaluations, and stop the process automatically if no improvement is observed after a specified number of trials. This saves computation time while focusing the search on the most promising parts of the search space.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">best_pipeline = MLPipeline.tune(\n    ... \n    max_evals=500,       # maximum optimization iterations\n    early_stopping=50,   # stop if no improvement after 50 trials\n    n_startup_trials=5,  # minimum trials before early stopping kicks in\n    n_warmup_steps=0,    # steps per trial before pruning    \n    )<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">To ensure robust results, MLarena applies cross-validation during hyperparameter tuning. Beyond optimizing for average performance, it also allows you to penalize high variance across folds using the <code>cv_variance_penalty<\/code> parameter. This is particularly valuable in real-world scenarios where model stability can be just as important as raw accuracy.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">best_pipeline = MLPipeline.tune(\n    ...\n    cv=5,                    # number of folds for cross-validation\n    cv_variance_penalty=0.3, # penalize high variance across folds\n    )<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">For example, between two models with identical mean AUC, the one with lower variance across folds is often more reliable in production. 
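<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The arithmetic behind that choice is simple enough to check by hand. Here is a quick sketch with illustrative numbers, using the effective score MLarena optimizes for metrics that are maximized:<\/p>\n\n\n\n

```python
# Effective score for maximized metrics: mean_score - std * cv_variance_penalty.
# Illustrative numbers for two candidates with identical mean AUC.
def effective_score(mean_auc: float, std: float, cv_variance_penalty: float = 0.3) -> float:
    return mean_auc - std * cv_variance_penalty

model_a = effective_score(0.85, 0.02)  # 0.85 - 0.006 = 0.844
model_b = effective_score(0.85, 0.10)  # 0.85 - 0.030 = 0.820
# The more stable model A wins despite the tie on mean AUC
```

\n\n\n\n<p class=\"wp-block-paragraph\">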
It will be selected by MLarena tuning due to its better effective score, which is <code>mean_auc - std * cv_variance_penalty<\/code>:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th class=\"has-text-align-center\" data-align=\"center\">Model<\/th><th>Mean AUC<\/th><th>Std Dev<\/th><th>Effective Score<\/th><\/tr><\/thead><tbody><tr><td class=\"has-text-align-center\" data-align=\"center\">A<\/td><td>0.85<\/td><td>0.02<\/td><td>0.85 &#8211; 0.02 *<br>0.3 (penalty)<\/td><\/tr><tr><td class=\"has-text-align-center\" data-align=\"center\">B<\/td><td>0.85<\/td><td>0.10<\/td><td>0.85 &#8211; 0.10 *<br>0.3 (penalty)<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading has-text-primary-color has-text-color has-link-color wp-elements-d18e007fab76721d00f6baeb6f337dd8\">2.2 Diagnosing Search Space Design with Visual Feedback<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Another frequent bottleneck in tuning is designing a good search space. 
If the range for a hyperparameter is too narrow or too broad, the optimizer may waste iterations or miss high-performing regions entirely.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To support more informed search design, MLarena includes a <strong>parallel coordinates plot<\/strong> that visualizes how different hyperparameter values relate to model performance:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">You can <strong>spot trends<\/strong>, such as which ranges of <code>learning_rate<\/code> consistently yield better results.<\/li>\n\n\n\n<li class=\"wp-block-list-item\">You can <strong>identify edge clustering<\/strong>, where top-performing trials are bunched at the boundary of a parameter range, often a sign that the range needs adjustment.<\/li>\n\n\n\n<li class=\"wp-block-list-item\">You can <strong>see interactions<\/strong> across multiple hyperparameters, helping refine your intuition or guide further exploration.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">This kind of visualization helps users refine search spaces iteratively, leading to better results with fewer iterations.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">best_pipeline = MLPipeline.tune(\n    ...\n    # to show parallel coordinate plot:\n    visualize = True # default=True\n    )<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image alignwide\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/17W5f4n3aeexNDs02Egbl6A.png\" alt=\"Parallel coordinates plot that visualizes all the runs in the hyperparameter search. 
Each line in the plot represents one unique trial.\" class=\"wp-image-607271\"\/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading has-text-primary-color has-text-color has-link-color wp-elements-48a58c2f7a24625b80e298c4fdf15795\">2.3 Choosing the Right Metric for the Problem<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The objective of tuning isn\u2019t always the same: in some cases you want to maximize AUC; in others, you may care more about minimizing RMSE or SMAPE. But different metrics also require different optimization directions\u2014and when combined with the cross-validation variance penalty, which either needs to be added to or subtracted from the CV mean depending on the optimization direction, the math can get tedious. \ud83d\ude05<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">MLarena simplifies this by supporting a wide range of metrics for both classification and regression:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Classification metrics:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><code>auc<\/code> (default)<\/li>\n\n\n\n<li class=\"wp-block-list-item\"><code>f1<\/code><\/li>\n\n\n\n<li class=\"wp-block-list-item\"><code>accuracy<\/code><\/li>\n\n\n\n<li class=\"wp-block-list-item\"><code>log_loss<\/code><\/li>\n\n\n\n<li class=\"wp-block-list-item\"><code>mcc<\/code><\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Regression metrics:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><code>rmse<\/code> (default)<\/li>\n\n\n\n<li class=\"wp-block-list-item\"><code>mae<\/code><\/li>\n\n\n\n<li class=\"wp-block-list-item\"><code>median_ae<\/code><\/li>\n\n\n\n<li class=\"wp-block-list-item\"><code>smape<\/code><\/li>\n\n\n\n<li class=\"wp-block-list-item\"><code>nrmse_mean<\/code>, <code>nrmse_iqr<\/code>, <code>nrmse_std<\/code><\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">To switch metrics, simply pass <code>tune_metric<\/code> to the 
method:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">best_pipeline = MLPipeline.tune(\n    ...\n    tune_metric = &quot;f1&quot;\n    )<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">MLarena handles the rest, automatically determining whether the metric should be maximized or minimized and applying the variance penalty consistently.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">3. Tackling Real-World Preprocessing Challenges<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Preprocessing is often one of the most overlooked steps in machine learning workflows, and also one of the most error-prone. Dealing with missing values, high-cardinality categoricals, irrelevant features, and inconsistent column naming can introduce subtle bugs, degrade model performance, or block production deployment altogether.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">MLarena&#8217;s <code>PreProcessor<\/code> was designed to make this step more robust and less ad hoc. It offers sensible defaults for common use cases, while providing the flexibility and tooling needed for more complex scenarios.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here\u2019s an example of the default configuration:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from mlarena import PreProcessor\n\npreprocessor = PreProcessor(\n    num_impute_strategy=&quot;median&quot;,          # Numeric missing value imputation\n    cat_impute_strategy=&quot;most_frequent&quot;,   # Categorical missing value imputation\n    target_encode_cols=None,               # Columns for target encoding (optional)\n    target_encode_smooth=&quot;auto&quot;,           # Smoothing for target encoding\n    drop=&quot;if_binary&quot;,                      # Drop strategy for one-hot encoding\n    sanitize_feature_names=True            # Clean up special characters in column names\n)\n\nX_train_prep = preprocessor.fit_transform(X_train)\nX_test_prep = 
preprocessor.transform(X_test)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">These defaults are often sufficient for quick iteration. But real-world datasets rarely fit neatly into defaults. So let&#8217;s explore some of the more nuanced preprocessing tasks the <code>PreProcessor<\/code> supports.<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-text-primary-color has-text-color has-link-color wp-elements-575b036a5025c148717b4db057fb1124\">3.1 Managing High-Cardinality Categoricals with Target Encoding<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">High-cardinality categorical features pose a challenge: traditional one-hot encoding can result in hundreds of sparse columns. Target encoding offers a compact alternative, replacing categories with smoothed averages of the target variable. However, tuning the smoothing parameter is tricky: too little smoothing leads to overfitting, while too much dilutes useful signal.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">MLarena adopts the empirical Bayes-based smoothing approach of scikit-learn&#8217;s <code>TargetEncoder<\/code> when <code>target_encode_smooth=&quot;auto&quot;<\/code>, and also allows users to specify numeric values (see the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.TargetEncoder.html\" target=\"_blank\" rel=\"noreferrer noopener\">sklearn TargetEncoder doc<\/a> and <a href=\"https:\/\/dl.acm.org\/doi\/10.1145\/507533.507538\" target=\"_blank\" rel=\"noreferrer noopener\">Micci-Barreca, 2001<\/a>).<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">preprocessor = PreProcessor(\n    target_encode_cols=[&#039;city&#039;],\n    target_encode_smooth=&#039;auto&#039;\n)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">To help guide this choice, the <code>plot_target_encoding_comparison<\/code> method visualizes how different smoothing values affect the encoding of rare categories. 
For example:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">PreProcessor.plot_target_encoding_comparison(\n    X_train, y_train,\n    target_encode_col=&#039;city&#039;,\n    smooth_params=[&#039;auto&#039;, 10, 20]\n)<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image alignwide\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/1ZN_1UhvvOglMVkacRenP1A.png\" alt=\"\" class=\"wp-image-607269\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">This is especially useful for inspecting the effect on underrepresented categories (e.g., a city like &#8220;Seattle&#8221; with only 24 samples). The visualization shows that different smoothing parameters lead to marked differences in Seattle\u2019s encoded value. Such clear visuals support data specialists and domain experts in having meaningful discussions and making informed decisions on the best encoding strategy.<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-text-primary-color has-text-color has-link-color wp-elements-52fea5800ad645e4632cf74d6edd8ea4\">3.2 Identifying and Removing Unhelpful Features<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Another common challenge is feature overload: too many variables, not all of which contribute meaningful signals. 
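<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">One common filter signal here is mutual information between each feature and the target. A quick scikit-learn sketch (synthetic data, not MLarena&#8217;s API) shows how it separates signal from noise:<\/p>\n\n\n\n

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Synthetic illustration: one feature carries signal, one is pure noise
rng = np.random.default_rng(42)
y = rng.integers(0, 2, 1000)
informative = y + rng.normal(0, 0.5, 1000)  # correlated with the target
noise = rng.normal(0, 1, 1000)              # unrelated to the target
X = np.column_stack([informative, noise])

mi = mutual_info_classif(X, y, random_state=42)
# The informative column scores well above the noise column,
# so a low-MI threshold would drop only the noise feature
```

\n\n\n\n<p class=\"wp-block-paragraph\">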
Selecting a cleaner subset can improve both performance and interpretability.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The <code>filter_feature_selection<\/code> method helps filter out:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Features with high missingness<\/li>\n\n\n\n<li class=\"wp-block-list-item\">Features with only one unique value<\/li>\n\n\n\n<li class=\"wp-block-list-item\">Features with low mutual information with the target<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Here\u2019s how it works:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">filter_fs = PreProcessor.filter_feature_selection(\n    X_train,\n    y_train,\n    task=&#039;classification&#039;, # or &#039;regression&#039;\n    missing_threshold=0.2, # drop features with &gt; 20% missing values\n    mi_threshold=0.05,     # drop features with low mutual information\n)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This returns a summary like:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-markdown\">Filter Feature Selection Summary:\n==========\nTotal features analyzed: 7\n\n1. High missing ratio (&gt;20.0%): 0 columns\n\n2. Single value: 1 columns\n   Columns: occupation\n\n3. 
Low mutual information (&lt;0.05): 3 columns\n   Columns: age, tenure, occupation\n\nRecommended drops: (3 columns in total)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The selected features can be accessed programmatically:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">selected_cols = filter_fs[&#039;selected_cols&#039;]\nX_train_selected = X_train[selected_cols]<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image alignwide\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/1QvNqjjdmKD-IsV1mCB9Gqg.png\" alt=\"An `analysis` table returned by `filter_feature_selection`\" class=\"wp-image-607267\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">This early filter step doesn\u2019t replace full feature engineering or wrapper-based selection (which is on the roadmap), but helps reduce noise before heavier modelling begins.<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-text-primary-color has-text-color has-link-color wp-elements-ea96dc2f7aa1273d40d0419bc8389f90\">3.3 Preventing Downstream Errors with Column Name Sanitization\u00a0<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">When one-hot encoding is applied to categorical features, column names can inherit special characters, like <code>&#039;age_60+&#039;<\/code> or <code>&#039;income_&lt;$30K&#039;<\/code>. 
These characters can break pipelines downstream, especially during logging, deployment, or use with MLflow.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To reduce the risk of silent pipeline failures, MLarena automatically sanitizes feature names by default:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">preprocessor = PreProcessor(sanitize_feature_names=True)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Characters like <code>+<\/code>, <code>&lt;<\/code>, and <code>%<\/code> are replaced with safe alternatives as shown in the table below, improving compatibility with production-grade tooling. Users who prefer raw names can easily disable this behavior by setting <code>sanitize_feature_names=False<\/code>.<\/p>\n\n\n\n<figure class=\"wp-block-image alignwide\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/16VFSCLBK49a09iCbOWB7dA.png\" alt=\"A table that shows the original and sanitized feature names\" class=\"wp-image-607268\"\/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">4. Solving Everyday Challenges in ML Practice<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In real-world machine learning projects, success goes beyond model accuracy. It often depends on how clearly we communicate results, how well our tools support stakeholder decision-making, and how reliably our pipelines handle imperfect data. MLarena includes a growing set of utilities designed to address these practical challenges. Below are just a few examples.<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-text-primary-color has-text-color has-link-color wp-elements-2332108f8304ed34a3f491094a464e33\">4.1 Threshold Analysis for Classification Problems<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Binary classification models often output probabilities, but real-world decisions require a hard threshold to separate positives from negatives. This choice affects precision, recall, and ultimately, business outcomes. 
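<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To see that trade-off concretely, here is a minimal, self-contained sketch that sweeps candidate thresholds and scores each one with F-beta in plain scikit-learn. The labels and probabilities below are purely illustrative, and this is a conceptual sketch of threshold tuning, not MLarena&#8217;s implementation:<\/p>\n\n\n\n

```python
# Illustrative sketch: pick the threshold that maximizes F-beta.
# beta < 1 weights precision more heavily; beta > 1 favours recall.
import numpy as np
from sklearn.metrics import fbeta_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])                     # toy labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6])  # toy probabilities

beta = 0.8
thresholds = np.linspace(0.1, 0.9, 17)
scores = [
    fbeta_score(y_true, (y_score >= t).astype(int), beta=beta, zero_division=0)
    for t in thresholds
]
best_threshold = thresholds[int(np.argmax(scores))]
print(f"best threshold: {best_threshold:.2f}")  # 0.65 on this toy data
```

\n\n\n\n<p class=\"wp-block-paragraph\">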
Yet in practice, thresholds are often left at the default 0.5, even when that\u2019s not aligned with domain needs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">MLarena\u2019s <code>threshold_analysis<\/code> method helps make this choice more rigorous and tailored. We can:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Customize the precision-recall balance via the beta parameter in the F-beta score<br><img decoding=\"async\" src=\"https:\/\/cdn-images-1.medium.com\/max\/800\/1*tOm-3FVI549aaxKqJCsPjg.png\" alt=\"A table that summarizes how the beta value in F-beta addresses the tradeoff between precision and recall\"><\/li>\n\n\n\n<li class=\"wp-block-list-item\">Find the optimal classification threshold based on our business goals by maximizing F-beta<\/li>\n\n\n\n<li class=\"wp-block-list-item\">Use bootstrapping or stratified k-fold cross-validation for robust, reliable estimates<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Perform threshold analysis using bootstrap method\nresults = MLPipeline.threshold_analysis(  \n    y_train,                     # True labels for training data\n    y_pred_proba,                # Predicted probabilities from model\n    beta = 0.8,                  # F-beta score parameter (weights precision more than recall)\n    method = &quot;bootstrap&quot;,        # Use bootstrap resampling for robust results\n    bootstrap_iterations=100)    # Number of bootstrap samples to generate\n\n# use the optimal threshold identified on new data\nbest_pipeline.evaluate(\n    X_test, y_test, beta=0.8, \n    threshold=results[&#039;optimal_threshold&#039;]\n    )\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This enables practitioners to tie model decisions more closely to domain priorities, such as catching more fraud cases (recall) or reducing false alarms in quality control (precision).<\/p>\n\n\n\n<h3 
class=\"wp-block-heading has-text-primary-color has-text-color has-link-color wp-elements-9f8839e589211b4d8c9f83ae54cc0ffa\">4.2 Communicating with Clarity Through Visualization<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Strong visualizations are essential not just for EDA, but for engaging stakeholders and validating findings. MLarena includes a set of plotting utilities designed for interpretability and clarity.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">4.2.1 Comparing Distributions Across Groups<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">When analyzing numerical data across distinct categories such as regions, cohorts, or treatment groups, a comprehensive understanding requires more than just central tendency metrics like mean or median. It&#8217;s crucial to also grasp the data&#8217;s dispersion and identify any outliers. To address this, the <code>plot_box_scatter<\/code> function in Mlarena overlays boxplots with jittered scatter points, providing rich distribution information within a single, intuitive visualization.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Furthermore, complementing visual insights with robust statistical analysis often proves invaluable. 
Therefore, the plotting function optionally integrates statistical tests such as ANOVA, Welch&#8217;s ANOVA, and Kruskal-Wallis, allowing us to annotate our plots with statistical test results, as demonstrated below.&nbsp;<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import mlarena.utils.plot_utils as put\n\nfig, ax, results = put.plot_box_scatter(\n    data=df,\n    x=&quot;item&quot;,\n    y=&quot;value&quot;,\n    title=&quot;Boxplot with Scatter Overlay (Demo for Crowded Data)&quot;,\n    point_size=2,\n    xlabel=&quot; &quot;,\n    stat_test=&quot;anova&quot;,      # specify a statistical test\n    show_stat_test=True\n    )<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image alignwide\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/1KgDYKzLd8-FhkmfctjTXEQ.png\" alt=\"A plot created by the `plot_box_scatter` function where the distributions of 8 categories are visualized\" class=\"wp-image-607270\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">There are many ways to customize the plot\u200a\u2014\u200aeither by modifying the returned <code>ax<\/code> object or using built-in function parameters. 
For example, you can color the points by another variable using the <code>point_hue<\/code> parameter.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">fig, ax = put.plot_box_scatter(\n    data=df,\n    x=&quot;group&quot;,\n    y=&quot;value&quot;,\n    point_hue=&quot;source&quot;, # color points by source\n    point_alpha=0.5,\n    title=&quot;Boxplot with Scatter Overlay (Demo for Point Hue)&quot;,\n)<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image alignwide\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/1-xkyhsZH-bdqpqq2fwmVjg.png\" alt=\"A plot created by `plot_box_scatter` function where the points were colored by a third variable\" class=\"wp-image-607262\"\/><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">4.2.2 Visualizing Temporal Distribution<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Data specialists and domain experts frequently need to observe how the distribution of a continuous variable evolves over time to spot critical shifts, emerging trends, or anomalies.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This often involves boilerplate tasks like aggregating data by desired time granularity (hourly, weekly, monthly, etc.), ensuring correct chronological order, and customizing appearances, such as coloring points by a third variable of interest. Our <code>plot_distribution_over_time<\/code> function handles these complexities with ease. 
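<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For a sense of the boilerplate the function absorbs, doing this by hand typically looks something like the pandas sketch below (the column names and data are illustrative only):<\/p>\n\n\n\n

```python
# Illustrative boilerplate: hourly aggregation of a value column by hand.
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=6, freq="30min"),
    "heart_rate": [72, 75, 71, 80, 78, 77],
})

hourly = (
    df.set_index("timestamp")["heart_rate"]
      .resample("h")                      # group by hour, in chronological order
      .agg(["mean", "min", "max"])        # summary stats per time bucket
)
print(hourly)
```

\n\n\n\n<p class=\"wp-block-paragraph\">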
<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># automatically group data and format X-axis label by specified granularity\nfig, ax = put.plot_distribution_over_time(\n    data=df,\n    x=&#039;timestamp&#039;,\n    y=&#039;heart_rate&#039;,\n    freq=&#039;h&#039;,                                   # specify granularity\n    point_hue=None,                             # set a variable to color points if desired\n    title=&#039;Heart Rate Distribution Over Time&#039;,\n    xlabel=&#039; &#039;,\n    ylabel=&#039;Heart Rate (bpm)&#039;,\n)<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image alignwide\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/1nMNT6B9XPwwjRavHXJOEuA.png\" alt=\"A plot created by the `plot_distribution_over_time` function\" class=\"wp-image-607264\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">More demos of plotting functions and examples are available in the <a href=\"https:\/\/github.com\/MenaWANG\/mlarena\/blob\/master\/examples\/3.utils_plot.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">plot_utils documentation<\/a>\ud83d\udd17.&nbsp;<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-text-primary-color has-text-color has-link-color wp-elements-ce5d6104e8065027fbf2e9c41572aeac\">4.3 Data Utilities<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">If you\u2019re like me, you probably spend a lot of time cleaning and troubleshooting data before getting to the fun parts of machine learning. \ud83d\ude05 Real-world data is often messy, inconsistent, and full of surprises. 
That\u2019s why MLarena includes a growing collection of <code>data_utils<\/code> functions to simplify and streamline our EDA and data preparation process.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">4.3.1 Cleaning Up Inconsistent Date Formats<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Date columns don\u2019t always arrive in clean, ISO formats, and inconsistent casing or formats can be a real headache. The <code>transform_date_cols<\/code> function helps standardize date columns for downstream analysis, even when values have irregular formats like:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import mlarena.utils.data_utils as dut\n\ndf_raw = pd.DataFrame({\n    ...\n    &quot;date&quot;: [&quot;25Aug2024&quot;, &quot;15OCT2024&quot;, &quot;01Dec2024&quot;],  # inconsistent casing\n})\n\n# transform the specified date columns\ndf_transformed = dut.transform_date_cols(df_raw, &#039;date&#039;, &quot;%d%b%Y&quot;)\ndf_transformed[&#039;date&#039;]\n# 0   2024-08-25\n# 1   2024-10-15\n# 2   2024-12-01<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">It automatically handles case variations and converts the column into proper datetime objects.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If you sometimes forget the Python date format codes or mix them up with Spark&#8217;s, you\u2019re not alone \ud83d\ude01. Just check the function&#8217;s docstring for a quick refresher. 
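<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Incidentally, the case-insensitive matching of month names is behaviour Python itself provides: <code>strptime<\/code> matches directives like <code>%b<\/code> regardless of casing. A quick standalone check:<\/p>\n\n\n\n

```python
# Python's strptime matches %b month abbreviations case-insensitively,
# so mixed-case inputs like "15OCT2024" parse without pre-cleaning.
from datetime import datetime

for raw in ["25Aug2024", "15OCT2024", "01dec2024"]:
    parsed = datetime.strptime(raw, "%d%b%Y")
    print(raw, "->", parsed.date().isoformat())
```

\n\n\n\n<p class=\"wp-block-paragraph\">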
<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">?dut.transform_date_cols  # check for docstring<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-markdown\">Signature:\n----------\ndut.transform_date_cols(\n    data: pandas.core.frame.DataFrame,\n    date_cols: Union[str, List[str]],\n    str_date_format: str = &#039;%Y%m%d&#039;,\n) -&gt; pandas.core.frame.DataFrame\nDocstring:\nTransforms specified columns in a Pandas DataFrame to datetime format.\n\nParameters\n----------\ndata : pd.DataFrame\n    The input DataFrame.\ndate_cols : Union[str, List[str]]\n    A column name or list of column names to be transformed to dates.\nstr_date_format : str, default=&quot;%Y%m%d&quot;\n    The string format of the dates, using Python&#039;s `strftime`\/`strptime` directives.\n    Common directives include:\n        %d: Day of the month as a zero-padded decimal (e.g., 25)\n        %m: Month as a zero-padded decimal number (e.g., 08)\n        %b: Abbreviated month name (e.g., Aug)\n        %B: Full month name (e.g., August)\n        %Y: Four-digit year (e.g., 2024)\n        %y: Two-digit year (e.g., 24)<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">4.3.2 Verifying Primary Keys in Messy Data<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Identifying a valid primary key can be challenging in real-world, messy datasets. While a traditional primary key must inherently be unique across all rows and contain no missing values, potential key columns often contain nulls, particularly in the early stages of a data pipeline. 
<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The <code>is_primary_key<\/code> function adopts a pragmatic approach to this challenge: it alerts users to any missing values within potential key columns and then verifies if the remaining non-null rows are uniquely identifiable.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This is useful for:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>Data quality assessment<\/strong>: Quickly assess the completeness and uniqueness of our key fields.<\/li>\n\n\n\n<li class=\"wp-block-list-item\"><strong>Join readiness<\/strong>: Identify reliable keys for merging datasets, even when some values are initially missing.<\/li>\n\n\n\n<li class=\"wp-block-list-item\"><strong>ETL validation<\/strong>: Verify key constraints while accounting for real-world data imperfections.<\/li>\n\n\n\n<li class=\"wp-block-list-item\"><strong>Schema design<\/strong>: Inform robust database schema planning with insights derived from actual data key characteristics.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">As such, <code>is_primary_key<\/code> is particularly valuable for designing resilient data pipelines in less-than-perfect data environments. 
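<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Conceptually, the core of such a check fits in a few lines of pandas. The sketch below is an illustrative reduction of the idea (drop rows with missing key values, then test the rest for uniqueness), not MLarena&#8217;s implementation, and <code>is_candidate_key<\/code> is a hypothetical helper name:<\/p>\n\n\n\n

```python
# Illustrative reduction of the idea behind a null-tolerant primary-key check:
# drop rows with missing key values, then test the remaining rows for uniqueness.
import pandas as pd

def is_candidate_key(df: pd.DataFrame, cols: list) -> bool:
    non_null = df.dropna(subset=cols)            # ignore rows with missing key parts
    return not non_null.duplicated(subset=cols).any()

df = pd.DataFrame({
    "code": ["X1", None, "X3", "X4"],
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02"],
})
print(is_candidate_key(df, ["date"]))            # False: dates repeat
print(is_candidate_key(df, ["code", "date"]))    # True once the null row is dropped
```

\n\n\n\n<p class=\"wp-block-paragraph\">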
It supports both single and composite keys by accepting either a column name or a list of columns.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">df = pd.DataFrame({\n    # Single column primary key\n    &#039;id&#039;: [1, 2, 3, 4, 5],    \n    # Column with duplicates\n    &#039;category&#039;: [&#039;A&#039;, &#039;B&#039;, &#039;A&#039;, &#039;B&#039;, &#039;C&#039;],    \n    # Date column with some duplicates\n    &#039;date&#039;: [&#039;2024-01-01&#039;, &#039;2024-01-01&#039;, &#039;2024-01-02&#039;, &#039;2024-01-02&#039;, &#039;2024-01-03&#039;],\n    # Column with null values\n    &#039;code&#039;: [&#039;X1&#039;, None, &#039;X3&#039;, &#039;X4&#039;, &#039;X5&#039;],    \n    # Values column\n    &#039;value&#039;: [100, 200, 300, 400, 500]\n})\n\nprint(&quot;\\nTest 1: Column with duplicates&quot;)\ndut.is_primary_key(df, [&#039;category&#039;])  # Should return False\n\nprint(&quot;\\nTest 2: Column with null values&quot;)\ndut.is_primary_key(df, [&#039;code&#039;,&#039;date&#039;]) # Should return True<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-markdown\">Test 1: Column with duplicates\n\u2705 There are no missing values in column &#039;category&#039;.\n\u2139\ufe0f Total row count after filtering out missings: 5\n\u2139\ufe0f Unique row count after filtering out missings: 3\n\u274c The column(s) &#039;category&#039; do not form a primary key.\n\nTest 2: Column with null values\n\u26a0\ufe0f There are 1 row(s) with missing values in column &#039;code&#039;.\n\u2705 There are no missing values in column &#039;date&#039;.\n\u2139\ufe0f Total row count after filtering out missings: 4\n\u2139\ufe0f Unique row count after filtering out missings: 4\n\ud83d\udd11 The column(s) &#039;code&#039;, &#039;date&#039; form a primary key after removing rows with missing values.<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity 
is-style-dotted\"\/>\n\n\n\n<p class=\"wp-block-paragraph\">Beyond what we&#8217;ve covered, the <code>data_utils<\/code> module offers other handy utilities, including a dedicated set of three functions for the &#8220;Discover \u2192 Investigate \u2192 Resolve&#8221; deduplication workflow, where <code>is_primary_key<\/code> discussed above, serves as the initial step. More details are available in the <a href=\"https:\/\/github.com\/MenaWANG\/mlarena\/blob\/master\/examples\/3.utils_data.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\">data_utils demo<\/a>\ud83d\udd17.&nbsp;<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n\n\n\n<p class=\"wp-block-paragraph\">And there you have it \u2014 an introduction to the MLarena package. My hope is that these tools prove as valuable for streamlining your machine learning workflows as they have been for mine. This is an open-source, not-for-profit initiative. Please don&#8217;t hesitate to reach out if you have any questions or would like to request new features. I\u2019d love to hear from you! \ud83e\udd17<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Stay tuned, and follow me on <a href=\"https:\/\/menawang.medium.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Medium<\/a>. 
\ud83d\ude01<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\ud83d\udcbc<a href=\"https:\/\/www.linkedin.com\/in\/mena-ning-wang\/\" rel=\"noreferrer noopener\" target=\"_blank\">LinkedIn<\/a> | \ud83d\ude3a<a href=\"https:\/\/github.com\/MenaWANG\" rel=\"noreferrer noopener\" target=\"_blank\">GitHub<\/a> | \ud83d\udd4a\ufe0f<a href=\"https:\/\/x.com\/mena_wang\" rel=\"noreferrer noopener\" target=\"_blank\">Twitter\/X<\/a><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n\n\n\n<p class=\"wp-block-paragraph\">Unless otherwise noted, all images are by the author.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The framework is now an open-source Python package for streamlined ML workflows<\/p>\n","protected":false},"author":18,"featured_media":606510,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"is_member_only":false,"sub_heading":"The framework is now an open-source Python package for streamlined ML workflows","footnotes":""},"categories":[22],"tags":[448,1207,468,588,1078],"sponsor":[],"coauthors":[30027],"class_list":["post-606509","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","tag-data-science","tag-databricks","tag-deep-dives","tag-mlflow","tag-mlops"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Build Algorithm-Agnostic ML Pipelines in a\u00a0Breeze | Towards Data Science<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Build Algorithm-Agnostic ML Pipelines in 
a\u00a0Breeze | Towards Data Science\" \/>\n<meta property=\"og:description\" content=\"The framework is now an open-source Python package for streamlined ML workflows\" \/>\n<meta property=\"og:url\" content=\"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/\" \/>\n<meta property=\"og:site_name\" content=\"Towards Data Science\" \/>\n<meta property=\"article:published_time\" content=\"2025-07-07T19:49:04+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-07-07T19:49:27+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/image-161.png\" \/>\n\t<meta property=\"og:image:width\" content=\"800\" \/>\n\t<meta property=\"og:image:height\" content=\"520\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Mena Wang\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@TDataScience\" \/>\n<meta name=\"twitter:site\" content=\"@TDataScience\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Mena Wang\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"16 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/\"},\"author\":{\"name\":\"TDS Editors\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee\"},\"headline\":\"Build Algorithm-Agnostic ML Pipelines in a\u00a0Breeze\",\"datePublished\":\"2025-07-07T19:49:04+00:00\",\"dateModified\":\"2025-07-07T19:49:27+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/\"},\"wordCount\":2930,\"publisher\":{\"@id\":\"https:\/\/towardsdatascience.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/image-161.png\",\"keywords\":[\"Data Science\",\"Databricks\",\"Deep Dives\",\"Mlflow\",\"Mlops\"],\"articleSection\":[\"Machine Learning\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/\",\"url\":\"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/\",\"name\":\"Build Algorithm-Agnostic ML Pipelines in a\u00a0Breeze | Towards Data 
Science\",\"isPartOf\":{\"@id\":\"https:\/\/towardsdatascience.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/image-161.png\",\"datePublished\":\"2025-07-07T19:49:04+00:00\",\"dateModified\":\"2025-07-07T19:49:27+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/#primaryimage\",\"url\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/image-161.png\",\"contentUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/image-161.png\",\"width\":800,\"height\":520,\"caption\":\"Photo by Photoholgic on Unsplash\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/towardsdatascience.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Build Algorithm-Agnostic ML Pipelines in a\u00a0Breeze\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/towardsdatascience.com\/#website\",\"url\":\"https:\/\/towardsdatascience.com\/\",\"name\":\"Towards Data Science\",\"description\":\"Publish AI, ML &amp; data-science insights to a global community of data 
professionals.\",\"publisher\":{\"@id\":\"https:\/\/towardsdatascience.com\/#organization\"},\"alternateName\":\"TDS\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/towardsdatascience.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/towardsdatascience.com\/#organization\",\"name\":\"Towards Data Science\",\"alternateName\":\"TDS\",\"url\":\"https:\/\/towardsdatascience.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg\",\"contentUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg\",\"width\":696,\"height\":696,\"caption\":\"Towards Data Science\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/TDataScience\",\"https:\/\/www.youtube.com\/c\/TowardsDataScience\",\"https:\/\/www.linkedin.com\/company\/towards-data-science\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee\",\"name\":\"TDS Editors\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/image\/23494c9101089ad44ae88ce9d2f56aac\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g\",\"caption\":\"TDS Editors\"},\"description\":\"Building a vibrant data science and machine learning community. 
Share your insights and projects with our global audience: bit.ly\/write-for-tds\",\"url\":\"https:\/\/towardsdatascience.com\/author\/towardsdatascience\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Build Algorithm-Agnostic ML Pipelines in a\u00a0Breeze | Towards Data Science","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/","og_locale":"en_US","og_type":"article","og_title":"Build Algorithm-Agnostic ML Pipelines in a\u00a0Breeze | Towards Data Science","og_description":"The framework is now an open-source Python package for streamlined ML workflows","og_url":"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/","og_site_name":"Towards Data Science","article_published_time":"2025-07-07T19:49:04+00:00","article_modified_time":"2025-07-07T19:49:27+00:00","og_image":[{"width":800,"height":520,"url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/image-161.png","type":"image\/png"}],"author":"Mena Wang","twitter_card":"summary_large_image","twitter_creator":"@TDataScience","twitter_site":"@TDataScience","twitter_misc":{"Written by":"Mena Wang","Est. 
reading time":"16 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/#article","isPartOf":{"@id":"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/"},"author":{"name":"TDS Editors","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee"},"headline":"Build Algorithm-Agnostic ML Pipelines in a\u00a0Breeze","datePublished":"2025-07-07T19:49:04+00:00","dateModified":"2025-07-07T19:49:27+00:00","mainEntityOfPage":{"@id":"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/"},"wordCount":2930,"publisher":{"@id":"https:\/\/towardsdatascience.com\/#organization"},"image":{"@id":"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/#primaryimage"},"thumbnailUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/image-161.png","keywords":["Data Science","Databricks","Deep Dives","Mlflow","Mlops"],"articleSection":["Machine Learning"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/","url":"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/","name":"Build Algorithm-Agnostic ML Pipelines in a\u00a0Breeze | Towards Data 
Science","isPartOf":{"@id":"https:\/\/towardsdatascience.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/#primaryimage"},"image":{"@id":"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/#primaryimage"},"thumbnailUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/image-161.png","datePublished":"2025-07-07T19:49:04+00:00","dateModified":"2025-07-07T19:49:27+00:00","breadcrumb":{"@id":"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/#primaryimage","url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/image-161.png","contentUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/image-161.png","width":800,"height":520,"caption":"Photo by Photoholgic on Unsplash"},{"@type":"BreadcrumbList","@id":"https:\/\/towardsdatascience.com\/build-algorithm-agnostic-ml-pipelines-in-a-breeze\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/towardsdatascience.com\/"},{"@type":"ListItem","position":2,"name":"Build Algorithm-Agnostic ML Pipelines in a\u00a0Breeze"}]},{"@type":"WebSite","@id":"https:\/\/towardsdatascience.com\/#website","url":"https:\/\/towardsdatascience.com\/","name":"Towards Data Science","description":"Publish AI, ML &amp; data-science insights to a global community of data 
professionals.","publisher":{"@id":"https:\/\/towardsdatascience.com\/#organization"},"alternateName":"TDS","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/towardsdatascience.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/towardsdatascience.com\/#organization","name":"Towards Data Science","alternateName":"TDS","url":"https:\/\/towardsdatascience.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/","url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg","contentUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg","width":696,"height":696,"caption":"Towards Data Science"},"image":{"@id":"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/TDataScience","https:\/\/www.youtube.com\/c\/TowardsDataScience","https:\/\/www.linkedin.com\/company\/towards-data-science\/"]},{"@type":"Person","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee","name":"TDS Editors","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/image\/23494c9101089ad44ae88ce9d2f56aac","url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","caption":"TDS Editors"},"description":"Building a vibrant data science and machine learning community. 
Share your insights and projects with our global audience: bit.ly\/write-for-tds","url":"https:\/\/towardsdatascience.com\/author\/towardsdatascience\/"}]}},"distributor_meta":false,"distributor_terms":false,"distributor_media":false,"distributor_original_site_name":"TDS Contributor Portal","distributor_original_site_url":"https:\/\/contributor.insightmediagroup.io","push-errors":false,"_links":{"self":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts\/606509","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/users\/18"}],"replies":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/comments?post=606509"}],"version-history":[{"count":0,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts\/606509\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/media\/606510"}],"wp:attachment":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/media?parent=606509"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/categories?post=606509"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/tags?post=606509"},{"taxonomy":"sponsor","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/sponsor?post=606509"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/coauthors?post=606509"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}