Model Evaluation | Towards Data Science https://towardsdatascience.com/tag/model-evaluation/ Publish AI, ML & data-science insights to a global community of data professionals. Tue, 15 Jul 2025 00:41:53 +0000

Accuracy Is Dead: Calibration, Discrimination, and Other Metrics You Actually Need https://towardsdatascience.com/accuracy-is-dead-calibration-discrimination-and-other-metrics-you-actually-need/ Tue, 15 Jul 2025 00:41:39 +0000 https://towardsdatascience.com/?p=606583 A deep dive into advanced evaluation for data scientists

The post Accuracy Is Dead: Calibration, Discrimination, and Other Metrics You Actually Need appeared first on Towards Data Science.

How to Evaluate LLMs and Algorithms — The Right Way https://towardsdatascience.com/how-to-evaluate-llms-and-algorithms-the-right-way/ Fri, 23 May 2025 14:02:00 +0000 https://towardsdatascience.com/?p=606070 This week, we focus on the best strategies for evaluating and benchmarking the performance of ML approaches

Agentic AI 102: Guardrails and Agent Evaluation https://towardsdatascience.com/agentic-ai-102-guardrails-and-agent-evaluation/ Fri, 16 May 2025 19:09:26 +0000 https://towardsdatascience.com/?p=606037 An introduction to tools that make your model safer, more predictable, and more performant.

How To Build a Benchmark for Your Models https://towardsdatascience.com/how-to-build-a-benchmark-for-your-models/ Thu, 15 May 2025 20:15:00 +0000 https://towardsdatascience.com/?p=606029 The Importance of Building a Benchmark and How To Do It

Attaining LLM Certainty with AI Decision Circuits https://towardsdatascience.com/attaining-llm-certainty-with-ai-decision-circuits/ Fri, 02 May 2025 19:01:38 +0000 https://towardsdatascience.com/?p=605893 Uncertainty is nothing new in technology; all modern systems overcome uncertain inputs and outputs with mathematically proven control structures.

Choose the Right One: Evaluating Topic Models for Business Intelligence https://towardsdatascience.com/choose-the-right-one-evaluating-topic-models-for-business-intelligence/ Thu, 24 Apr 2025 19:50:50 +0000 https://towardsdatascience.com/?p=605801 Python tutorial for evaluating top-notch bigram topic models in customer email classification

Learnings from a Machine Learning Engineer — Part 3: The Evaluation https://towardsdatascience.com/learnings-from-a-machine-learning-engineer-part-3-the-evaluation/ Thu, 13 Feb 2025 21:00:06 +0000 https://towardsdatascience.com/?p=597857 In this third part of my series, I will explore the evaluation process, a critical piece that leads to a cleaner data set and better model performance. We will see the difference between evaluating a trained model (one not yet in production) and evaluating a deployed model (one making real-world predictions). In Part 1, […]

How to Measure the Reliability of a Large Language Model’s Response https://towardsdatascience.com/how-to-measure-the-reliability-of-a-large-language-models-response/ Thu, 13 Feb 2025 02:11:41 +0000 https://towardsdatascience.com/?p=597790 The basic principle of Large Language Models (LLMs) is very simple: to predict the next word (or token) in a sequence of words based on statistical patterns in their training data. However, this seemingly simple capability turns out to be remarkably sophisticated, enabling a number of tasks such as text summarization, […]

Understanding Model Calibration: A Gentle Introduction & Visual Exploration https://towardsdatascience.com/understanding-model-calibration-a-gentle-introduction-visual-exploration/ Tue, 11 Feb 2025 22:00:41 +0000 https://towardsdatascience.com/?p=597690 How Reliable Are Your Predictions? To be considered reliable, a model must be calibrated so that its confidence in each decision closely reflects its true outcome. In this blog post we’ll take a look at the most commonly used definition for calibration and then dive into a frequently used evaluation measure for model calibration. […]

What Exactly Is an “Eval” and Why Should Product Managers Care? https://towardsdatascience.com/what-exactly-is-an-eval-and-why-should-product-managers-care-b596dca275a7/ Thu, 25 Jul 2024 16:40:04 +0000 https://towardsdatascience.com/what-exactly-is-an-eval-and-why-should-product-managers-care-b596dca275a7/ How to stop worrying and love the data

How to Evaluate Your Predictions https://towardsdatascience.com/how-to-evaluate-your-predictions-cef80d8f6a69/ Fri, 17 May 2024 06:59:57 +0000 https://towardsdatascience.com/how-to-evaluate-your-predictions-cef80d8f6a69/ Be mindful of the measure you choose

Interpreting R²: a Narrative Guide for the Perplexed https://towardsdatascience.com/interpreting-r%c2%b2-a-narrative-guide-for-the-perplexed-086a9a69c1ec/ Mon, 19 Feb 2024 21:57:28 +0000 https://towardsdatascience.com/interpreting-r%c2%b2-a-narrative-guide-for-the-perplexed-086a9a69c1ec/ An accessible walkthrough of fundamental properties of this popular, yet often misunderstood metric from a predictive modeling perspective

Exploring mergekit for Model Merge, AutoEval for Model Evaluation, and DPO for Model Fine-tuning https://towardsdatascience.com/exploring-mergekit-for-model-merge-and-autoeval-for-model-evaluation-c681766fd1f3/ Fri, 19 Jan 2024 17:12:55 +0000 https://towardsdatascience.com/exploring-mergekit-for-model-merge-and-autoeval-for-model-evaluation-c681766fd1f3/ My observations from experimenting with model merge, evaluation, and two model fine-tuning techniques

Understanding AUC Scores in Depth: What’s the Point? https://towardsdatascience.com/understanding-auc-scores-in-depth-whats-the-point-5f2505eb499f/ Sat, 02 Sep 2023 15:24:15 +0000 https://towardsdatascience.com/understanding-auc-scores-in-depth-whats-the-point-5f2505eb499f/ Exploring alternative metrics alongside AUC for deeper insights

How to Compare ML Solutions Effectively? https://towardsdatascience.com/how-to-compare-ml-solutions-effectively-28384e2cbca1/ Thu, 06 Jul 2023 17:34:39 +0000 https://towardsdatascience.com/how-to-compare-ml-solutions-effectively-28384e2cbca1/ When evaluating and comparing machine learning solutions, your first go-to evaluation metric will probably be predictive power. It’s easy to compare different models with one single metric, and this is perfectly fine in Kaggle competitions. In real life, the situation is different. Imagine two models: one model that uses […]

How to Evaluate the Performance of Your ML/ AI Models https://towardsdatascience.com/how-to-evaluate-the-performance-of-your-ml-ai-models-ba1debc6f2fa/ Sat, 20 May 2023 04:29:45 +0000 https://towardsdatascience.com/how-to-evaluate-the-performance-of-your-ml-ai-models-ba1debc6f2fa/ An accurate evaluation is the only way to improve performance

CRPS – A Scoring Function for Bayesian Machine Learning Models https://towardsdatascience.com/crps-a-scoring-function-for-bayesian-machine-learning-models-dd55a7a337a8/ Sat, 28 Jan 2023 01:27:52 +0000 https://towardsdatascience.com/crps-a-scoring-function-for-bayesian-machine-learning-models-dd55a7a337a8/ The Continuous Ranked Probability Score is a scoring function suitable for Bayesian ML models

Does Autocorrect Make Life Better? https://towardsdatascience.com/does-autocorrect-make-life-better-aebafbed09f0/ Sat, 08 Oct 2022 06:37:40 +0000 https://towardsdatascience.com/does-autocorrect-make-life-better-aebafbed09f0/ A cautionary tale of systemic machine learning failure

Building ML Models That Are Useless https://towardsdatascience.com/building-ml-models-that-are-useless-ba70df6957d/ Tue, 15 Feb 2022 20:06:31 +0000 https://towardsdatascience.com/building-ml-models-that-are-useless-ba70df6957d/ So you built an ML Model. Nice! Is it actually going to work? Not just in the Kaggle leaderboard sense, but in the literal staking-lives-on-this sense? The difference between these two was resoundingly demonstrated in a 2021 Nature paper [1] showing that, of all the recently published Covid-19 diagnostic AI models, none of them strongly […]

Multi-dimensional Decision Boundary : why current approaches fail and how to make it work https://towardsdatascience.com/multi-dimensional-machine-learning-decision-boundary-how-to-get-it-to-work-7122dca3b3a/ Thu, 13 Jan 2022 18:13:36 +0000 https://towardsdatascience.com/multi-dimensional-machine-learning-decision-boundary-how-to-get-it-to-work-7122dca3b3a/ The decision boundary is a very important visual tool for model evaluation. See how to get it to work on complex datasets
