Model Evaluation | Towards Data Science

Accuracy Is Dead: Calibration, Discrimination, and Other Metrics You Actually Need

Pol Marin — Tue, 15 Jul 2025 00:41:39 +0000

A deep dive into advanced evaluation for data scientists

The post Accuracy Is Dead: Calibration, Discrimination, and Other Metrics You Actually Need appeared first on Towards Data Science.

How to Evaluate LLMs and Algorithms — The Right Way

TDS Editors — Fri, 23 May 2025 14:02:00 +0000

This week, we focus on the best strategies for evaluating and benchmarking the performance of ML approaches

Agentic AI 102: Guardrails and Agent Evaluation

Gustavo Santos — Fri, 16 May 2025 19:09:26 +0000

An introduction to tools that make your model safer, more predictable, and more performant.

How To Build a Benchmark for Your Models

Lorenzo Mezzini — Thu, 15 May 2025 20:15:00 +0000

The Importance of Building a Benchmark and How To Do It

Attaining LLM Certainty with AI Decision Circuits

James Barney — Fri, 02 May 2025 19:01:38 +0000

Uncertainty is nothing new in technology — all modern systems overcome uncertain inputs and outputs with mathematically proven control structures.

Choose the Right One: Evaluating Topic Models for Business Intelligence

Petr Koráb — Thu, 24 Apr 2025 19:50:50 +0000

Python tutorial for evaluating top-notch bigram topic models in customer email classification

Learnings from a Machine Learning Engineer — Part 3: The Evaluation

David Martin — Thu, 13 Feb 2025 21:00:06 +0000

In this third part of my series, I will explore the evaluation process, a critical piece that leads to a cleaner data set and better model performance. We will see the difference between evaluating a trained model (one not yet in production) and evaluating a deployed model (one making real-world predictions). In Part 1, […]

How to Measure the Reliability of a Large Language Model’s Response

Umair Ali Khan — Thu, 13 Feb 2025 02:11:41 +0000

The basic principle of Large Language Models (LLMs) is very simple: to predict the next word (or token) in a sequence of words based on statistical patterns in their training data. However, this seemingly simple capability proves remarkably powerful, since it supports a range of tasks such as text summarization, […]

Understanding Model Calibration: A Gentle Introduction & Visual Exploration

Maja Pavlovic — Tue, 11 Feb 2025 22:00:41 +0000

How Reliable Are Your Predictions? To be considered reliable, a model must be calibrated so that its confidence in each decision closely reflects its true outcome. In this blog post, we’ll look at the most commonly used definition of calibration and then dive into a frequently used evaluation measure for model calibration. […]
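
The teaser above does not name the evaluation measure it goes on to discuss. One widely used measure of this kind is the expected calibration error (ECE), so the sketch below assumes that is a reasonable illustration rather than the article's own method. The function name, the equal-width binning, and the half-open bin convention are all assumptions made for this example.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Illustrative ECE: weighted mean gap between accuracy and confidence per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Per bin: [number of predictions, sum of confidences, number of correct predictions].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(ok)

    ece = 0.0
    for count, conf_sum, hits in bins:
        if count:
            accuracy = hits / count
            mean_conf = conf_sum / count
            ece += (count / n) * abs(accuracy - mean_conf)
    return ece


# A model that says "90% sure" and is right 9 times out of 10 is well calibrated.
well_calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
# The same stated confidence with only 5 correct answers is overconfident.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
```

Here `well_calibrated` comes out at essentially zero, while `overconfident` is about 0.4, the gap between the stated 90% confidence and the observed 50% accuracy.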

What Exactly Is an “Eval” and Why Should Product Managers Care?

Julia Winn — Thu, 25 Jul 2024 16:40:04 +0000

How to stop worrying and love the data

How to Evaluate Your Predictions

Jeffrey Näf — Fri, 17 May 2024 06:59:57 +0000

Be mindful of the measure you choose

Interpreting R²: a Narrative Guide for the Perplexed

Roberta Rocca — Mon, 19 Feb 2024 21:57:28 +0000

An accessible walkthrough of fundamental properties of this popular, yet often misunderstood metric from a predictive modeling perspective

Exploring mergekit for Model Merge, AutoEval for Model Evaluation, and DPO for Model Fine-tuning

Wenqi Glantz — Fri, 19 Jan 2024 17:12:55 +0000

My observations from experimenting with model merge, evaluation, and two model fine-tuning techniques

Understanding AUC Scores in Depth: What’s the Point?

Maham Haroon — Sat, 02 Sep 2023 15:24:15 +0000

Exploring alternative metrics alongside AUC for deeper insights

How to Compare ML Solutions Effectively?

Hennie de Harder — Thu, 06 Jul 2023 17:34:39 +0000

When evaluating and comparing machine learning solutions, your first go-to evaluation metric will probably be predictive power. It’s easy to compare different models with one single metric, and this is perfectly fine in Kaggle competitions. In real life, the situation is different. Imagine two models: one model that uses […]

How to Evaluate the Performance of Your ML/AI Models

Sara A. Metwalli — Sat, 20 May 2023 04:29:45 +0000

Accurate evaluation is the only way to improve performance

CRPS – A Scoring Function for Bayesian Machine Learning Models

Itamar Faran — Sat, 28 Jan 2023 01:27:52 +0000

The Continuous Ranked Probability Score is a scoring function suitable for Bayesian ML models
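
As a rough illustration of how such a score can be computed when a Bayesian model yields posterior predictive samples, the sketch below uses the standard sample-based identity CRPS = E|X − y| − ½·E|X − X′|, where X and X′ are independent draws from the forecast and y is the observed value. The function name and the simple O(n²) estimator are assumptions for this example and are not taken from the article.

```python
from typing import Sequence


def crps_from_samples(samples: Sequence[float], observed: float) -> float:
    """Sample-based CRPS estimate; lower is better, 0 is a perfect point forecast."""
    n = len(samples)
    if n == 0:
        raise ValueError("at least one forecast sample is required")

    # E|X - y|: average distance between forecast samples and the observation.
    accuracy_term = sum(abs(x - observed) for x in samples) / n
    # E|X - X'|: average distance between pairs of forecast samples (spread).
    spread_term = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return accuracy_term - 0.5 * spread_term


# Two equally likely forecast values around an observation of 1.0.
score = crps_from_samples([0.0, 2.0], observed=1.0)
# With a single sample, the score reduces to the absolute error.
point_score = crps_from_samples([3.0], observed=1.0)
```

With these inputs `score` is 0.5 and `point_score` is 2.0, which shows that the score generalizes absolute error from point forecasts to full predictive distributions.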

Does Autocorrect Make Life Better?

John Hawkins — Sat, 08 Oct 2022 06:37:40 +0000

A cautionary tale of systemic machine learning failure

Building ML Models That Are Useless

Michael Potter — Tue, 15 Feb 2022 20:06:31 +0000

So you built an ML Model. Nice! Is it actually going to work? Not just in the Kaggle leaderboard sense, but in the literal staking-lives-on-this sense? The difference between these two was resoundingly demonstrated in a 2021 Nature paper [1] showing that, of all the recently published Covid-19 diagnostic AI models, none of them strongly […]

Multi-dimensional Decision Boundary: why current approaches fail and how to make it work

Pranay Dave — Thu, 13 Jan 2022 18:13:36 +0000

The decision boundary is a very important visual tool for model evaluation. See how to get it to work on complex datasets
