LLM Evaluation

Evaluation-Driven Development for LLM-Powered Products: Lessons from Building in Healthcare
Large Language Models
How metrics and monitoring combine with human expertise to build trustworthy AI in healthcare.
30 min read

How to Scale LLM Evaluations Beyond Manual Review
16 min read

It’s like grading papers, but your student is an LLM
12 min read

How to monitor the quality of your LLM product
30 min read

How to get from PoCs to tested high-quality applications in production
23 min read

Practitioner's guide to judging outputs of large language models
14 min read

Evaluation Framework for real-world requirements
9 min read

Are the reasoning capabilities of OpenAI LLMs good enough to play the classic guessing game?
14 min read

Exploring RAG techniques to improve retrieval accuracy
8 min read

Judge an LLM Judge: A Dual-Layer Evaluation Framework for Continuous Improvement of LLM Evaluation
Machine Learning
Can “the evaluation of an LLM application by an LLM judge” be audited by another…
13 min read