{"id":5708,"date":"2023-06-16T18:39:53","date_gmt":"2023-06-16T18:39:53","guid":{"rendered":"https:\/\/towardsdatascience.com\/beware-of-unreliable-data-in-model-evaluation-a-llm-prompt-selection-case-study-with-flan-t5-88cfd469d058\/"},"modified":"2025-01-08T20:25:38","modified_gmt":"2025-01-08T20:25:38","slug":"beware-of-unreliable-data-in-model-evaluation-a-llm-prompt-selection-case-study-with-flan-t5-88cfd469d058","status":"publish","type":"post","link":"https:\/\/towardsdatascience.com\/beware-of-unreliable-data-in-model-evaluation-a-llm-prompt-selection-case-study-with-flan-t5-88cfd469d058\/","title":{"rendered":"Beware of Unreliable Data in Model Evaluation: A LLM Prompt Selection case study with Flan-T5"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\"><em>Authors: Chris Mauck, Jonas Mueller<\/em><\/p>\n<p class=\"wp-block-paragraph\">Reliable <strong>model evaluation<\/strong> lies at the heart of MLops and LLMops, guiding crucial decisions like which model or prompt to deploy (and whether to deploy at all). In this article, we prompt the FLAN-T5 LLM from <a href=\"https:\/\/ai.googleblog.com\/2021\/10\/introducing-flan-more-generalizable.html\">Google Research<\/a> with various prompts in an effort to classify text as polite or impolite. Amongst the prompt candidates, we find the prompts that appear to perform best based on observed test accuracy are often <em>actually<\/em> <em>worse<\/em> than other prompt candidates. A closer review of the test data reveals this is due to unreliable annotations. 
<strong>In real-world applications, you may choose suboptimal prompts for your LLM (or make other suboptimal choices guided by model evaluation) unless you clean your test data to ensure it is reliable.<\/strong><\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"dfe1e0\" data-has-transparency=\"false\" style=\"--dominant-color: #dfe1e0;\" loading=\"lazy\" decoding=\"async\" width=\"2500\" height=\"1407\" class=\"wp-image-320523 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1Yn-WMRBh-W4dyNBUqErjuw.png\" alt=\"Selecting great prompts is essential for ensuring accurate responses from Large Language Models.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1Yn-WMRBh-W4dyNBUqErjuw.png 2500w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1Yn-WMRBh-W4dyNBUqErjuw-300x169.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1Yn-WMRBh-W4dyNBUqErjuw-1024x576.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1Yn-WMRBh-W4dyNBUqErjuw-768x432.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1Yn-WMRBh-W4dyNBUqErjuw-1536x864.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1Yn-WMRBh-W4dyNBUqErjuw-2048x1153.png 2048w\" sizes=\"auto, (max-width: 2500px) 100vw, 2500px\" \/><figcaption class=\"wp-element-caption\">Selecting great prompts is essential for ensuring accurate responses from Large Language Models.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">While the harms of noisy annotations are well-characterized in training data, this article demonstrates their often-overlooked consequences in test data.<\/p>\n<p class=\"wp-block-paragraph\">I am currently a data scientist at <a href=\"https:\/\/cleanlab.ai\/\">Cleanlab<\/a> and I&#8217;m excited to share the importance of (and how you can ensure) high-quality test data to ensure optimal LLM prompt 
selection.<\/p>\n<h2 class=\"wp-block-heading\">Overview<\/h2>\n<p class=\"wp-block-paragraph\">You can download the data <a href=\"https:\/\/s.cleanlab.ai\/stanford-politeness-prompt-selection.csv\">here<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">This article studies a binary classification variant of the <a href=\"https:\/\/convokit.cornell.edu\/documentation\/wiki_politeness.html\">Stanford Politeness Dataset<\/a> (used under <a href=\"https:\/\/creativecommons.org\/licenses\/by\/4.0\/\">CC BY license v4.0<\/a>), which has text phrases labeled as <em>polite<\/em> or <em>impolite<\/em>. We evaluate models using a fixed test dataset containing 700 phrases.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"e8e8e8\" data-has-transparency=\"false\" style=\"--dominant-color: #e8e8e8;\" loading=\"lazy\" decoding=\"async\" width=\"1324\" height=\"340\" class=\"wp-image-320525 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1LRcNgU4pO_C21ym4lkZvKg.png\" alt=\"Snapshot of the dataset showing the text and ground truth politeness label.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1LRcNgU4pO_C21ym4lkZvKg.png 1324w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1LRcNgU4pO_C21ym4lkZvKg-300x77.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1LRcNgU4pO_C21ym4lkZvKg-1024x263.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1LRcNgU4pO_C21ym4lkZvKg-768x197.png 768w\" sizes=\"auto, (max-width: 1324px) 100vw, 1324px\" \/><figcaption class=\"wp-element-caption\">Snapshot of the dataset showing the text and ground truth politeness label.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">It is standard practice to evaluate how &quot;good&quot; a classification model is by measuring the accuracy of its predictions against the given labels for examples the model did not see during training, usually referred to 
as &quot;test&quot;, &quot;evaluation&quot;, or &quot;validation&quot; data. This provides a numerical metric to gauge how good model A is against model B &#8211; if model A displays higher test accuracy, we estimate it to be the better model and would choose to deploy it over model B. Beyond model selection, the same decision-making framework can be applied to other choices like whether to use: hyperparameter-setting A or B, prompt A or B, feature-set A or B, etc.<\/p>\n<p class=\"wp-block-paragraph\">A <a href=\"https:\/\/datasets-benchmarks-proceedings.neurips.cc\/paper\/2021\/file\/f2217062e9a397a1dca429e7d70bc6ca-Paper-round1.pdf\">common problem<\/a> in real-world test data is that some examples have incorrect labels, whether due to human annotation error, data processing error, sensor noise, etc. In such cases, test accuracy becomes a less reliable indicator of the <strong>relative performance<\/strong> between model A and model B. Let&#8217;s use a very simple example to illustrate this. Imagine your test dataset has two examples of <em>impolite<\/em> text, but <strong>unbeknownst to you<\/strong>, they are (mis)labeled as <em>polite<\/em>. For instance, in our Stanford Politeness dataset, we see an actual human annotator mistakenly labeled this text &quot;<em>Are you crazy down here?! What the heck is going on?<\/em>&quot; as <em>polite<\/em> when the language is clearly agitated. Now your job is to pick the best model to classify these examples. Model A says that both examples are <em>impolite<\/em> and model B says both examples are <em>polite<\/em>. Based on these (incorrect) labels, model A scores 0% while model B scores 100% &#8211; you pick model B to deploy! 
But wait, which model is <em>actually<\/em> stronger?<\/p>\n<p class=\"wp-block-paragraph\">Although these implications may seem obvious, and many are aware that real-world data is full of labeling errors, folks often focus only on noisy labels in their training data, forgetting to carefully curate their test data even though it guides crucial decisions. Using real data, this article illustrates the importance of high-quality test data to guide the choice of LLM prompts and demonstrates one way to easily improve data quality via algorithmic techniques.<\/p>\n<h2 class=\"wp-block-heading\">Observed Test Accuracy vs Clean Test Accuracy<\/h2>\n<p class=\"wp-block-paragraph\">Here we consider two possible test sets constructed out of the same set of text examples, which differ only in some (~30%) of the labels. Representing typical data you&#8217;d use to evaluate accuracy, one version has labels sourced from a single annotation (human rater) per example, and we report the accuracy of model predictions computed on this version as <em>Observed Test Accuracy<\/em>. A second <em>cleaner<\/em> version of this same test set has high-quality labels established via consensus amongst many agreeing annotations per example (derived from multiple human raters). We report accuracy measured on the cleaner version as <em>Clean Test Accuracy<\/em>. 
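As a minimal illustration of how these two measurements can disagree, the sketch below scores the same model predictions against noisy single-annotator labels and against consensus labels (the labels and predictions here are hypothetical, not drawn from the actual dataset):

```python
# Minimal sketch (hypothetical data): the same predictions scored against
# noisy single-annotator labels vs. consensus "clean" labels.

def accuracy(predictions, labels):
    """Fraction of predictions that match the given labels."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

# 1 = polite, 0 = impolite. Two of the five observed labels are wrong.
observed_labels = [1, 1, 0, 1, 1]   # single annotator (noisy)
clean_labels    = [1, 0, 0, 1, 0]   # consensus of many annotators

model_a = [1, 0, 0, 1, 0]  # matches the clean labels
model_b = [1, 1, 0, 1, 1]  # matches the noisy labels

# Observed Test Accuracy favors model B...
print(accuracy(model_a, observed_labels), accuracy(model_b, observed_labels))  # -> 0.6 1.0
# ...but Clean Test Accuracy reveals model A is the better model.
print(accuracy(model_a, clean_labels), accuracy(model_b, clean_labels))  # -> 1.0 0.6
```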
Thus, <em>Clean Test Accuracy<\/em> more closely reflects what you care about (actual model deployment performance), but the <em>Observed Test Accuracy<\/em> is all you get to observe in most applications &#8211; unless you first clean your test data!<\/p>\n<p class=\"wp-block-paragraph\">Below are two test examples where the single human annotator mislabeled the example, but the group of many human annotators agreed on the correct label.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"f2f3f1\" data-has-transparency=\"false\" style=\"--dominant-color: #f2f3f1;\" loading=\"lazy\" decoding=\"async\" width=\"2500\" height=\"1045\" class=\"wp-image-320526 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/13lmoRXur6xA0n3Ss8sZh_A.png\" alt=\"The orange annotations collected from a single annotator are cheaper to collect, but oftentimes incorrect. The blue annotations are collected from multiple annotators which are more expensive, but usually more accurate.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/13lmoRXur6xA0n3Ss8sZh_A.png 2500w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/13lmoRXur6xA0n3Ss8sZh_A-300x125.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/13lmoRXur6xA0n3Ss8sZh_A-1024x428.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/13lmoRXur6xA0n3Ss8sZh_A-768x321.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/13lmoRXur6xA0n3Ss8sZh_A-1536x642.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/13lmoRXur6xA0n3Ss8sZh_A-2048x856.png 2048w\" sizes=\"auto, (max-width: 2500px) 100vw, 2500px\" \/><figcaption class=\"wp-element-caption\">The orange annotations collected from a single annotator are cheaper to collect, but oftentimes incorrect. 
The blue annotations are collected from multiple annotators, which is more expensive but usually more accurate.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">In real-world projects, you often don&#8217;t have access to such &quot;clean&quot; labels, so you can only measure <em>Observed Test Accuracy<\/em>. If you are making critical decisions such as which LLM or prompt to use based on this metric, be sure to first verify the labels are high-quality. Otherwise, we find you may make the <strong>wrong decisions<\/strong>, as observed below when selecting prompts for politeness classification.<\/p>\n<h2 class=\"wp-block-heading\">Impact of Noisy Evaluation Data<\/h2>\n<p class=\"wp-block-paragraph\">As a predictive model to classify the politeness of text, it is natural to employ a pretrained Large Language Model (LLM). Here, we specifically use data scientists&#8217; favorite LLM &#8211; the open-source FLAN-T5 model. To get this LLM to accurately predict the politeness of text, we must feed it just the right prompts. Prompt engineering can be very sensitive, with small changes greatly affecting accuracy!<\/p>\n<p class=\"wp-block-paragraph\">Prompts A and B shown below (highlighted text) are two different examples of <em>chain-of-thought<\/em> prompts that can be prepended to any <strong>text sample<\/strong> in order to get the LLM to classify its politeness. These prompts combine <em>few-shot<\/em> and <em>instruction<\/em> prompts (details later) that provide examples, the correct response, and a justification that encourages the LLM to explain its reasoning. The only difference between these two prompts is the highlighted text that is actually eliciting a response from the LLM. 
The few-shot examples and reasoning remain the same.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"ecece8\" data-has-transparency=\"false\" style=\"--dominant-color: #ecece8;\" loading=\"lazy\" decoding=\"async\" width=\"2500\" height=\"1539\" class=\"wp-image-320527 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1u0o9r9kWy_C1f1E2Gb0KzA.png\" alt=\"chain-of-thought prompts provide the model with reasoning as to why the answer is correct for each text example given.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1u0o9r9kWy_C1f1E2Gb0KzA.png 2500w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1u0o9r9kWy_C1f1E2Gb0KzA-300x185.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1u0o9r9kWy_C1f1E2Gb0KzA-1024x630.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1u0o9r9kWy_C1f1E2Gb0KzA-768x473.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1u0o9r9kWy_C1f1E2Gb0KzA-1536x946.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1u0o9r9kWy_C1f1E2Gb0KzA-2048x1261.png 2048w\" sizes=\"auto, (max-width: 2500px) 100vw, 2500px\" \/><figcaption class=\"wp-element-caption\">chain-of-thought prompts provide the model with reasoning as to why the answer is correct for each text example given.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The natural way to decide which prompt is better is based on their <em>Observed Test Accuracy.<\/em> When used to prompt the FLAN-T5 LLM, we see below that the classifications produced by Prompt A have higher <em>Observed Test Accuracy<\/em> on the original test set than those from Prompt B. So obviously we should deploy our LLM with Prompt A, <em>right<\/em>? 
Not so fast!<\/p>\n<p class=\"wp-block-paragraph\">When we assess the <em>Clean Test Accuracy<\/em> of each prompt, we find that Prompt B is actually <strong>much better<\/strong> than Prompt A (by 4.5 percentage points). Since <em>Clean Test Accuracy<\/em> more closely reflects the true performance we actually care about, we would&#8217;ve made the wrong decision if we just relied on the original test data without examining its label quality!<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"eef0ed\" data-has-transparency=\"false\" style=\"--dominant-color: #eef0ed;\" loading=\"lazy\" decoding=\"async\" width=\"2500\" height=\"1032\" class=\"wp-image-320530 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1pu0j-nBc4DUQSTGDht_8dA.png\" alt=\"Using the observed accuracy, you would select Prompt A as better. Prompt B is actually the better prompt when evaluated on the clean test set.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1pu0j-nBc4DUQSTGDht_8dA.png 2500w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1pu0j-nBc4DUQSTGDht_8dA-300x124.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1pu0j-nBc4DUQSTGDht_8dA-1024x423.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1pu0j-nBc4DUQSTGDht_8dA-768x317.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1pu0j-nBc4DUQSTGDht_8dA-1536x634.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1pu0j-nBc4DUQSTGDht_8dA-2048x845.png 2048w\" sizes=\"auto, (max-width: 2500px) 100vw, 2500px\" \/><figcaption class=\"wp-element-caption\">Using the observed accuracy, you would select Prompt A as better. 
Prompt B is actually the better prompt when evaluated on the clean test set.<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Is this just statistical fluctuation?<\/h2>\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/en.wikipedia.org\/wiki\/McNemar%27s_test\">McNemar&#8217;s test<\/a> is a recommended way to assess the statistical significance of reported differences in ML accuracy. When we apply this test to assess the 4.5% difference in <em>Clean Test Accuracy<\/em> between Prompts A and B over our 700 text examples, the difference is highly statistically significant (p-value = 0.007, <em>X\u00b2<\/em> = 7.086). Thus, all evidence suggests Prompt B is a meaningfully better choice &#8211; yet we would have failed to select it had we not carefully audited our original test data!<\/p>\n<h2 class=\"wp-block-heading\">Is this a fluke result that just happened to be the case for these two prompts?<\/h2>\n<p class=\"wp-block-paragraph\">Let&#8217;s look at other types of prompts as well to see if the results were just coincidental for our pair of chain-of-thought prompts.<\/p>\n<h2 class=\"wp-block-heading\">Instruction Prompts<\/h2>\n<p class=\"wp-block-paragraph\">This type of prompt simply provides an <em>instruction<\/em> to the LLM on what it needs to do with the text example given. 
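An instruction prompt can be as simple as a fixed string prepended to each text sample; the wording below is a hypothetical illustration, not one of the exact prompts evaluated in this study:

```python
# Hypothetical instruction prompt for politeness classification; the exact
# wording of the prompts compared in this study may differ.
INSTRUCTION = (
    "Classify the politeness of the following text. "
    "Respond with exactly one word: polite or impolite.\n\nText: "
)

def build_instruction_prompt(text: str) -> str:
    """Prepend the fixed instruction to the text sample to be classified."""
    return INSTRUCTION + text

print(build_instruction_prompt("Could you please take another look?"))
```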
Consider the following pair of such prompts we might want to choose between.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"f0f2ef\" data-has-transparency=\"false\" style=\"--dominant-color: #f0f2ef;\" loading=\"lazy\" decoding=\"async\" width=\"2500\" height=\"899\" class=\"wp-image-320532 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1b5muu8uEeJ73TyMTfF8hJA.png\" alt=\"\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1b5muu8uEeJ73TyMTfF8hJA.png 2500w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1b5muu8uEeJ73TyMTfF8hJA-300x108.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1b5muu8uEeJ73TyMTfF8hJA-1024x368.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1b5muu8uEeJ73TyMTfF8hJA-768x276.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1b5muu8uEeJ73TyMTfF8hJA-1536x552.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1b5muu8uEeJ73TyMTfF8hJA-2048x736.png 2048w\" sizes=\"auto, (max-width: 2500px) 100vw, 2500px\" \/><\/figure>\n<h2 class=\"wp-block-heading\">Few-Shot Prompts<\/h2>\n<p class=\"wp-block-paragraph\">This type of prompt uses two instructions, a <em>prefix,<\/em> and a <em>suffix,<\/em> and also includes two (pre-selected) examples from the text corpus to provide clear demonstrations to the LLM of the desired input-output mapping. 
Consider the following pair of such prompts we might want to choose between.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"eef0ee\" data-has-transparency=\"false\" style=\"--dominant-color: #eef0ee;\" loading=\"lazy\" decoding=\"async\" width=\"2500\" height=\"1203\" class=\"wp-image-320535 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1AqZWxAhY3SRhZ6S1hjhiFg.png\" alt=\"\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1AqZWxAhY3SRhZ6S1hjhiFg.png 2500w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1AqZWxAhY3SRhZ6S1hjhiFg-300x144.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1AqZWxAhY3SRhZ6S1hjhiFg-1024x493.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1AqZWxAhY3SRhZ6S1hjhiFg-768x370.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1AqZWxAhY3SRhZ6S1hjhiFg-1536x739.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1AqZWxAhY3SRhZ6S1hjhiFg-2048x985.png 2048w\" sizes=\"auto, (max-width: 2500px) 100vw, 2500px\" \/><\/figure>\n<h2 class=\"wp-block-heading\">Templatized Prompts<\/h2>\n<p class=\"wp-block-paragraph\">This type of prompt uses two instructions, an optional <em>prefix,<\/em> and a <em>suffix,<\/em> in addition to multiple-choice formatting so that the model performs classification as a multiple-choice answer rather than responding directly with a predicted class. 
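A minimal sketch of such a multiple-choice template (the prefix, suffix, and option wording below are hypothetical illustrations, not the exact prompts evaluated here):

```python
# Hypothetical multiple-choice ("templatized") prompt: the model answers by
# selecting an option instead of generating the class name directly.
PREFIX = "Read the text below and answer the question that follows.\n\n"
SUFFIX = (
    "\n\nIs this text polite or impolite?\n"
    "OPTIONS:\n- polite\n- impolite\nANSWER:"
)

def build_templatized_prompt(text: str) -> str:
    """Wrap the text sample with the prefix, question, and answer choices."""
    return f"{PREFIX}Text: {text}{SUFFIX}"

print(build_templatized_prompt("Thanks for the quick reply!"))
```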
Consider the following pair of such prompts we might want to choose between.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"f0f1ee\" data-has-transparency=\"false\" style=\"--dominant-color: #f0f1ee;\" loading=\"lazy\" decoding=\"async\" width=\"2500\" height=\"947\" class=\"wp-image-320536 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1iyXcKMcAEbM60VWYDB3lnw.png\" alt=\"\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1iyXcKMcAEbM60VWYDB3lnw.png 2500w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1iyXcKMcAEbM60VWYDB3lnw-300x114.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1iyXcKMcAEbM60VWYDB3lnw-1024x388.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1iyXcKMcAEbM60VWYDB3lnw-768x291.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1iyXcKMcAEbM60VWYDB3lnw-1536x582.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1iyXcKMcAEbM60VWYDB3lnw-2048x776.png 2048w\" sizes=\"auto, (max-width: 2500px) 100vw, 2500px\" \/><\/figure>\n<h2 class=\"wp-block-heading\">Results for various types of prompts<\/h2>\n<p class=\"wp-block-paragraph\">Beyond chain-of-thought, we also evaluated the classification performance of the same FLAN-T5 LLM with these three additional types of prompts. Plotting the <em>Observed Test Accuracy<\/em> vs. 
<em>Clean Test Accuracy<\/em> achieved with all of these prompts below, we see many pairs of prompts that suffer from the same aforementioned problem, where relying on <em>Observed Test Accuracy<\/em> leads to selecting the prompt that is actually worse.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"f9f9f9\" data-has-transparency=\"true\" style=\"--dominant-color: #f9f9f9;\" loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"480\" class=\"wp-image-320537 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1j4UTPLn5ESVGuvBLXYlLoA.png\" alt=\"As a prompt engineer using the available test data, you would choose the gray A prompt in the upper left (highest observed accuracy) yet the optimal prompt is actually the gray B in the upper right (highest clean accuracy).\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1j4UTPLn5ESVGuvBLXYlLoA.png 640w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1j4UTPLn5ESVGuvBLXYlLoA-300x225.png 300w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><figcaption class=\"wp-element-caption\">As a prompt engineer using the available test data, you would choose the gray A prompt in the upper left (highest observed accuracy), yet the optimal prompt is actually the gray B in the upper right (highest clean accuracy).<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Based solely on the <em>Observed Test Accuracy<\/em>, you would be inclined to select each of the &quot;A&quot; prompts over the &quot;B&quot; prompts amongst each type of prompt. However, the better prompt for each of the prompt types is actually prompt B (which has higher <em>Clean Test Accuracy<\/em>). 
<strong>Each of these prompt pairs highlights the need to verify test data quality; otherwise, you can make suboptimal decisions due to data issues like noisy annotations.<\/strong><\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"e4e6e4\" data-has-transparency=\"false\" style=\"--dominant-color: #e4e6e4;\" loading=\"lazy\" decoding=\"async\" width=\"2500\" height=\"2347\" class=\"wp-image-320538 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1ygTALO30urccoKT9oFMwMA.png\" alt=\"All of the A prompts appear to be better due to their higher observed accuracy, yet all of the B prompts are objectively better when evaluated on the ground truth test data.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1ygTALO30urccoKT9oFMwMA.png 2500w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1ygTALO30urccoKT9oFMwMA-300x282.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1ygTALO30urccoKT9oFMwMA-1024x961.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1ygTALO30urccoKT9oFMwMA-768x721.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1ygTALO30urccoKT9oFMwMA-1536x1442.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1ygTALO30urccoKT9oFMwMA-2048x1923.png 2048w\" sizes=\"auto, (max-width: 2500px) 100vw, 2500px\" \/><figcaption class=\"wp-element-caption\">All of the A prompts appear to be better due to their higher observed accuracy, yet all of the B prompts are objectively better when evaluated on the ground truth test data.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">You can also see in this graphic that all of the A prompts&#8217; observed accuracies are circled, meaning they are higher than those of their B counterparts. Similarly, all of the B prompts&#8217; clean accuracies are circled, meaning they are higher than those of their A counterparts. 
Just like the simple example at the beginning of this article, you would be inclined to pick all of the A prompts, when in actuality the B prompts do a much better job.<\/p>\n<h2 class=\"wp-block-heading\">Improving Available Test Data for More Reliable Evaluation<\/h2>\n<p class=\"wp-block-paragraph\">Hopefully, the importance of high-quality evaluation data is clear. Let&#8217;s look at a couple of ways you could go about fixing the available test data.<\/p>\n<h3 class=\"wp-block-heading\">Manual Correction<\/h3>\n<p class=\"wp-block-paragraph\">The easiest way to ensure the quality of your test data is to simply review it by hand! Make sure to look through each of the examples to verify that each is labeled correctly. Depending on the size of your test set, this may or may not be feasible. If your test set is relatively small (~100 examples), you could just look through them and make any corrections necessary. If your test set is large (1000+ examples), this would be too time-consuming and mentally too taxing to do by hand. Our test set is quite large, so we won&#8217;t be using this method!<\/p>\n<h3 class=\"wp-block-heading\">Algorithmic Correction<\/h3>\n<p class=\"wp-block-paragraph\">Another way to assess your available (possibly noisy) test set is to use data-centric AI algorithms in order to diagnose issues that can be fixed to obtain a more reliable version of the same dataset (without having to collect many additional human annotations). Here we use Confident Learning algorithms (via the open-source <a href=\"https:\/\/github.com\/cleanlab\/cleanlab\">cleanlab<\/a> package) to check our test data, which automatically estimate which examples appear to be mislabeled. We then inspect only these auto-detected label issues and fix their labels as needed to produce a higher-quality version of our test dataset. 
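The core intuition can be sketched in a few lines: flag examples where a trained model assigns low probability ("self-confidence") to the given label. Note this is a simplified stand-in for the full Confident Learning algorithm implemented in the cleanlab package, which is considerably more robust than this bare heuristic:

```python
# Simplified sketch of automated label-issue detection: rank examples by the
# model's confidence in their *given* label. The actual cleanlab package
# implements the full Confident Learning algorithm, not just this heuristic.

def rank_label_issues(pred_probs, labels, threshold=0.5):
    """Return indices of suspected mislabels, least confident first.

    pred_probs: per-example [P(impolite), P(polite)] from any trained model
    labels: given (possibly noisy) labels, 0 = impolite, 1 = polite
    """
    self_confidence = [probs[label] for probs, label in zip(pred_probs, labels)]
    suspects = [i for i, conf in enumerate(self_confidence) if conf < threshold]
    return sorted(suspects, key=lambda i: self_confidence[i])

# Hypothetical model outputs: the model strongly disagrees with the
# given label on example 1, so it is flagged for human review.
pred_probs = [[0.1, 0.9], [0.95, 0.05], [0.4, 0.6]]
labels = [1, 1, 1]
print(rank_label_issues(pred_probs, labels))  # -> [1]
```

Only the flagged indices then need manual review, which is what makes this approach feasible for large test sets.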
We call model accuracy measurements made over this version of the test dataset, the <em>CL Test Accuracy.<\/em><\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"dee7de\" data-has-transparency=\"false\" style=\"--dominant-color: #dee7de;\" loading=\"lazy\" decoding=\"async\" width=\"2500\" height=\"1942\" class=\"wp-image-320541 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1L2hy9GYkWy46UvmN_S60cw.png\" alt=\"The CL test accuracy is greater for all of the B prompts. Using CL we corrected the original test data and now can trust our model and prompt decisions.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1L2hy9GYkWy46UvmN_S60cw.png 2500w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1L2hy9GYkWy46UvmN_S60cw-300x233.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1L2hy9GYkWy46UvmN_S60cw-1024x795.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1L2hy9GYkWy46UvmN_S60cw-768x597.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1L2hy9GYkWy46UvmN_S60cw-1536x1193.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/06\/1L2hy9GYkWy46UvmN_S60cw-2048x1591.png 2048w\" sizes=\"auto, (max-width: 2500px) 100vw, 2500px\" \/><figcaption class=\"wp-element-caption\">The CL test accuracy is greater for all of the B prompts. Using CL we corrected the original test data and now can trust our model and prompt decisions.<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Using this new CL-corrected test set for model evaluation, we see that all of the B prompts from before now properly display higher accuracy than their A counterparts. 
This means we can trust our decisions made based on the CL-corrected test set to be more reliable than those made based on the noisy original test data.<\/p>\n<p class=\"wp-block-paragraph\">Of course, Confident Learning cannot magically identify all errors in any dataset. How well this algorithm detects labeling errors will depend on having reasonable predictions from a baseline ML model and even then, certain types of systematically-introduced errors will remain undetectable (for instance if we swap the definition of two classes entirely). For the precise list of mathematical assumptions under which Confident Learning can be proven effective, refer to the <a href=\"https:\/\/dl.acm.org\/doi\/10.1613\/jair.1.12125\">original paper by Northcutt et al.<\/a> For many real-world text\/image\/audio\/tabular datasets, this algorithm appears to at least offer an effective way to focus limited data reviewing resources on the most suspicious examples lurking in a large dataset.<\/p>\n<p class=\"wp-block-paragraph\"><strong>You don&#8217;t always need to spend the time\/resources to curate a &quot;perfect&quot; evaluation set &#8211; using algorithms like Confident Learning to diagnose and correct possible issues in your available test set can provide high-quality data to ensure optimal prompt and model selections.<\/strong><\/p>\n<p class=\"wp-block-paragraph\"><em>All images unless otherwise noted are by the author.<\/em><\/p>","protected":false},"excerpt":{"rendered":"<p>You may choose suboptimal prompts for your LLM (or make other suboptimal choices via model evaluation) unless you clean your test data<\/p>\n","protected":false},"author":18,"featured_media":5709,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"is_member_only":false,"sub_heading":"You may choose suboptimal prompts for your LLM (or make other suboptimal choices via model evaluation) unless you clean your test 
data","footnotes":""},"categories":[17,44,21,22],"tags":[447,448,450,446,656],"sponsor":[],"coauthors":[27503]}
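
The Confident Learning thresholding rule discussed in the article's conclusion can be sketched in a few lines of numpy. This is a toy illustration of the core idea from Northcutt et al. (the function name and example data are hypothetical); the open-source cleanlab library implements the full algorithm:

```python
import numpy as np

def flag_label_issues(labels, pred_probs):
    """Toy sketch of the Confident Learning thresholding rule.

    t_j = average predicted probability of class j among examples whose
    given label is j (the model's self-confidence for class j).
    An example is flagged when its given label falls below its own class
    threshold while some other class clears that class's threshold.
    """
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs)
    n_classes = pred_probs.shape[1]
    # Per-class confidence thresholds t_j
    thresholds = np.array(
        [pred_probs[labels == j, j].mean() for j in range(n_classes)]
    )
    issues = np.zeros(len(labels), dtype=bool)
    for i, y in enumerate(labels):
        # Other classes whose predicted probability clears their threshold
        confident_others = [
            k for k in range(n_classes)
            if k != y and pred_probs[i, k] >= thresholds[k]
        ]
        issues[i] = bool(confident_others) and pred_probs[i, y] < thresholds[y]
    return issues

# Example: 4 texts labeled polite (0) / impolite (1); the 2nd is mislabeled.
labels = [0, 0, 1, 1]
pred_probs = [[0.90, 0.10],
              [0.10, 0.90],   # labeled 0, but the model is confident it is 1
              [0.10, 0.90],
              [0.15, 0.85]]
print(flag_label_issues(labels, pred_probs))  # flags only the 2nd example
```

In practice one would obtain `pred_probs` via cross-validation from a baseline classifier and pass them to cleanlab's `find_label_issues`, which adds calibration and ranking on top of this basic thresholding.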