{"id":606581,"date":"2025-07-14T18:44:18","date_gmt":"2025-07-14T23:44:18","guid":{"rendered":"https:\/\/towardsdatascience.com\/?p=606581"},"modified":"2025-07-14T18:44:33","modified_gmt":"2025-07-14T23:44:33","slug":"topic-model-labelling-with-llms","status":"publish","type":"post","link":"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/","title":{"rendered":"Topic Model Labelling with\u00a0LLMs"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\"><strong>By<\/strong>: <em>Petr Kor\u00e1b<\/em>*, <em>Martin Feldkircher**, *** Viktoriya Teliha<\/em>** (*Text Mining Stories, Prague, **Vienna School of International Studies, ***Centre for Applied Macroeconomic Analysis, Australia).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong><mdspan datatext=\"el1752002640909\" class=\"mdspan-comment\">Manual labeling<\/mdspan> <\/strong>of terms produced by topic models requires domain experience and may be subjective to the labeler. Especially when the number of topics grows large, it might be convenient to assign human-readable names to topics automatically with an LLM. Simply copying and pasting the results into UIs, such as <a href=\"https:\/\/chatgpt.com\" target=\"_blank\" rel=\"noreferrer noopener\">chatgpt.com,<\/a> is quite a \u201cblack-box\u201d and unsystematic. A better choice would be to add topic labeling to the code with a documented labeler, which gives the engineer more control over the results and ensures reproducibility. 
This tutorial will explore in detail:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">How to train a topic model with the new <strong>Turftopic<\/strong> Python package<\/li>\n\n\n\n<li class=\"wp-block-list-item\">How to label topic model results with GPT-4o mini.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">We will train a cutting-edge <strong>FASTopic <\/strong>model by <a href=\"https:\/\/arxiv.org\/pdf\/2405.17978\" target=\"_blank\" rel=\"noreferrer noopener\">Xiaobao Wu et al. [3]<\/a> presented at <a href=\"https:\/\/neurips.cc\/virtual\/2024\/poster\/96416\" target=\"_blank\" rel=\"noreferrer noopener\">NeurIPS 2024<\/a>. This model <a href=\"https:\/\/medium.com\/text-mining-stories\/choose-the-right-one-evaluating-topic-models-for-business-intelligence-1e2f418d7573?sk=3bc07126cc44a1d254fb8a0967702957\" target=\"_blank\" rel=\"noreferrer noopener\">outperforms competing models<\/a> such as <strong>BERTopic<\/strong> in several key metrics (e.g., topic diversity) and has broad <a href=\"https:\/\/medium.com\/data-science\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37?sk=9a88660d4e4c64a1d91ad8ede730a520\" target=\"_blank\" rel=\"noreferrer noopener\">applications in business intelligence<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1. Components of the Topic Modelling Pipeline<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Labelling is an essential part of the topic modelling pipeline because it bridges the model outputs with real-world decisions. The model assigns a number to each topic, but a business decision relies on a human-readable text label summarizing the typical terms in each topic. Topics are typically labelled by (1) labellers with domain experience, often following a well-defined labelling strategy, (2) LLMs, and (3) commercial tools.
The path from raw data to decision-making through a topic model is illustrated in Image 1.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/Components-of-topic-modelling-pipeline-1-1024x793.png\" alt=\"\" class=\"wp-image-609185\"\/><figcaption class=\"wp-element-caption\">Image 1. Components of the topic modeling pipeline.<br>Source: adapted and extended from Kardos et al. [2].<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The pipeline starts with raw data, which is preprocessed and vectorized for the topic model. The model returns topics identified by integer IDs, each with its typical terms (words or bigrams). The labeling layer replaces the integer in the topic name with a text label. The model user (<a href=\"https:\/\/xenoss.io\/blog\/topic-modeling-for-business-introduction\">product manager<\/a>, <a href=\"https:\/\/medium.com\/text-mining-stories\/choose-the-right-one-evaluating-topic-models-for-business-intelligence-1e2f418d7573?sk=3bc07126cc44a1d254fb8a0967702957\">customer care<\/a> dept., etc.) then works with the labelled terms to make data-informed decisions. The modeling example below follows this pipeline step by step.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">2. Data<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">We will use FASTopic to classify customer complaint data into 10 topics. The example uses a synthetically generated <a href=\"https:\/\/www.kaggle.com\/datasets\/rtweera\/customer-care-emails\" target=\"_blank\" rel=\"noreferrer noopener\">Customer Care Email<\/a> dataset available on Kaggle, licensed under the <a href=\"https:\/\/www.gnu.org\/licenses\/gpl-3.0.html\" target=\"_blank\" rel=\"noreferrer noopener\">GPL-3 license<\/a>.
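<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The dataset is prefiltered to incoming emails before modeling. A minimal sketch of such a filter with pandas (the <em>direction<\/em> column and its values are hypothetical, introduced only for illustration; check the actual Kaggle schema):<\/p>\n\n\n\n

```python
import pandas as pd

# Toy stand-in for the Kaggle export; the 'direction' column is hypothetical
emails = pd.DataFrame({
    "message": ["My order arrived damaged.",
                "Thank you for contacting support!",
                "I would like a refund."],
    "direction": ["incoming", "outgoing", "incoming"],
})

# Keep only incoming emails for topic modeling
incoming = emails[emails["direction"] == "incoming"].reset_index(drop=True)
print(len(incoming))  # number of incoming emails kept
```

\n\n\n\n<p class=\"wp-block-paragraph\">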
The prefiltered data covers 692 incoming emails to the customer care department and looks like this:<\/p>\n\n\n\n<figure class=\"wp-block-image alignwide\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/1m45AM5Blp2AY-woHy84mxQ.png\" alt=\"\" class=\"wp-image-609249\"\/><figcaption class=\"wp-element-caption\">Image 2. <a href=\"https:\/\/www.kaggle.com\/datasets\/rtweera\/customer-care-emails\" target=\"_blank\" rel=\"noreferrer noopener\">Customer Care Email<\/a> dataset. Image by&nbsp;authors.<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">2.1. Data preprocessing<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Text data is preprocessed sequentially in six steps: numbers are removed first, followed by emojis, English stopwords, and punctuation. Additional tokens (such as company and person names) are removed next, before lemmatization. Read more on text preprocessing for topic models in our <a href=\"https:\/\/medium.com\/data-science\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37?sk=9a88660d4e4c64a1d91ad8ede730a520\">previous tutorial<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">First, we read the cleaned data and collect the corpus as a list of documents:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import pandas as pd\n\n# Read the cleaned data\ndata = pd.read_csv(&quot;data.csv&quot;, usecols=[&#039;message_clean&#039;])\n\n# Create the corpus as a list of documents\ndocs = data[&quot;message_clean&quot;].tolist()<\/code><\/pre>\n\n\n\n<figure class=\"wp-block-image alignfull\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/1fIP08_8zJJTRcWAn6DRdiQ.png\" alt=\"\" class=\"wp-image-609248\"\/><figcaption class=\"wp-element-caption\">Image 3. Recommended cleaning pipeline for topic models.
Image by&nbsp;authors.<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">2.2. Bigram vectorization<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Next, we create a bigram vectorizer so that the model works with bigrams during training. Bigram models carry more context than single-word models and better identify the key qualities and problems behind business decisions <em>(\u201cdelivery\u201d vs. \u201cpoor delivery\u201d, \u201cstomach\u201d vs. \u201csensitive stomach\u201d, etc.)<\/em>.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from sklearn.feature_extraction.text import CountVectorizer\n\nbigram_vectorizer = CountVectorizer(\n    ngram_range=(2, 2),               # only bigrams\n    max_features=1000                 # top 1000 bigrams by frequency\n)<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">3. Model training<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The FASTopic model is currently implemented in two Python packages:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/pypi.org\/project\/fastopic\/\"><strong>Fastopic<\/strong><\/a>: the official package by X. Wu<\/li>\n\n\n\n<li class=\"wp-block-list-item\"><a href=\"https:\/\/pypi.org\/project\/turftopic\/\"><strong>Turftopic<\/strong><\/a>: a new Python package that brings many helpful topic modeling features, including labeling with LLMs [2]<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">We will use the Turftopic implementation because of the direct link between the model and the <a href=\"https:\/\/x-tabdeveloping.github.io\/turftopic\/namers\/\">Namer<\/a> that offers LLM labelling.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let\u2019s set up the model and fit it to the data.
It is essential to set a random state to make training reproducible.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from turftopic import FASTopic\n\n# Model specification\ntopic_size  = 10\nmodel = FASTopic(n_components = topic_size,       # train for 10 topics\n                 vectorizer = bigram_vectorizer,  # generate bigrams in topics\n                 random_state = 32).fit(docs)     # set random state\n\n# Extract topic data from the fitted model\ntopic_data = model.prepare_topic_data(docs)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Now, let&#8217;s prepare a dataframe with topic IDs and the top 10 bigrams with the highest probability received from the model (the code is <a href=\"https:\/\/github.com\/PetrKorab\/Topic-Model-Labelling-with-LLMs\/blob\/main\/code.ipynb\">here<\/a>).<\/p>\n\n\n\n<figure class=\"wp-block-image alignfull size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/unlabelled-topics-1-1024x223.png\" alt=\"\" class=\"wp-image-609076\"\/><figcaption class=\"wp-element-caption\">Image 4. Unlabeled topics in FASTopic. Image by authors.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">4. Topic labeling<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In the next step, we add text labels to the topic IDs with GPT-4o mini. Let\u2019s follow these steps:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Create an <a href=\"https:\/\/auth.openai.com\/create-account\">OpenAI account<\/a> and choose a billing plan (e.g.
\u201cPay as you go\u201d)<\/li>\n\n\n\n<li class=\"wp-block-list-item\">Set an environment variable with the OpenAI API key<\/li>\n\n\n\n<li class=\"wp-block-list-item\">Use <a href=\"https:\/\/x-tabdeveloping.github.io\/turftopic\/namers\">Turftopic\u2019s Namer<\/a> to label the topics\u2019 keywords with the LLM.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">With this code, we label the topics and add a new column <em>topic_name<\/em> to the dataframe.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from turftopic.namers import OpenAITopicNamer\nimport os\n\n# OpenAI API key to access GPT-4o mini\nos.environ[&quot;OPENAI_API_KEY&quot;] = &quot;&quot;\n\n# Use the Namer to label the topic model with the LLM\nnamer = OpenAITopicNamer(&quot;gpt-4o-mini&quot;)\nmodel.rename_topics(namer)\n\n# Create a dataframe with labelled topics\ntopics_df = model.topics_df()\ntopics_df.columns = [&#039;topic_id&#039;, &#039;topic_name&#039;, &#039;topic_words&#039;]\n\n# Split the comma-separated words and explode to one row per word\ntopics_df[&#039;topic_word&#039;] = topics_df[&#039;topic_words&#039;].str.split(&#039;,&#039;)\ntopics_df = topics_df.explode(&#039;topic_word&#039;)\ntopics_df[&#039;topic_word&#039;] = topics_df[&#039;topic_word&#039;].str.strip()\n\n# Add a rank for each word within a topic\ntopics_df[&#039;word_rank&#039;] = topics_df.groupby(&#039;topic_id&#039;).cumcount() + 1\n\n# Pivot to wide format\nwide = topics_df.pivot(index=&#039;word_rank&#039;,\n                       columns=[&#039;topic_id&#039;, &#039;topic_name&#039;], values=&#039;topic_word&#039;)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Here is the table with the labeled topics after additional transformations.
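<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Conceptually, the Namer sends each topic\u2019s top terms to the LLM together with a naming instruction and writes the reply back as the topic name. The sketch below illustrates that idea with a hand-rolled prompt builder; the wording is ours, not Turftopic\u2019s actual prompt template:<\/p>\n\n\n\n

```python
def build_labeling_prompt(topic_terms, max_words=4):
    """Build an illustrative prompt asking an LLM to name one topic."""
    terms = ", ".join(topic_terms)
    return (
        "You will receive the most important terms of a topic "
        "from a topic model. Reply with a short, human-readable "
        f"topic name of at most {max_words} words.\n"
        f"Terms: {terms}"
    )

prompt = build_labeling_prompt(["late delivery", "damaged package", "refund request"])
print(prompt)
```

\n\n\n\n<p class=\"wp-block-paragraph\">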
It would be interesting to compare the LLM\u2019s labels with those of a company insider who knows the company\u2019s processes and customer base. Since the dataset is synthetic, we rely on the GPT-4o mini labels.<\/p>\n\n\n\n<figure class=\"wp-block-image alignfull size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/labelled-topics-2-1024x267.png\" alt=\"\" class=\"wp-image-609079\"\/><figcaption class=\"wp-element-caption\">Image 5. Labeled topics in FASTopic by GPT-4o mini. Image by&nbsp;authors.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">We can also visualize the labeled topics for a better presentation. The code for the bigram word cloud visualization, generated from the topics produced by the model, is <a href=\"https:\/\/github.com\/PetrKorab\/Topic-Model-Labelling-with-LLMs\">here<\/a>.<\/p>\n\n\n\n<figure class=\"wp-block-image alignwide size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/FAST_10_wordcloud-4-1024x267.png\" alt=\"\" class=\"wp-image-609152\"\/><figcaption class=\"wp-element-caption\">Image 6. Word cloud visualization of labeled topics by GPT-4o mini.
Image by authors.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Summary<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">The new Turftopic Python package links recent topic models with an <a href=\"https:\/\/x-tabdeveloping.github.io\/turftopic\/namers\/\" target=\"_blank\" rel=\"noreferrer noopener\">LLM-based labeler<\/a> for generating human-readable topic names.<\/li>\n\n\n\n<li class=\"wp-block-list-item\">The main benefits are: 1) independence from the labeler\u2019s subjective experience, 2) the capacity to label models with a large number of topics, which a human labeler would struggle to label consistently, and 3) greater control over the code and reproducibility.<\/li>\n\n\n\n<li class=\"wp-block-list-item\">Topic labeling with LLMs has a wide range of applications in diverse areas. Read <a href=\"https:\/\/crawford.anu.edu.au\/sites\/default\/files\/2025-06\/35_2025_Feldkircher_Korab_Teliha_1.pdf\" target=\"_blank\">our latest paper<\/a> [1] on the topic modeling of central bank communication, where GPT-4 labeled the FASTopic model.<\/li>\n\n\n\n<li class=\"wp-block-list-item\">The labels differ slightly across training runs, even with the random state set. This is not caused by the Namer but by random processes in model training, which return the bigrams ranked by probability in descending order.
The differences in probabilities are tiny, so each training run promotes a few new terms into the top 10, which in turn changes the LLM\u2019s labels.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">The data and complete code for this tutorial are <a href=\"https:\/\/github.com\/PetrKorab\/Topic-Model-Labelling-with-LLMs\">here<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em><strong>Petr Korab<\/strong> is a Senior Data Analyst and Founder of <a href=\"https:\/\/textminingstories.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Text Mining Stories<\/a> with over eight years of experience in Business Intelligence and NLP.<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Sign up for <a href=\"https:\/\/textminingstories.com\/blog\">our blog<\/a> to get the latest news from the NLP industry!<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">References<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">[1] Feldkircher, M., Korab, P., Teliha, V. (2025). \u201c<a href=\"https:\/\/ideas.repec.org\/p\/een\/camaaa\/2025-35.html\" rel=\"noreferrer noopener\" target=\"_blank\">What Do Central Bankers Talk About? Evidence From the BIS Archive<\/a>,\u201d CAMA Working Papers 2025\u201335, Centre for Applied Macroeconomic Analysis, Crawford School of Public Policy, The Australian National University.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">[2] Kardos, M., Enevoldsen, K. C., Kostkan, J., Kristensen-McLachlan, R. D., Rocca, R. (2025). Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers. <em>Journal of Open Source Software<\/em>, 10(111), 8183, <a href=\"https:\/\/doi.org\/10.21105\/joss.08183\" rel=\"noreferrer noopener\" target=\"_blank\">https:\/\/doi.org\/10.21105\/joss.08183<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">[3] Wu, X., Nguyen, T., Ce Zhang, D., Yang Wang, W., Luu, A. T. (2024).
FASTopic: <a href=\"https:\/\/arxiv.org\/abs\/2405.17978\" rel=\"noreferrer noopener\" target=\"_blank\">A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm<\/a>. arXiv preprint: 2405.17978.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Python tutorial for reproducible labeling of cutting-edge topic models with GPT4-o-mini.<\/p>\n","protected":false},"author":18,"featured_media":606582,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"is_member_only":false,"sub_heading":"Python tutorial for reproducible labeling of cutting-edge topic models with GPT4-o-mini.","footnotes":""},"categories":[21],"tags":[465,446,568,467,512],"sponsor":[],"coauthors":[30697],"class_list":["post-606581","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-large-language-models","tag-llm","tag-machine-learning","tag-nlp","tag-python","tag-topic-modeling"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Topic Model Labelling with\u00a0LLMs | Towards Data Science<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Topic Model Labelling with\u00a0LLMs | Towards Data Science\" \/>\n<meta property=\"og:description\" content=\"Python tutorial for reproducible labeling of cutting-edge topic models with GPT4-o-mini.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/\" \/>\n<meta property=\"og:site_name\" content=\"Towards Data Science\" \/>\n<meta property=\"article:published_time\" 
content=\"2025-07-14T23:44:18+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-07-14T23:44:33+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/FAST_10_wordcloud.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1790\" \/>\n\t<meta property=\"og:image:height\" content=\"1131\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Petr Kor\u00e1b\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@TDataScience\" \/>\n<meta name=\"twitter:site\" content=\"@TDataScience\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Petr Kor\u00e1b\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"7 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/\"},\"author\":{\"name\":\"TDS Editors\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee\"},\"headline\":\"Topic Model Labelling with\u00a0LLMs\",\"datePublished\":\"2025-07-14T23:44:18+00:00\",\"dateModified\":\"2025-07-14T23:44:33+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/\"},\"wordCount\":1102,\"publisher\":{\"@id\":\"https:\/\/towardsdatascience.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/FAST_10_wordcloud.png\",\"keywords\":[\"Llm\",\"Machine 
Learning\",\"NLP\",\"Python\",\"Topic Modeling\"],\"articleSection\":[\"Large Language Models\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/\",\"url\":\"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/\",\"name\":\"Topic Model Labelling with\u00a0LLMs | Towards Data Science\",\"isPartOf\":{\"@id\":\"https:\/\/towardsdatascience.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/FAST_10_wordcloud.png\",\"datePublished\":\"2025-07-14T23:44:18+00:00\",\"dateModified\":\"2025-07-14T23:44:33+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/#primaryimage\",\"url\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/FAST_10_wordcloud.png\",\"contentUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/FAST_10_wordcloud.png\",\"width\":1790,\"height\":1131,\"caption\":\"Image by authors\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/towardsdatascience.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Topic Model Labelling 
with\u00a0LLMs\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/towardsdatascience.com\/#website\",\"url\":\"https:\/\/towardsdatascience.com\/\",\"name\":\"Towards Data Science\",\"description\":\"Publish AI, ML &amp; data-science insights to a global community of data professionals.\",\"publisher\":{\"@id\":\"https:\/\/towardsdatascience.com\/#organization\"},\"alternateName\":\"TDS\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/towardsdatascience.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/towardsdatascience.com\/#organization\",\"name\":\"Towards Data Science\",\"alternateName\":\"TDS\",\"url\":\"https:\/\/towardsdatascience.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg\",\"contentUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg\",\"width\":696,\"height\":696,\"caption\":\"Towards Data Science\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/TDataScience\",\"https:\/\/www.youtube.com\/c\/TowardsDataScience\",\"https:\/\/www.linkedin.com\/company\/towards-data-science\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee\",\"name\":\"TDS Editors\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/image\/23494c9101089ad44ae88ce9d2f56aac\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g\",\"caption\":\"TDS 
Editors\"},\"description\":\"Building a vibrant data science and machine learning community. Share your insights and projects with our global audience: bit.ly\/write-for-tds\",\"url\":\"https:\/\/towardsdatascience.com\/author\/towardsdatascience\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Topic Model Labelling with\u00a0LLMs | Towards Data Science","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/","og_locale":"en_US","og_type":"article","og_title":"Topic Model Labelling with\u00a0LLMs | Towards Data Science","og_description":"Python tutorial for reproducible labeling of cutting-edge topic models with GPT4-o-mini.","og_url":"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/","og_site_name":"Towards Data Science","article_published_time":"2025-07-14T23:44:18+00:00","article_modified_time":"2025-07-14T23:44:33+00:00","og_image":[{"width":1790,"height":1131,"url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/FAST_10_wordcloud.png","type":"image\/png"}],"author":"Petr Kor\u00e1b","twitter_card":"summary_large_image","twitter_creator":"@TDataScience","twitter_site":"@TDataScience","twitter_misc":{"Written by":"Petr Kor\u00e1b","Est. 
reading time":"7 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/#article","isPartOf":{"@id":"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/"},"author":{"name":"TDS Editors","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee"},"headline":"Topic Model Labelling with\u00a0LLMs","datePublished":"2025-07-14T23:44:18+00:00","dateModified":"2025-07-14T23:44:33+00:00","mainEntityOfPage":{"@id":"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/"},"wordCount":1102,"publisher":{"@id":"https:\/\/towardsdatascience.com\/#organization"},"image":{"@id":"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/#primaryimage"},"thumbnailUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/FAST_10_wordcloud.png","keywords":["Llm","Machine Learning","NLP","Python","Topic Modeling"],"articleSection":["Large Language Models"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/","url":"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/","name":"Topic Model Labelling with\u00a0LLMs | Towards Data 
Science","isPartOf":{"@id":"https:\/\/towardsdatascience.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/#primaryimage"},"image":{"@id":"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/#primaryimage"},"thumbnailUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/FAST_10_wordcloud.png","datePublished":"2025-07-14T23:44:18+00:00","dateModified":"2025-07-14T23:44:33+00:00","breadcrumb":{"@id":"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/#primaryimage","url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/FAST_10_wordcloud.png","contentUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/FAST_10_wordcloud.png","width":1790,"height":1131,"caption":"Image by authors"},{"@type":"BreadcrumbList","@id":"https:\/\/towardsdatascience.com\/topic-model-labelling-with-llms\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/towardsdatascience.com\/"},{"@type":"ListItem","position":2,"name":"Topic Model Labelling with\u00a0LLMs"}]},{"@type":"WebSite","@id":"https:\/\/towardsdatascience.com\/#website","url":"https:\/\/towardsdatascience.com\/","name":"Towards Data Science","description":"Publish AI, ML &amp; data-science insights to a global community of data 
professionals.","publisher":{"@id":"https:\/\/towardsdatascience.com\/#organization"},"alternateName":"TDS","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/towardsdatascience.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/towardsdatascience.com\/#organization","name":"Towards Data Science","alternateName":"TDS","url":"https:\/\/towardsdatascience.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/","url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg","contentUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg","width":696,"height":696,"caption":"Towards Data Science"},"image":{"@id":"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/TDataScience","https:\/\/www.youtube.com\/c\/TowardsDataScience","https:\/\/www.linkedin.com\/company\/towards-data-science\/"]},{"@type":"Person","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee","name":"TDS Editors","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/image\/23494c9101089ad44ae88ce9d2f56aac","url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","caption":"TDS Editors"},"description":"Building a vibrant data science and machine learning community. 
Share your insights and projects with our global audience: bit.ly\/write-for-tds","url":"https:\/\/towardsdatascience.com\/author\/towardsdatascience\/"}]}},"distributor_meta":false,"distributor_terms":false,"distributor_media":false,"distributor_original_site_name":"TDS Contributor Portal","distributor_original_site_url":"https:\/\/contributor.insightmediagroup.io","push-errors":false,"_links":{"self":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts\/606581","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/users\/18"}],"replies":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/comments?post=606581"}],"version-history":[{"count":0,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts\/606581\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/media\/606582"}],"wp:attachment":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/media?parent=606581"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/categories?post=606581"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/tags?post=606581"},{"taxonomy":"sponsor","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/sponsor?post=606581"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/coauthors?post=606581"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}