{"id":605801,"date":"2025-04-24T14:50:50","date_gmt":"2025-04-24T19:50:50","guid":{"rendered":"https:\/\/towardsdatascience.com\/?p=605801"},"modified":"2025-04-24T14:51:05","modified_gmt":"2025-04-24T19:51:05","slug":"choose-the-right-one-evaluating-topic-models-for-business-intelligence","status":"publish","type":"post","link":"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/","title":{"rendered":"Choose the Right One: Evaluating Topic Models for Business Intelligence"},"content":{"rendered":"\n<p class=\"has-text-align-left wp-block-paragraph\"><strong><mdspan datatext=\"el1745524147724\" class=\"mdspan-comment\">Topic models<\/mdspan> <\/strong>are used in businesses to classify brand-related text datasets (such as product and site reviews, surveys, and social media comments) and to track how customer satisfaction metrics change over time.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">There is a myriad of recent topic models one can choose from: the widely used <a href=\"https:\/\/pypi.org\/project\/bertopic\/\">BERTopic<\/a> by <a href=\"https:\/\/arxiv.org\/pdf\/2203.05794\">Maarten Grootendorst (2022)<\/a>, the recent <a href=\"https:\/\/pypi.org\/project\/fastopic\/\">FASTopic<\/a> presented at last year\u2019s <a href=\"https:\/\/neurips.cc\/virtual\/2024\/poster\/96416\">NeurIPS<\/a>, (<a href=\"https:\/\/arxiv.org\/pdf\/2405.17978\">Xiaobao Wu et al.,2024)<\/a>, the <a href=\"https:\/\/bab2min.github.io\/tomotopy\/v0.13.0\/en\/#tomotopy.DTModel\">Dynamic Topic Model<\/a><strong> <\/strong>by<a href=\"https:\/\/arxiv.org\/pdf\/1602.06049.pdf\"> Blei and Lafferty (2006<\/a>), or a fresh semi-supervised <a href=\"https:\/\/pypi.org\/project\/seededPF\/\">Seeded Poisson Factorization<\/a> model (<a href=\"https:\/\/arxiv.org\/abs\/2503.02741\">Prostmaier et al., 2025<\/a>).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For a business use case, training topic models on customer texts, we often get results 
that are not identical and sometimes even conflicting. In business, imperfections cost money, so engineers should place into production the model that solves the problem most effectively. New methods and metrics for evaluating topic model quality appear at the same pace as the models themselves.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This practical tutorial will focus on bigram topic models<em>, <\/em>which provide more relevant information and identify key qualities and problems for business decisions better than single-word models<em> (\u201cdelivery\u201d vs. \u201cpoor delivery\u201d, \u201cstomach\u201d vs. \u201csensitive stomach\u201d, <\/em>etc.). On the one hand, bigram models are more detailed; on the other, many evaluation metrics were not originally designed to evaluate them. To provide more background in this area, we will explore in detail:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">How to <strong>evaluate the quality of bigram topic models<\/strong><\/li>\n\n\n\n<li class=\"wp-block-list-item\">How to prepare an <strong>email classification pipeline <\/strong>in Python.<strong>&nbsp;<\/strong><\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Our example use case will show how bigram topic models (<a href=\"https:\/\/pypi.org\/project\/bertopic\/\">BERTopic<\/a> and <a href=\"https:\/\/pypi.org\/project\/fastopic\/\">FASTopic<\/a>) help prioritize email communication with customers on certain topics and reduce response times.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1. 
What are topic model quality indicators?<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The evaluation task should target the ideal state:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><em>The ideal topic model should produce topics where the words or bigrams (two consecutive words) in each topic are highly semantically related and distinct across topics.<\/em><\/p>\n<\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">In practice, this means that the words predicted for each topic are <a href=\"https:\/\/paperswithcode.com\/task\/semantic-similarity\" rel=\"noreferrer noopener\" target=\"_blank\">semantically similar<\/a> according to human judgment, and there is low duplication of words between topics.<em>&nbsp;<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>It is standard to calculate a set of metrics for each trained model and compare their performance to make a qualified decision on which model to place into production or use for a business decision.<\/em><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>Coherence <\/strong>metrics<strong> <\/strong>evaluate how well the words discovered by a topic model make sense to humans (have <a href=\"https:\/\/www.geeksforgeeks.org\/understanding-semantic-analysis-nlp\/\" target=\"_blank\" rel=\"noreferrer noopener\">similar semantics<\/a> in each topic).<\/li>\n\n\n\n<li class=\"wp-block-list-item\"><strong>Topic diversity<\/strong> measures how different the discovered topics are from one another.&nbsp;<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Bigram topic models work well with these metrics:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>NPMI <\/strong><em>(Normalized Point-wise Mutual Information) <\/em>uses probabilities estimated in a reference corpus to calculate a [-1:1] score for each word (or bigram) predicted by the model. 
Read <a href=\"https:\/\/arxiv.org\/pdf\/2401.15351\" target=\"_blank\" rel=\"noreferrer noopener\">[1]<\/a> for more details.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">The reference corpus can be either internal (the training set) or external (e.g., an external email dataset). A large, external, and comparable corpus is a better choice because it can help reduce bias in training sets. Because this metric works with word frequencies,<strong> the training set and the reference corpus should be preprocessed the same way <\/strong>(i.e., if we remove numbers and stopwords in the training set, we should also do it in the reference corpus). The aggregate model score is the average of words across topics.&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>SC<em> <\/em><\/strong><em>(Semantic Coherence) <\/em>does not need a reference corpus. It uses the same dataset as was used to train the topic model. Read more in [<a href=\"https:\/\/aclanthology.org\/D11-1024\/\" target=\"_blank\" rel=\"noreferrer noopener\">2<\/a>].<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Let\u2019s say we have the Top 4 words for one topic: <em>\u201capple\u201d, \u201cbanana\u201d, \u201cjuice\u201d, \u201csmoothie\u201d <\/em>predicted by a topic model.<em> <\/em>Then <em>SC <\/em>looks at all combinations of words in the training set going from left to right, starting with the first word<em> {apple, banana}<\/em>, <em>{apple, juice}<\/em>, <em>{apple, smoothie}<\/em> then the second word <em>{banana, juice}<\/em>, <em>{banana, smoothie}<\/em>, then last word <em>{juice, smoothie} <\/em>and it counts the number of documents that contain both words, divided by the frequency of documents that contain the first word. 
The overall SC score for a model is the mean of all topic-level scores.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/semantic-coherence-illustration-3.png\" alt=\"\" class=\"wp-image-602132\"\/><figcaption class=\"wp-element-caption\">Image 1. Semantic coherence by Mimno et al. (2011) illustration. Image by author.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>PUV<em> <\/em><\/strong><em>(Percentage of Unique Words)<\/em> calculates the share of unique words across topics in the model. <em>PUV = 1<\/em> means that each topic in the model contains unique bigrams. Values close to 1 indicate a well-shaped, high-quality model with small word overlap between topics <a href=\"https:\/\/aclanthology.org\/2020.tacl-1.29\/\" target=\"_blank\" rel=\"noreferrer noopener\">[2]<\/a>.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><em>The <strong>closer to 0 the SC and NPMI scores are<\/strong>, the more coherent the model is (bigrams predicted by the topic model for each topic are semantically similar). The <strong>closer to 1 PUV is<\/strong>, the easier the model is to interpret and use, because bigrams between topics do not overlap.&nbsp;<\/em><\/p>\n<\/blockquote>\n\n\n\n<h2 class=\"wp-block-heading\">2. How can we prioritize email communication with topic&nbsp;models?<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">A large share of customer communication, not only in e-commerce businesses, is now handled by chatbots and personal client sections. Yet, it is common to communicate with customers by email. 
Many email providers offer developers broad API flexibility to customize their email platform (<a href=\"https:\/\/mailchimp.com\/developer\/tools\/\" rel=\"noreferrer noopener\" target=\"_blank\"><em>e.g., MailChimp<\/em><\/a><em>, <\/em><a href=\"https:\/\/github.com\/sendgrid\/sendgrid-python\" rel=\"noreferrer noopener\" target=\"_blank\"><em>SendGrid<\/em><\/a><em>, <\/em><a href=\"https:\/\/github.com\/getbrevo\/brevo-python\" rel=\"noreferrer noopener\" target=\"_blank\"><em>Brevo<\/em><\/a>). Here, topic models can make mailing more flexible and effective.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>In this use case, the pipeline takes incoming emails as input and uses the trained topic classifier to categorize their content. The outcome is the classified topic that the Customer Care (CC) Department sees next to each email. The main objective is to allow the CC staff to prioritize the categories of emails and reduce the response time to the most sensitive requests (those that directly affect margin-related KPIs or OKRs).<\/em><\/p>\n\n\n\n<figure class=\"wp-block-image alignwide\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/1c7KCukgLptSzBXYRE2vE_Q.png\" alt=\"\" class=\"wp-image-602144\"\/><figcaption class=\"wp-element-caption\">Image 2. Topic model pipeline illustration. Image by&nbsp;author.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">3. Data and model&nbsp;set-ups<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">We will train <strong>FASTopic<\/strong> and <strong>BERTopic<\/strong> to classify emails into 8 and 10 topics and evaluate the quality of all model specifications. 
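Before turning to the training, the routing step of the pipeline from Section 2 can be sketched as follows. The topic labels and the priority set below are invented placeholders, not the output of the trained models:

```python
# Hypothetical sketch of the routing step: attach the topic label (and a
# priority flag) that Customer Care sees next to each incoming email.
# TOPIC_LABELS and HIGH_PRIORITY are invented placeholders.
TOPIC_LABELS = {0: "Time Delays", 1: "Latency Issues", 2: "User Permissions"}
HIGH_PRIORITY = {"Time Delays", "Latency Issues"}

def route_email(topic_probs):
    """Pick the most probable topic for one email and flag priority topics."""
    topic_id = max(range(len(topic_probs)), key=topic_probs.__getitem__)
    label = TOPIC_LABELS.get(topic_id, "Unclassified")
    return label, label in HIGH_PRIORITY
```

In production, `topic_probs` would come from the trained classifier, and the returned flag would drive the queue ordering in the email platform.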
Read my previous <a href=\"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/\" target=\"_blank\" rel=\"noreferrer noopener\">TDS tutorial<\/a> on topic modeling with these cutting-edge topic models.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As a training set, we use a synthetically generated <a href=\"https:\/\/www.kaggle.com\/datasets\/rtweera\/customer-care-emails\" rel=\"noreferrer noopener\" target=\"_blank\">Customer Care Email <\/a>dataset available on Kaggle with a <a href=\"https:\/\/www.gnu.org\/licenses\/gpl-3.0.html\" rel=\"noreferrer noopener\" target=\"_blank\">GPL-3 license<\/a>. The prefiltered data covers 692 incoming emails and looks like this:<\/p>\n\n\n\n<figure class=\"wp-block-image alignwide\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/1m45AM5Blp2AY-woHy84mxQ.png\" alt=\"\" class=\"wp-image-602147\"\/><figcaption class=\"wp-element-caption\">Image 3. Customer Care Email dataset. Image by author.<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">3.1. Data preprocessing<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Cleaning text in the right order is essential for topic models to work in practice because it minimizes the bias of each cleaning operation.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Numbers <\/strong>are typically removed first, followed by <strong>emojis, <\/strong>unless we need them for special purposes, such as extracting sentiment. 
<a href=\"https:\/\/www.geeksforgeeks.org\/removing-stop-words-nltk-python\/\" rel=\"noreferrer noopener\" target=\"_blank\"><strong>Stopwords <\/strong><\/a>for one or more languages are removed afterward, followed by <strong>punctuation <\/strong>so that stopwords don\u2019t break up into two tokens (<em>\u201cwe\u2019ve\u201d<\/em> -&gt; <em>\u201cwe\u201d + \u2018ve\u201d<\/em>). <strong>Additional tokens <\/strong>(company and people\u2019s names, etc.) are removed in the next step in the clean data before <a href=\"https:\/\/www.techtarget.com\/searchenterpriseai\/definition\/lemmatization#:~:text=Lemmatization%20is%20the%20process%20of,processing%20%28NLP%29%20and%20chatbots.\" rel=\"noreferrer noopener\" target=\"_blank\"><strong>lemmatization<\/strong><\/a>, which unifies tokens with the same semantics.<\/p>\n\n\n\n<figure class=\"wp-block-image alignfull\" datatext=\"\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/1s5bKzmGiyQ_i-JyOLL4vdQ.png\" alt=\"\" class=\"wp-image-602146\"\/><figcaption class=\"wp-element-caption\">Image 4. General preprocessing steps for topic modeling. Image by author<\/figcaption><\/figure>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><em>\u201cDelivery\u201d and \u201cdeliveries\u201d, \u201cbox\u201d and \u201cBoxes\u201d, or \u201cPrice\u201d and \u201cprices\u201d share the same word root, but without lemmatization, topic models would model them as separate factors. That\u2019s why customer emails should be lemmatized in the last step of preprocessing.<\/em><\/p>\n<\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">Text preprocessing is model-specific:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong><em>FASTopic<\/em><\/strong> works with clean data on input; some cleaning (stopwords) can be done during the training. 
The simplest and most effective way is to use the <a href=\"https:\/\/washer.textminingstories.com\/\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Washer, a no-code app for text data cleaning<\/em><\/a> designed for text mining projects.<\/li>\n\n\n\n<li class=\"wp-block-list-item\"><strong><em>BERTopic: <\/em><\/strong>the <a href=\"https:\/\/maartengr.github.io\/BERTopic\/faq.html#how-do-i-reduce-topic-outliers\" target=\"_blank\" rel=\"noreferrer noopener\">documentation<\/a> recommends that \u201c<em>removing stop words as a preprocessing step is not advised as the transformer-based embedding models that we use need the full context to create accurate embeddings<\/em>\u201d. For this reason, cleaning operations should be included in the model training.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3.2. Model compilation and&nbsp;training<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You can check the full code for FASTopic and BERTopic\u2019s training with bigram preprocessing and cleaning in <a href=\"https:\/\/github.com\/PetrKorab\/Choose-the-Right-One-Evaluating-Topic-Models-for-Business-Intelligence\" target=\"_blank\" rel=\"noreferrer noopener\">this repo<\/a>. 
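As a rough illustration of what the bigram preprocessing step produces, here is a plain-Python stand-in for a bigram count vectorizer (i.e., an ngram_range of (2, 2)); the actual pipeline in the repo may differ:

```python
# Plain-Python stand-in for a bigram count vectorizer (what an
# ngram_range=(2, 2) setting produces); not the actual repo code.
from collections import Counter

def bigrams(text):
    """Split cleaned text into consecutive two-word tokens."""
    tokens = text.lower().split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def bigram_counts(documents):
    """Per-document bigram frequencies, one Counter per document."""
    return [Counter(bigrams(doc)) for doc in documents]
```

Running this on already-cleaned email bodies yields the document-level bigram frequencies that the evaluation metrics below build on.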
My previous TDS tutorials (<a href=\"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/\" target=\"_blank\" rel=\"noreferrer noopener\">4<\/a>) and (<a href=\"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/\" target=\"_blank\" rel=\"noreferrer noopener\">5<\/a>) explain all steps in detail.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We train both models to classify 8 topics in customer email data. A simple inspection of the topic distribution shows that FASTopic distributes incoming emails quite evenly across topics. BERTopic classifies emails unevenly, keeping outliers (uncategorized tokens) in T-1 and a large share of incoming emails in T0.<\/p>\n\n\n\n<figure class=\"wp-block-image alignwide size-large\" datatext=\"\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/Combined_Topic_Distribution-BERTFAST-1024x352.png\" alt=\"\" class=\"wp-image-602136\"\/><figcaption class=\"wp-element-caption\">Image 5: Topic distribution, email classification. 
Image by author.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Here are the predicted bigrams for both models with topic labels:<\/p>\n\n\n\n<figure class=\"wp-block-image alignwide size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/FAST-labelled-1024x317.png\" alt=\"\" class=\"wp-image-602142\"\/><\/figure>\n\n\n\n<figure class=\"wp-block-image alignwide size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/BERT-labelled-1024x267.png\" alt=\"\" class=\"wp-image-602143\"\/><figcaption class=\"wp-element-caption\">Image 6: Models\u2019 predictions. Image by&nbsp;author.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Because the email corpus is a synthetic LLM-generated dataset, the naive labelling of the topics for both models shows topics that are:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>Comparable: <\/strong><em>Time Delays, Latency Issues, User Permissions, Deployment Issues, Compilation Errors,<\/em><\/li>\n\n\n\n<li class=\"wp-block-list-item\"><strong>Differing<\/strong>: <em>Unclassified <\/em>(BERTopic classifies outliers into T-1), <em>Improvement Suggestions, Authorization Errors, Performance Complaints <\/em>(FASTopic), <em>Cloud Management, Asynchronous Requests, General Requests<\/em> (BERTopic)<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">For business purposes, topics should be labelled by the company\u2019s insiders who know the customer base and the business priorities.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">4. Model evaluation<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">If three out of eight classified topics are labeled differently, then which model should be deployed? Let&#8217;s now evaluate the coherence and diversity for the trained BERTopic and FASTopic T-8 models.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4.1. 
NPMI<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">We need a reference corpus to calculate an <em>NPMI <\/em>for each model. The <a href=\"https:\/\/www.kaggle.com\/datasets\/tobiasbueck\/multilingual-customer-support-tickets\" rel=\"noreferrer noopener\" target=\"_blank\">Customer IT Support Ticket Dataset<\/a> from Kaggle, distributed under the <a href=\"https:\/\/creativecommons.org\/licenses\/by\/4.0\/\" rel=\"noreferrer noopener\" target=\"_blank\">Attribution 4.0 International license<\/a>, provides comparable data to our training set. The data is filtered to 11,923 English email bodies.&nbsp;<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><em>Calculate an NPMI for each bigram in the reference corpus with <\/em><a href=\"https:\/\/github.com\/PetrKorab\/Choose-the-Right-One-Evaluating-Topic-Models-for-Business-Intelligence\/blob\/main\/NPMI_eval.ipynb\" target=\"_blank\" rel=\"noreferrer noopener\"><em>this code<\/em><\/a><em>.<\/em><\/li>\n\n\n\n<li class=\"wp-block-list-item\"><em>Merge bigrams predicted by FASTopic and BERTopic with their NPMI scores from the reference corpus. The fewer NaNs in the table, the more accurate the metric is.<\/em><\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image alignwide\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/14fHYfct2tw5Q3A6VpPZfkg.png\" alt=\"\" class=\"wp-image-602148\"\/><figcaption class=\"wp-element-caption\">Image 7: NPMI coherence evaluation. Image by&nbsp;author.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><em>3. Average NPMIs within and across topics to get a single score for each model.<\/em><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4.2. SC<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">With <em>SC<\/em>, we assess the context and semantic similarity of the bigrams predicted by a topic model from their positions in the corpus relative to other tokens. 
To do so, we:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><em>Create a document-term matrix (DTM) with a count of how many times each bigram appears in each document.<\/em><\/li>\n\n\n\n<li class=\"wp-block-list-item\"><em>Calculate topic SC scores by searching for co-occurrences of the bigrams predicted by the topic models in the DTM.<\/em><\/li>\n\n\n\n<li class=\"wp-block-list-item\"><em>Average the topic SC scores to a model SC score.<\/em><\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">4.3. PUV<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The <em>PUV <\/em>topic diversity metric checks for duplicate bigrams between topics in a model.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><em>Join bigrams into tokens by replacing spaces with underscores in the FASTopic and BERTopic tables of predicted bigrams.<\/em><\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/1aqWXAlLUJNtkEYqi8q8LQA.png\" alt=\"\" class=\"wp-image-602149\"\/><figcaption class=\"wp-element-caption\">Image 8: Topic diversity illustration. Image by&nbsp;author.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><em>2. Calculate topic diversity as the count of distinct tokens divided by the count of tokens in the tables for both models.<\/em><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4.4. Model comparison<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Let\u2019s now summarize the coherence and diversity evaluation in Image 9. BERTopic models are more coherent but less diverse than FASTopic. The differences are not very large, but BERTopic suffers from an uneven distribution of incoming emails into the pipeline (see the charts in Image 5). Around 32% of classified emails fall into<em> T0<\/em>, and 15% into<em> T-1<\/em>, which covers the unclassified outliers. The models are trained with a minimum of 20 tokens per topic. 
Increasing this parameter makes the model unable to train, probably because of the small data size.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For this reason, FASTopic is a better choice for topic modelling in email classification with small training datasets.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/1cn80M85LolcFDDzb2kwjwQ.png\" alt=\"\" class=\"wp-image-602145\"\/><figcaption class=\"wp-element-caption\">Image 9: Topic model evaluation metrics. Image by&nbsp;author.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">The last step is to deploy the model with topic labels in the email platform to classify incoming emails:<\/p>\n\n\n\n<figure class=\"wp-block-image alignwide\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/04\/1TbsfZVqkKL6gI2OkVzSZ4w.png\" alt=\"\" class=\"wp-image-602150\"\/><figcaption class=\"wp-element-caption\">Image 10. Topic model classification pipeline, output. Image by&nbsp;author.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Summary<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Coherence and diversity metrics compare models with similar training setups, the same dataset, and the same cleaning strategy. We cannot compare their absolute values with the results of different training sessions, but they offer a <strong><em>relative comparison<\/em><\/strong> of various model specifications and help us decide which model should be deployed in the pipeline. Topic model evaluation should always be the last step before model deployment in business practice.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">How does customer care benefit from the topic modelling exercise? 
After the topic model is put into production, the pipeline sends a classified topic for each email to the email platform that Customer Care uses for communicating with customers. With a limited staff, it is now possible to prioritize and respond faster to the most sensitive business requests (such as <em>\u201ctime delays<\/em>\u201d and <em>\u201clatency issues\u201d<\/em>), and change priorities dynamically.&nbsp;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Data and complete codes for this tutorial are <a href=\"https:\/\/github.com\/PetrKorab\/Choose-the-Right-One-Evaluating-Topic-Models-for-Business-Intelligence\" data-type=\"link\" data-id=\"https:\/\/github.com\/PetrKorab\/Choose-the-Right-One-Evaluating-Topic-Models-for-Business-Intelligence\">here<\/a>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity is-style-dotted\"\/>\n\n\n\n<p class=\"wp-block-paragraph\"><strong><em>Petr Korab<\/em><\/strong><em> is a Python Engineer and Founder of <\/em><a href=\"https:\/\/textminingstories.com\" rel=\"noreferrer noopener\" target=\"_blank\"><em>Text Mining Stories<\/em><\/a><em> with over eight years of experience in Business Intelligence and NLP.<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong><em>Acknowledgments<\/em><\/strong><em>: I thank Tom\u00e1\u0161 Horsk\u00fd (Lentiamo, Prague), Martin Feldkircher, and Viktoriya Teliha (Vienna School of International Studies) for useful comments and suggestions.<\/em><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">References<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">[1] Blei, D. M., Lafferty, J. D. 2006. Dynamic topic models. In <em>Proceedings of the 23rd international conference on Machine learning (pp. 113\u2013120)<\/em>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">[2] Dieng A.B., Ruiz F. J. R., and Blei D. M. 2020. Topic modeling in embedding spaces. 
<a href=\"https:\/\/aclanthology.org\/2020.tacl-1.29\" rel=\"noreferrer noopener\" target=\"_blank\">Transactions of the Association for Computational Linguistics<\/a>, 8:439-453.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">[3] Grootendorst, M. 2022. Bertopic: Neural Topic Modeling With A Class-Based TF-IDF Procedure. <a href=\"https:\/\/arxiv.org\/abs\/2203.05794\" rel=\"noreferrer noopener\" target=\"_blank\"><em>Computer Science.<\/em><\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">[4] Korab, P. Topic Modelling in Business Intelligence: FASTopic and BERTopic in Code. <em>Towards Data Science<\/em>. 22.1.2025. Accessible from: <a href=\"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/\" target=\"_blank\" rel=\"noreferrer noopener\">link<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">[5] Korab, P. Topic Modelling with BERTtopic in Python. <em>Towards Data Science<\/em>. 4.1.2024. Accessible from: <a href=\"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/\" target=\"_blank\" rel=\"noreferrer noopener\">link<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">[6] Wu, X, Nguyen, T., Ce Zhang, D., Yang Wang, W., Luu, A. T. 2024. <a href=\"https:\/\/arxiv.org\/abs\/2405.17978\" rel=\"noreferrer noopener\" target=\"_blank\">FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm<\/a>. arXiv preprint: 2405.17978.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">[7] Mimno, D., Wallach, H., M., Talley, E., Leenders, M, McCallum. A. 2011. <a href=\"https:\/\/aclanthology.org\/D11-1024\/\" rel=\"noreferrer noopener\" target=\"_blank\">Optimizing Semantic Coherence in Topic Models.<\/a> <em>Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing.<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">[8] Prostmaier, B., V\u00e1vra, J., Gr\u00fcn, B., Hofmarcher., P. 2025. 
<a href=\"https:\/\/arxiv.org\/abs\/2503.02741\" rel=\"noreferrer noopener\" target=\"_blank\">Seeded Poisson Factorization: Leveraging domain knowledge to fit topic models.<\/a> arXiv preprint: 2405.17978.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Python tutorial for evaluating top-notch bigram topic models in customer email classification<\/p>\n","protected":false},"author":18,"featured_media":605802,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"is_member_only":false,"sub_heading":"Python tutorial for evaluating top-notch bigram topic models in customer email classification","footnotes":""},"categories":[22],"tags":[11086,453,1120,770,512],"sponsor":[],"coauthors":[30697],"class_list":["post-605801","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","tag-bertopic","tag-editors-pick","tag-model-evaluation","tag-text-classification","tag-topic-modeling"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Choose the Right One: Evaluating Topic Models for Business Intelligence | Towards Data Science<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Choose the Right One: Evaluating Topic Models for Business Intelligence | Towards Data Science\" \/>\n<meta property=\"og:description\" content=\"Python tutorial for evaluating top-notch bigram topic models in customer email classification\" \/>\n<meta property=\"og:url\" 
content=\"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/\" \/>\n<meta property=\"og:site_name\" content=\"Towards Data Science\" \/>\n<meta property=\"article:published_time\" content=\"2025-04-24T19:50:50+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-04-24T19:51:05+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/04\/pushpins-with-thread-route-map-scaled-1.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"2560\" \/>\n\t<meta property=\"og:image:height\" content=\"1652\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Petr Kor\u00e1b\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@TDataScience\" \/>\n<meta name=\"twitter:site\" content=\"@TDataScience\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Petr Kor\u00e1b\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"12 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/\"},\"author\":{\"name\":\"TDS Editors\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee\"},\"headline\":\"Choose the Right One: Evaluating Topic Models for Business Intelligence\",\"datePublished\":\"2025-04-24T19:50:50+00:00\",\"dateModified\":\"2025-04-24T19:51:05+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/\"},\"wordCount\":2215,\"publisher\":{\"@id\":\"https:\/\/towardsdatascience.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/04\/pushpins-with-thread-route-map-scaled-1.jpg\",\"keywords\":[\"Bertopic\",\"Editors Pick\",\"Model Evaluation\",\"Text Classification\",\"Topic Modeling\"],\"articleSection\":[\"Machine Learning\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/\",\"url\":\"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/\",\"name\":\"Choose the Right One: Evaluating Topic Models for Business Intelligence | Towards Data 
Science\",\"isPartOf\":{\"@id\":\"https:\/\/towardsdatascience.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/04\/pushpins-with-thread-route-map-scaled-1.jpg\",\"datePublished\":\"2025-04-24T19:50:50+00:00\",\"dateModified\":\"2025-04-24T19:51:05+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/#primaryimage\",\"url\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/04\/pushpins-with-thread-route-map-scaled-1.jpg\",\"contentUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/04\/pushpins-with-thread-route-map-scaled-1.jpg\",\"width\":2560,\"height\":1652,\"caption\":\"Source: Freepic\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/towardsdatascience.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Choose the Right One: Evaluating Topic Models for Business 
Intelligence\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/towardsdatascience.com\/#website\",\"url\":\"https:\/\/towardsdatascience.com\/\",\"name\":\"Towards Data Science\",\"description\":\"Publish AI, ML &amp; data-science insights to a global community of data professionals.\",\"publisher\":{\"@id\":\"https:\/\/towardsdatascience.com\/#organization\"},\"alternateName\":\"TDS\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/towardsdatascience.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/towardsdatascience.com\/#organization\",\"name\":\"Towards Data Science\",\"alternateName\":\"TDS\",\"url\":\"https:\/\/towardsdatascience.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg\",\"contentUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg\",\"width\":696,\"height\":696,\"caption\":\"Towards Data Science\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/TDataScience\",\"https:\/\/www.youtube.com\/c\/TowardsDataScience\",\"https:\/\/www.linkedin.com\/company\/towards-data-science\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee\",\"name\":\"TDS Editors\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/image\/23494c9101089ad44ae88ce9d2f56aac\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g\",\"caption\":\"TDS 
Editors\"},\"description\":\"Building a vibrant data science and machine learning community. Share your insights and projects with our global audience: bit.ly\/write-for-tds\",\"url\":\"https:\/\/towardsdatascience.com\/author\/towardsdatascience\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Choose the Right One: Evaluating Topic Models for Business Intelligence | Towards Data Science","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/","og_locale":"en_US","og_type":"article","og_title":"Choose the Right One: Evaluating Topic Models for Business Intelligence | Towards Data Science","og_description":"Python tutorial for evaluating top-notch bigram topic models in customer email classification","og_url":"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/","og_site_name":"Towards Data Science","article_published_time":"2025-04-24T19:50:50+00:00","article_modified_time":"2025-04-24T19:51:05+00:00","og_image":[{"width":2560,"height":1652,"url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/04\/pushpins-with-thread-route-map-scaled-1.jpg","type":"image\/jpeg"}],"author":"Petr Kor\u00e1b","twitter_card":"summary_large_image","twitter_creator":"@TDataScience","twitter_site":"@TDataScience","twitter_misc":{"Written by":"Petr Kor\u00e1b","Est. 
reading time":"12 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/#article","isPartOf":{"@id":"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/"},"author":{"name":"TDS Editors","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee"},"headline":"Choose the Right One: Evaluating Topic Models for Business Intelligence","datePublished":"2025-04-24T19:50:50+00:00","dateModified":"2025-04-24T19:51:05+00:00","mainEntityOfPage":{"@id":"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/"},"wordCount":2215,"publisher":{"@id":"https:\/\/towardsdatascience.com\/#organization"},"image":{"@id":"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/#primaryimage"},"thumbnailUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/04\/pushpins-with-thread-route-map-scaled-1.jpg","keywords":["Bertopic","Editors Pick","Model Evaluation","Text Classification","Topic Modeling"],"articleSection":["Machine Learning"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/","url":"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/","name":"Choose the Right One: Evaluating Topic Models for Business Intelligence | Towards Data 
Science","isPartOf":{"@id":"https:\/\/towardsdatascience.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/#primaryimage"},"image":{"@id":"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/#primaryimage"},"thumbnailUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/04\/pushpins-with-thread-route-map-scaled-1.jpg","datePublished":"2025-04-24T19:50:50+00:00","dateModified":"2025-04-24T19:51:05+00:00","breadcrumb":{"@id":"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/#primaryimage","url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/04\/pushpins-with-thread-route-map-scaled-1.jpg","contentUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/04\/pushpins-with-thread-route-map-scaled-1.jpg","width":2560,"height":1652,"caption":"Source: Freepic"},{"@type":"BreadcrumbList","@id":"https:\/\/towardsdatascience.com\/choose-the-right-one-evaluating-topic-models-for-business-intelligence\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/towardsdatascience.com\/"},{"@type":"ListItem","position":2,"name":"Choose the Right One: Evaluating Topic Models for Business Intelligence"}]},{"@type":"WebSite","@id":"https:\/\/towardsdatascience.com\/#website","url":"https:\/\/towardsdatascience.com\/","name":"Towards Data Science","description":"Publish AI, ML &amp; data-science insights to a global community of data 
professionals.","publisher":{"@id":"https:\/\/towardsdatascience.com\/#organization"},"alternateName":"TDS","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/towardsdatascience.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/towardsdatascience.com\/#organization","name":"Towards Data Science","alternateName":"TDS","url":"https:\/\/towardsdatascience.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/","url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg","contentUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg","width":696,"height":696,"caption":"Towards Data Science"},"image":{"@id":"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/TDataScience","https:\/\/www.youtube.com\/c\/TowardsDataScience","https:\/\/www.linkedin.com\/company\/towards-data-science\/"]},{"@type":"Person","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee","name":"TDS Editors","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/image\/23494c9101089ad44ae88ce9d2f56aac","url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","caption":"TDS Editors"},"description":"Building a vibrant data science and machine learning community. 
Share your insights and projects with our global audience: bit.ly\/write-for-tds","url":"https:\/\/towardsdatascience.com\/author\/towardsdatascience\/"}]}},"distributor_meta":false,"distributor_terms":false,"distributor_media":false,"distributor_original_site_name":"TDS Contributor Portal","distributor_original_site_url":"https:\/\/contributor.insightmediagroup.io","push-errors":false,"_links":{"self":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts\/605801","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/users\/18"}],"replies":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/comments?post=605801"}],"version-history":[{"count":0,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts\/605801\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/media\/605802"}],"wp:attachment":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/media?parent=605801"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/categories?post=605801"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/tags?post=605801"},{"taxonomy":"sponsor","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/sponsor?post=605801"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/coauthors?post=605801"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}