{"id":595815,"date":"2025-01-22T18:02:13","date_gmt":"2025-01-22T18:02:13","guid":{"rendered":"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/"},"modified":"2025-02-03T14:08:17","modified_gmt":"2025-02-03T14:08:17","slug":"topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37","status":"publish","type":"post","link":"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/","title":{"rendered":"Topic Modelling in Business Intelligence: FASTopic and BERTopic in Code"},"content":{"rendered":"<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"eae4de\" data-has-transparency=\"false\" style=\"--dominant-color: #eae4de;\" loading=\"lazy\" decoding=\"async\" width=\"2560\" height=\"1658\" class=\"wp-image-595816 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1BMLekG4e8DwNWMfQpRy4ag-scaled.jpeg\" alt=\"Source: Freepik, Image by rawpixel.com\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1BMLekG4e8DwNWMfQpRy4ag-scaled.jpeg 2560w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1BMLekG4e8DwNWMfQpRy4ag-300x194.jpeg 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1BMLekG4e8DwNWMfQpRy4ag-1024x663.jpeg 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1BMLekG4e8DwNWMfQpRy4ag-768x497.jpeg 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1BMLekG4e8DwNWMfQpRy4ag-1536x995.jpeg 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1BMLekG4e8DwNWMfQpRy4ag-2048x1327.jpeg 2048w\" sizes=\"auto, (max-width: 2560px) 100vw, 2560px\" \/><figcaption class=\"wp-element-caption\">Source: <a 
href=\"https:\/\/www.freepik.com\/free-vector\/illustration-network-shareing_2808109.htm#fromView=search&amp;page=2&amp;position=3&amp;uuid=c11e1dc6-4be2-45d5-9c7b-29bc16bc69f0\">Freepik<\/a>, Image by rawpixel.com<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\"><strong>Customer reviews<\/strong> of products and services provide valuable information about customer satisfaction. They offer insight into what should be improved across the product development cycle. Dynamic topic models in business intelligence can identify key product qualities and other satisfaction factors, cluster them into categories, and evaluate how business decisions materialized in customer satisfaction over time. This information is highly valuable, and not only for product managers.<\/p>\n<p class=\"wp-block-paragraph\">This article compares two of the latest topic models for classifying customer complaints data. <strong>BERTopic<\/strong> by <a href=\"https:\/\/arxiv.org\/pdf\/2203.05794\">Maarten Grootendorst (2022)<\/a> and the recent <strong><a href=\"https:\/\/pypi.org\/project\/fastopic\/\">FASTopic<\/a><\/strong> by <a href=\"https:\/\/arxiv.org\/pdf\/2405.17978\">Xiaobao Wu et al. (2024)<\/a>, presented at last year&#8217;s <a href=\"https:\/\/neurips.cc\/virtual\/2024\/poster\/96416\">NeurIPS<\/a>, are the current leading models for topic analytics of customer data. For these models, we&#8217;ll explore in Python code:<\/p>\n<ul class=\"wp-block-list\">\n<li>how to effectively <strong>preprocess data<\/strong><\/li>\n<li>how to train a <strong>bigram topic model<\/strong> for customer complaint analysis<\/li>\n<li>how to model <strong>topic activity<\/strong> over time.<\/li>\n<\/ul>\n<h1 class=\"wp-block-heading\">1. Customer complaints data in companies<\/h1>\n<p class=\"wp-block-paragraph\">Complaints data are generated through interactions with customers and are typically recorded in <a href=\"https:\/\/www.sap.com\/products\/erp\/what-is-erp.html\">ERP systems<\/a>. 
There are many channels where customers can raise a concern about a product or service. Here are just a few examples:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Email<\/strong>: email communication is stored for the BI team, e.g., in an SQL database.<\/li>\n<li><strong>After-purchase survey<\/strong>: feedback sent to customers after a product purchase. Companies either send the emails themselves or use a price comparison website (e.g., <a href=\"https:\/\/www.billiger-mietwagen.de\/\">Billiger<\/a> in Germany) where customers order the product.<\/li>\n<li><strong>Phone transcriptions<\/strong>: with prior consent from the customer, some companies record phone communication with customers, which is then available to the BI team.<\/li>\n<li><strong>Google reviews:<\/strong> customers leave comments and reviews on products and services worldwide. Google enables authorized users to <a href=\"https:\/\/takeout.google.com\/?pli=1\">export the data<\/a> for text mining and other purposes.<\/li>\n<li><strong>Review platforms:<\/strong> independent review platforms (such as <a href=\"https:\/\/www.trustpilot.com\/\">Trustpilot<\/a>) offer customers a place to provide feedback to brands and companies. This data is available through <a href=\"https:\/\/developers.trustpilot.com\/service-reviews-api\">various APIs<\/a>.<\/li>\n<li><strong>Social media conversations<\/strong>: Instagram, X, and Facebook are full of product- or brand-related comments. The simplest way to collect the data is through an official API. For Instagram and Facebook, go to the <a href=\"https:\/\/developers.facebook.com\/\">developers&#8217; portal<\/a> to receive an API key. X works the <a href=\"https:\/\/developer.x.com\/en\/docs\/x-api\">same way<\/a>.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\">2. 
Example data<\/h2>\n<p class=\"wp-block-paragraph\">As example data, we&#8217;ll use the <strong><a href=\"https:\/\/huggingface.co\/datasets\/NiyatiC\/amazon_food_reviews\">Amazon Dog Food Reviews<\/a><\/strong> dataset from <strong>Hugging Face<\/strong>, released under the <a href=\"https:\/\/github.com\/huggingface\/datasets\/blob\/main\/LICENSE\">Apache-2.0 license<\/a>. The subset for topic modeling contains only 3,693 customer reviews, collected between 02\/01\/2016 and 31\/12\/2020. Here is what the data looks like:<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"ededed\" data-has-transparency=\"true\" style=\"--dominant-color: #ededed;\" loading=\"lazy\" decoding=\"async\" width=\"1651\" height=\"590\" class=\"wp-image-595817 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1eEI-zr7zQNI6Tsau4QW-ww.png\" alt=\"Image 1. Amazon dog food reviews dataset\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1eEI-zr7zQNI6Tsau4QW-ww.png 1651w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1eEI-zr7zQNI6Tsau4QW-ww-300x107.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1eEI-zr7zQNI6Tsau4QW-ww-1024x366.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1eEI-zr7zQNI6Tsau4QW-ww-768x274.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1eEI-zr7zQNI6Tsau4QW-ww-1536x549.png 1536w\" sizes=\"auto, (max-width: 1651px) 100vw, 1651px\" \/><figcaption class=\"wp-element-caption\">Image 1. 
Amazon dog food reviews dataset<\/figcaption><\/figure>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"f0f0f0\" data-has-transparency=\"false\" style=\"--dominant-color: #f0f0f0;\" loading=\"lazy\" decoding=\"async\" width=\"3338\" height=\"125\" class=\"wp-image-595818 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1fIP08_8zJJTRcWAn6DRdiQ.png\" alt=\"Image 2. General preprocessing steps for (customer feedback) topic modeling. Image by author\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1fIP08_8zJJTRcWAn6DRdiQ.png 3338w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1fIP08_8zJJTRcWAn6DRdiQ-300x11.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1fIP08_8zJJTRcWAn6DRdiQ-1024x38.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1fIP08_8zJJTRcWAn6DRdiQ-768x29.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1fIP08_8zJJTRcWAn6DRdiQ-1536x58.png 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1fIP08_8zJJTRcWAn6DRdiQ-2048x77.png 2048w\" sizes=\"auto, (max-width: 3338px) 100vw, 3338px\" \/><figcaption class=\"wp-element-caption\">Image 2. General preprocessing steps for (customer feedback) topic modeling. Image by author<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">3. Data preprocessing<\/h2>\n<p class=\"wp-block-paragraph\">Processing data systematically in the right order preserves the essential information and does not introduce new bias. Let&#8217;s walk through these steps:<\/p>\n<ul class=\"wp-block-list\">\n<li><em><strong>#1: Numbers:<\/strong><\/em> digits are typically the characters to remove in the first step.<\/li>\n<li><em><strong>#2: Emoticons:<\/strong><\/em> product reviews are typically full of them. 
For topic modeling in customer reviews, emojis don&#8217;t have much significance.<\/li>\n<li><em><strong>#3: Stopwords:<\/strong><\/em> apart from <a href=\"https:\/\/www.geeksforgeeks.org\/removing-stop-words-nltk-python\/\">standard stopwords<\/a>, it is common to also remove words from an <a href=\"https:\/\/github.com\/PetrKorab\/Arabica\/blob\/main\/stopwords_extended.py\">extended<\/a> stopwords list for one or more languages.<\/li>\n<li><em><strong>#4: Punctuation:<\/strong><\/em> natural language has a myriad of special characters and punctuation marks, which should be cleaned in this step.<\/li>\n<li><em><strong>#5: Additional stopwords:<\/strong><\/em> depending on the use case, some additional words are also useful to remove. With the Amazon dog food reviews, these are <em>&quot;dog&quot;, &quot;food&quot;, &quot;blue&quot;, &quot;buffalo&quot;, &quot;ha&quot;, &quot;month&quot;<\/em>, and <em>&quot;ago&quot;.<\/em><\/li>\n<\/ul>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>&quot;Delivery&quot; and &quot;deliveries&quot;, &quot;box&quot; and &quot;Boxes&quot;, or &quot;Price&quot; and &quot;prices&quot; share the same word root, but without lemmatization, topic models would model them as separate factors. That&#8217;s why product reviews should always be lemmatized in the last step of preprocessing.<\/p><\/blockquote>\n<ul class=\"wp-block-list\">\n<li><em><strong>#6: Lemmatization:<\/strong><\/em> groups a word&#8217;s inflected forms into a single form (the lemma), keeping the root&#8217;s information and semantics.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">Text preprocessing is model-specific:<\/p>\n<ul class=\"wp-block-list\">\n<li><em><strong>FASTopic<\/strong><\/em> works with clean data on input; some cleaning (stopwords) can be done during the training. 
The simplest and most effective way is to use <em><strong><a href=\"https:\/\/washer.textminingstories.com\/\">Washer: The no-code app for text data cleaning<\/a><\/strong><\/em>, which prepares data for text mining projects without any coding.<\/li>\n<li><em><strong>BERTopic:<\/strong><\/em> the <a href=\"https:\/\/maartengr.github.io\/BERTopic\/faq.html#how-do-i-reduce-topic-outliers\">documentation<\/a> recommends that &quot;<em>removing stop words as a preprocessing step is not advised as the transformer-based embedding models that we use need the full context to create accurate embeddings<\/em>&quot;. The transformer embeddings work on natural text, not on text stripped of stopwords or reduced to lemmas and tokens. For this reason, cleaning operations should be included in the model training.<\/li>\n<\/ul>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"dbbcb2\" data-has-transparency=\"false\" style=\"--dominant-color: #dbbcb2;\" loading=\"lazy\" decoding=\"async\" width=\"2560\" height=\"1280\" class=\"wp-image-595819 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/13f-unVEhotrTKCwI-VtWw-scaled.jpeg\" alt=\"Source: Freepik, Image by macrovector\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/13f-unVEhotrTKCwI-VtWw-scaled.jpeg 2560w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/13f-unVEhotrTKCwI-VtWw-300x150.jpeg 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/13f-unVEhotrTKCwI-VtWw-1024x512.jpeg 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/13f-unVEhotrTKCwI-VtWw-768x384.jpeg 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/13f-unVEhotrTKCwI-VtWw-1536x768.jpeg 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/13f-unVEhotrTKCwI-VtWw-2048x1024.jpeg 2048w\" sizes=\"auto, (max-width: 2560px) 100vw, 2560px\" \/><figcaption class=\"wp-element-caption\">Source: <a 
href=\"https:\/\/www.freepik.com\/free-vector\/romantic-train-trip-background_5972223.htm#fromView=search&amp;page=1&amp;position=24&amp;uuid=1b2ccb61-1b56-453e-b67a-befe1eba6858\">Freepik<\/a>, Image by macrovector<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">4. Topic modeling with top-notch models<\/h2>\n<p class=\"wp-block-paragraph\">Let&#8217;s now check how the satisfaction factors are distributed across the topics. The questions we ask here are:<\/p>\n<ul class=\"wp-block-list\">\n<li><em>What were the key problems and qualities customers reported on the product?<\/em><\/li>\n<li><em>How has product satisfaction changed over time?<\/em><\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">The <a href=\"https:\/\/arxiv.org\/pdf\/2203.05794\">BERTopic<\/a> and <a href=\"https:\/\/arxiv.org\/pdf\/2405.17978\">FASTopic<\/a> papers describe the model architectures in detail. Also, my <a href=\"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\">TDS tutorial<\/a> on topic modeling explains topic classification with BERTopic on a political speech dataset.<\/p>\n<h3 class=\"wp-block-heading\">4.1. FASTopic<\/h3>\n<p class=\"wp-block-paragraph\">Import the libraries and the data (complete code and the requirements are <a href=\"https:\/\/github.com\/PetrKorab\/Topic-Modelling-in-Business-Intelligence-BERTopic-and-FASTopic-in-Code\/tree\/main\">here<\/a>). Then, create a list of clean reviews:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import pandas as pd\nfrom fastopic import FASTopic\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom topmost.preprocessing import Preprocessing\n\n# create a list of reviews\ndocs = data[&#039;clean_text&#039;].tolist()<\/code><\/pre>\n<p class=\"wp-block-paragraph\">In FASTopic, bigram generation is not directly implemented. To solve this, we will make a bigram preprocessing class. 
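The core trick can be shown in isolation first. Here is a minimal, dependency-free sketch (the helper name to_bigram_tokens is illustrative, not part of FASTopic):

```python
def to_bigram_tokens(doc):
    # split on whitespace and pair each word with its right neighbor,
    # joining the pair with "_" so the model sees one token per bigram
    words = doc.lower().split()
    return ["_".join(pair) for pair in zip(words, words[1:])]

print(to_bigram_tokens("Great price fast delivery"))
# ['great_price', 'price_fast', 'fast_delivery']
```

The preprocessing class below does the same thing through scikit-learn&#039;s CountVectorizer analyzer, which also handles tokenization and vocabulary limits.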
The model treats each bigram as a single token, so we join the words in each bigram with an underscore.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># custom preprocessing class with bigram generation\nclass NgramPreprocessing:\n    def __init__(self, ngram_range=(1, 1),\n                       vocab_size=10000,\n                       stopwords=&#039;English&#039;):\n\n        self.ngram_range = ngram_range\n        self.preprocessing = Preprocessing(vocab_size=vocab_size,\n                                           stopwords=stopwords)\n\n        # use a custom analyzer to join bigram words with &quot;_&quot;\n        self.vectorizer = CountVectorizer(ngram_range=self.ngram_range,\n                                          max_features=vocab_size,\n                                          analyzer=self._custom_analyzer)\n\n    def _custom_analyzer(self, doc):\n        # tokenize the document and create bigrams\n        tokens = CountVectorizer(ngram_range=self.ngram_range).build_analyzer()(doc)\n\n        # replace spaces in bigrams with &quot;_&quot;\n        return [token.replace(&quot; &quot;, &quot;_&quot;) for token in tokens]\n\n    def preprocess(self,\n                   docs,\n                   pretrained_WE=False):\n\n        parsed_docs = self.preprocessing.preprocess(docs,\n                      pretrained_WE=pretrained_WE)[&quot;train_texts&quot;]\n        train_bow = self.vectorizer.fit_transform(parsed_docs).toarray()\n        rst = {\n            &quot;train_bow&quot;: train_bow,\n            &quot;train_texts&quot;: parsed_docs,\n            &quot;vocab&quot;: self.vectorizer.get_feature_names_out()\n        }\n        return rst\n\n# initialize preprocessing with bigrams\nngram_preprocessing = NgramPreprocessing(ngram_range=(2, 2))<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Let&#8217;s train the model for eight topics and display 
the top 20 bigrams for each topic in a data frame. The model trains on the underscore-joined bigrams as single tokens; after training, we strip the underscores for readability.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># model training\nmodel = FASTopic(8, ngram_preprocessing, num_top_words=10000)\n\n# fit model to documents\ntopic_top_words, doc_topic_dist = model.fit_transform(docs)\n\n# retrieve the top 20 bigrams for each topic, keeping only the word columns\nmax_bigrams = 20\n\ntopic_dfs = [\n    pd.DataFrame(model.get_topic(topic, max_bigrams),\n                 columns=[f&quot;Topic_{topic}_word&quot;, f&quot;Topic_{topic}_prob&quot;])[[f&quot;Topic_{topic}_word&quot;]]\n    for topic in range(8)\n]\n\n# concatenate the dataframes\ntopics_df = pd.concat(topic_dfs, axis=1)\n\n# remove underscores from bigrams\ntopics_df = topics_df.applymap(lambda x: x.replace(&#039;_&#039;, &#039; &#039;) if isinstance(x, str) else 
x)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">We&#8217;ve modeled the customer satisfaction factors with a dog food product in eight distinct topics. Here are the manually annotated topic names:<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"dce3d7\" data-has-transparency=\"true\" style=\"--dominant-color: #dce3d7;\" loading=\"lazy\" decoding=\"async\" width=\"1637\" height=\"636\" class=\"wp-image-595820 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1uO8fysZ-mbmaNmpqcnTRcw.png\" alt=\"Image 3: Satisfaction factors modeling with FASTopic. Image by author\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1uO8fysZ-mbmaNmpqcnTRcw.png 1637w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1uO8fysZ-mbmaNmpqcnTRcw-300x117.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1uO8fysZ-mbmaNmpqcnTRcw-1024x398.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1uO8fysZ-mbmaNmpqcnTRcw-768x298.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1uO8fysZ-mbmaNmpqcnTRcw-1536x597.png 1536w\" sizes=\"auto, (max-width: 1637px) 100vw, 1637px\" \/><figcaption class=\"wp-element-caption\">Image 3: Satisfaction factors modeling with FASTopic. 
Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">FASTopic returns relatively distinct topics, sorting the comments of the customers:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>0: Negative health effects,<\/strong> <em>&quot;sensitive stomach&quot;, &quot;small bite&quot;, &quot;stomach issue&quot;, &quot;lose weight&quot;, &quot;refuse eat&quot;, &quot;taste wild&quot;, &quot;digestive issue&quot;, &quot;upset stomach&quot;, &quot;stop eat&quot;, &quot;gain weight&quot;<\/em><\/li>\n<li><strong>1: Food quality,<\/strong> <em>&quot;love flavor&quot;, &quot;quality ingredient&quot;, &quot;good ingredient&quot;, &quot;healthy ingredient&quot;, &quot;ingredient quality&quot;, &quot;flavor good&quot;, &quot;taste great&quot;, &quot;healthy love&quot;, &quot;great healthy&quot;, &quot;good healthy&quot;, &quot;good health&quot;, &#8230;<\/em><\/li>\n<li><strong>2: Positive health effects,<\/strong> <em>&quot;healthy fur&quot;, &quot;awesome pup&quot;, &quot;eye bright&quot;<\/em><\/li>\n<li><strong>3: Digestion effects,<\/strong> <em>&quot;smell bad&quot;, &quot;runny poop&quot;, &quot;horrible gas&quot;, &quot;diarrhea vet&quot;, &quot;terrible diarrhea,&quot; &quot;sick week&quot;, &quot;sick buy&quot;, &quot;day vomit&quot;<\/em><\/li>\n<li><strong>4: Pricing,<\/strong> <em>&quot;great price&quot;, &quot;good price&quot;, &quot;love price&quot;, &quot;price great&quot;, &quot;love cheap&quot;, &quot;price deliver&quot;, &quot;great deal&quot;, &quot;price increase&quot;, &quot;free shipping&quot;, &#8230;<\/em><\/li>\n<li><strong>5: Other,<\/strong> other factors.<\/li>\n<li><strong>6: Fur effects,<\/strong> <em>&quot;coat shiny&quot;, &quot;fur baby&quot;, &quot;skin issue&quot;, &quot;shiny coat&quot;, &quot;love coat&quot;, &quot;coat soft&quot;<\/em><\/li>\n<li><strong>7: Delivery,<\/strong> <em>&quot;open box&quot;, &quot;bag rip&quot;, &quot;big bag&quot;, &quot;hole bag&quot;, &quot;open bag&quot;, &quot;inside box&quot;, &quot;bag 
open&quot;, &quot;bag hole&quot;, &quot;heavy bag&quot;, &quot;rip open&quot;, &#8230;<\/em><\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">It is also useful to check the weight of these categories in the data. The full code is <a href=\"https:\/\/github.com\/PetrKorab\/Topic-Modelling-in-Business-Intelligence-BERTopic-and-FASTopic-in-Code\/blob\/main\/FASTopic.ipynb\">here<\/a>.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>We&#8217;ve modeled the customer satisfaction factors for a dog food product. But why is this beneficial for companies? Dynamic topic models offer a straightforward way of monitoring customer satisfaction over time. They indicate product-related problems and help take the right measures. Once the business decisions are put into action, topic models can check whether they have an effect over time.<\/p><\/blockquote>\n<p class=\"wp-block-paragraph\">To do so, let&#8217;s model topic activity over time at a quarterly frequency.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import plotly.graph_objects as go\n\n# convert date column to datetime\ndata[&#039;time&#039;] = pd.to_datetime(data[&#039;time&#039;])\n\n# format date column to quarterly periods\ndata[&#039;date_quarterly&#039;] = data[&#039;time&#039;].dt.to_period(&#039;Q&#039;).astype(str)\n\nperiods = data[&#039;date_quarterly&#039;].tolist()\n\n# calculate topic activity over time\nact = model.topic_activity_over_time(periods)\n\n# visualize topic activity\nfig = model.visualize_topic_activity(top_n=8, topic_activity=act, time_slices=periods)\n\n# update legend to display only the topic number\nfig.data = sorted(fig.data, key=lambda trace: trace.name)\n\nfor trace in fig.data:\n    trace.name = trace.name[0]\n\n# update the layout\nfig.update_layout(\n    width=1200,\n    height=600,\n    title=&#039;&#039;,\n    legend_title_text=&#039;Topic&#039;,\n    xaxis_tickangle=45         # set x-axis labels to 45-degree 
angle\n)\n\n# show the figure\nfig.show()<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The delivery problems in topic 7 peaked in Q3 2018. Customers complained about open and ripped boxes much more often, but these problems were fixed in early 2019 (see the picture below).<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"f9faf8\" data-has-transparency=\"true\" style=\"--dominant-color: #f9faf8;\" loading=\"lazy\" decoding=\"async\" width=\"1200\" height=\"600\" class=\"wp-image-595821 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1V0TDmxeuw_fxPINZb4mAHw.png\" alt=\"Image 4: Topic activity over time, FASTopic. Image by author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1V0TDmxeuw_fxPINZb4mAHw.png 1200w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1V0TDmxeuw_fxPINZb4mAHw-300x150.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1V0TDmxeuw_fxPINZb4mAHw-1024x512.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1V0TDmxeuw_fxPINZb4mAHw-768x384.png 768w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><figcaption class=\"wp-element-caption\">Image 4: Topic activity over time, FASTopic. Image by author.<\/figcaption><\/figure>\n<h3 class=\"wp-block-heading\">4.2. BERTopic<\/h3>\n<p class=\"wp-block-paragraph\">BERTopic implements bigrams with <em><a href=\"https:\/\/scikit-learn.org\/1.5\/modules\/generated\/sklearn.feature_extraction.text.CountVectorizer.html\">vectorizer_model<\/a><\/em>, which also works as a data processing pipeline. 
The code and the requirements are <a href=\"https:\/\/github.com\/PetrKorab\/Topic-Modelling-in-Business-Intelligence-BERTopic-and-FASTopic-in-Code\/tree\/main\">here<\/a>.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from bertopic import BERTopic\nfrom umap import UMAP\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom nltk.corpus import stopwords\nimport nltk\nfrom nltk import word_tokenize\nfrom nltk.stem import WordNetLemmatizer\nimport pandas as pd\nimport re\n\n# download the NLTK resources used below\nnltk.download(&#039;stopwords&#039;)\nnltk.download(&#039;punkt&#039;)    # needed by word_tokenize\nnltk.download(&#039;wordnet&#039;)  # needed by WordNetLemmatizer\n\n# create a list of reviews\ndocs = data[&#039;text&#039;].tolist()<\/code><\/pre>\n<p class=\"wp-block-paragraph\">We train on raw data and clean it with the vectorizer. During training, the vectorizer removes numbers and stopwords from the data, returning lemmatized tokens for the bigram model.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># create stopwords list\nstandard_stopwords = list(stopwords.words(&#039;english&#039;))\n\n# extended list of English stopwords\nstopwords_extended = [ &quot;0o&quot;,  ..]      
\n\n# additional tokens to remove\nadditional_stopwords = [&#039;blue&#039;,&#039;buffalo&#039;,&#039;dog&#039;,&#039;food&#039;,&#039;ha&#039;,&#039;month&#039;,&#039;ago&#039;]\n\n# combine standard, extended stopwords, and additional tokens\nfull_stopwords = (standard_stopwords\n                  + additional_stopwords\n                  + stopwords_extended)\n\n# define tokenizer returning lemmatized text without numbers\nclass LemmaTokenizer:\n    def __init__(self):\n        self.wnl = WordNetLemmatizer()\n    def __call__(self, doc):\n        doc = re.sub(r&#039;\\d+&#039;, &#039;&#039;, doc)  # clean numbers\n        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]  # lemmatize\n\n# vectorizer handles data processing and generates bigrams\nvectorizer_model = CountVectorizer(tokenizer=LemmaTokenizer(),\n                                   ngram_range=(2, 2),\n                                   stop_words=full_stopwords)\n\n# set up model\nmodel = BERTopic(n_gram_range=(2, 2),  # returns bigrams\n                 nr_topics=9,          # generate 9 topics; topic -1 collects outliers\n                 top_n_words=20,       # return top 20 bigrams\n                 min_topic_size=20,    # each topic contains at least 20 documents\n                 vectorizer_model=vectorizer_model,\n                 umap_model=UMAP(random_state=1))  # set seed so topics reproduce\n\n# fit model to data\ntopics, probabilities = model.fit_transform(docs)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Next, let&#8217;s prepare a dataframe with tokens from the model.<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># retrieve bigrams for each topic and select only the word columns\ntopic_dfs = [\n    pd.DataFrame(model.get_topic(topic),\n                 columns=[f&quot;Topic_{topic}_word&quot;, f&quot;Topic_{topic}_prob&quot;])[[f&quot;Topic_{topic}_word&quot;]]\n    for topic in range(8)\n]\n\n# concatenate the dataframes\ntopics_df = pd.concat(topic_dfs, axis=1)<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The annotated topics show a similar categorization to FASTopic. The differences are that BERTopic sorts Spanish tokens into a separate topic (T7) and fills T1 and T5 with adjectives with positive meaning. Delivery problems in T4 are identical to FASTopic&#8217;s classification.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"e8dedc\" data-has-transparency=\"true\" style=\"--dominant-color: #e8dedc;\" loading=\"lazy\" decoding=\"async\" width=\"1826\" height=\"746\" class=\"wp-image-595822 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1vtBKPld2fjsBxlz5bnaqqw.png\" alt=\"Image 5: Satisfaction factors modeling with BERTopic. 
Image by author\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1vtBKPld2fjsBxlz5bnaqqw.png 1826w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1vtBKPld2fjsBxlz5bnaqqw-300x123.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1vtBKPld2fjsBxlz5bnaqqw-1024x418.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1vtBKPld2fjsBxlz5bnaqqw-768x314.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1vtBKPld2fjsBxlz5bnaqqw-1536x628.png 1536w\" sizes=\"auto, (max-width: 1826px) 100vw, 1826px\" \/><figcaption class=\"wp-element-caption\">Image 5: Satisfaction factors modeling with BERTopic. Image by author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Again, let&#8217;s focus on topic activity over time, which gives dynamic topic models additional value for BI. BERTopic uses token frequencies for topic activity analysis (unlike FASTopic, which uses topic weights).<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># topic activity over time\nimport plotly.graph_objects as go\n\n# create timestamps\ndata[&#039;time&#039;] = pd.to_datetime(data[&#039;time&#039;])\ntimestamps = data[&#039;time&#039;].to_list()\n\n# generate topics over time; 20 bins roughly correspond to quarterly frequency\ntopics_over_time = model.topics_over_time(docs, timestamps, nr_bins=20)\n\n# filter out topic -1 containing outliers\ntopics_over_time_filtered = topics_over_time[topics_over_time[&#039;Topic&#039;] != -1]\n\n# visualize the filtered topics over time\nfig = model.visualize_topics_over_time(topics_over_time_filtered)\n\n# update legend to display only the topic number\nfig.data = sorted(fig.data, key=lambda trace: trace.name)\n\nfor trace in fig.data:\n    trace.name = trace.name[0]\n\n# update the layout\nfig.update_layout(\n    width=1200,\n    height=600,\n    title=&#039;&#039;,\n    legend_title_text=&#039;Topic&#039;,\n    xaxis_tickangle=45           # set 
x-axis labels to 45-degree angle\n)\n\n# show the figure\nfig.show()<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Most topics are stable over time, except T4, which categorizes delivery problems. As with FASTopic, BERTopic shows that customers&#8217; negative complaints about damaged boxes rose in mid-2018.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"fbfbf9\" data-has-transparency=\"true\" style=\"--dominant-color: #fbfbf9;\" loading=\"lazy\" decoding=\"async\" width=\"1200\" height=\"600\" class=\"wp-image-595823 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/13VYU_pNq_FeIWj0eIjhBVg.png\" alt=\"Image 6: Topic activity over time, BERTopic. Image by author.\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/13VYU_pNq_FeIWj0eIjhBVg.png 1200w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/13VYU_pNq_FeIWj0eIjhBVg-300x150.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/13VYU_pNq_FeIWj0eIjhBVg-1024x512.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/13VYU_pNq_FeIWj0eIjhBVg-768x384.png 768w\" sizes=\"auto, (max-width: 1200px) 100vw, 1200px\" \/><figcaption class=\"wp-element-caption\">Image 6: Topic activity over time, BERTopic. Image by author.<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Summary<\/h2>\n<p class=\"wp-block-paragraph\">Both models indicated delivery problems in mid-2018, which vanished in early 2019. With a topic model API monitoring customer comments on various channels, these problems can be fixed before they have a harmful effect on the brand.<\/p>\n<p class=\"wp-block-paragraph\">The right <strong>data processing is essential<\/strong> for topic models to make sense in the applied world. Cleaning text in the right order minimizes the bias of each cleaning operation. Numbers and emoticons are typically removed first, followed by stopwords. 
Punctuation is cleaned afterward so that stopwords don&#8217;t break up into two tokens (&quot;we&#8217;ve&quot; -&gt; &quot;we&quot; + &#8216;ve&quot;). Additional tokens are removed in the next step in the clean data before lemmatization, which unifies tokens with the same semantics.<\/p>\n<p class=\"wp-block-paragraph\"><strong>FASTopic<\/strong> deserves much better <a href=\"https:\/\/github.com\/BobXWu\/FASTopic\">documentation<\/a>, which now provides only basic information. Especially because its (1) simplicity of use and (2) stability in training on small datasets makes it a top-notch alternative to BERTopic. It is mainly practical for small companies like e-shops that typically don&#8217;t collect large text datasets and seek simple and efficient solutions. Data and full codes for this tutorial <a href=\"https:\/\/github.com\/PetrKorab\/Topic-Modelling-in-Business-Intelligence-BERTopic-and-FASTopic-in-Code\/tree\/main\">here<\/a>.<\/p>\n<p class=\"wp-block-paragraph\"><em>If you enjoy my work, you can invite me <a href=\"https:\/\/www.buymeacoffee.com\/petrkorab\">for coffee<\/a> and support my writing. You can also subscribe to my <a href=\"https:\/\/medium.com\/subscribe\/@petrkorab\">email list<\/a> to get notified about my new articles. Thanks!<\/em><\/p>\n<h2 class=\"wp-block-heading\">References<\/h2>\n<p class=\"wp-block-paragraph\">[1] Grootendorst (2022). Bertopic: Neural Topic Modeling With A Class-Based TF-IDF Procedure. <em><a href=\"https:\/\/arxiv.org\/abs\/2203.05794\">Computer Science.<\/a><\/em><\/p>\n<p class=\"wp-block-paragraph\">[2] Wu, X, Nguyen, T., Ce Zhang, D., Yang Wang, W., Luu, A. T. (2024). FASTopic: <a href=\"https:\/\/arxiv.org\/abs\/2405.17978\">A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm<\/a>. 
arXiv preprint: 2405.17978.<\/p>","protected":false},"excerpt":{"rendered":"<p>A comparison of two cutting-edge dynamic topic models solving consumer complaints classification exercise<\/p>\n","protected":false},"author":18,"featured_media":595816,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"is_member_only":true,"sub_heading":"A comparison of two cutting-edge dynamic topic models solving consumer complaints classification exercise","footnotes":""},"categories":[22,23],"tags":[11086,446,568,1604,7172],"sponsor":[],"coauthors":[30697],"class_list":["post-595815","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","category-nlp","tag-bertopic","tag-machine-learning","tag-nlp","tag-text-mining","tag-topic-modelling"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Topic Modelling in Business Intelligence: FASTopic and BERTopic in Code | Towards Data Science<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Topic Modelling in Business Intelligence: FASTopic and BERTopic in Code | Towards Data Science\" \/>\n<meta property=\"og:description\" content=\"A comparison of two cutting-edge dynamic topic models solving consumer complaints classification exercise\" \/>\n<meta property=\"og:url\" content=\"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/\" \/>\n<meta property=\"og:site_name\" content=\"Towards Data Science\" \/>\n<meta 
property=\"article:published_time\" content=\"2025-01-22T18:02:13+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-02-03T14:08:17+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1BMLekG4e8DwNWMfQpRy4ag-scaled.jpeg\" \/>\n\t<meta property=\"og:image:width\" content=\"2560\" \/>\n\t<meta property=\"og:image:height\" content=\"1658\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Petr Kor\u00e1b\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@TDataScience\" \/>\n<meta name=\"twitter:site\" content=\"@TDataScience\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Petr Kor\u00e1b\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"14 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/\"},\"author\":{\"name\":\"TDS Editors\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee\"},\"headline\":\"Topic Modelling in Business Intelligence: FASTopic and BERTopic in 
Code\",\"datePublished\":\"2025-01-22T18:02:13+00:00\",\"dateModified\":\"2025-02-03T14:08:17+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/\"},\"wordCount\":1766,\"publisher\":{\"@id\":\"https:\/\/towardsdatascience.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1BMLekG4e8DwNWMfQpRy4ag-scaled.jpeg\",\"keywords\":[\"Bertopic\",\"Machine Learning\",\"NLP\",\"Text Mining\",\"Topic Modelling\"],\"articleSection\":[\"Machine Learning\",\"Natural Language Processing\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/\",\"url\":\"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/\",\"name\":\"Topic Modelling in Business Intelligence: FASTopic and BERTopic in Code | Towards Data 
Science\",\"isPartOf\":{\"@id\":\"https:\/\/towardsdatascience.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1BMLekG4e8DwNWMfQpRy4ag-scaled.jpeg\",\"datePublished\":\"2025-01-22T18:02:13+00:00\",\"dateModified\":\"2025-02-03T14:08:17+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/#primaryimage\",\"url\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1BMLekG4e8DwNWMfQpRy4ag-scaled.jpeg\",\"contentUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1BMLekG4e8DwNWMfQpRy4ag-scaled.jpeg\",\"width\":2560,\"height\":1658,\"caption\":\"Source: Freepic, Image by rawpixel.com\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/towardsdatascience.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Topic Modelling in Business Intelligence: FASTopic and BERTopic in 
Code\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/towardsdatascience.com\/#website\",\"url\":\"https:\/\/towardsdatascience.com\/\",\"name\":\"Towards Data Science\",\"description\":\"Publish AI, ML &amp; data-science insights to a global community of data professionals.\",\"publisher\":{\"@id\":\"https:\/\/towardsdatascience.com\/#organization\"},\"alternateName\":\"TDS\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/towardsdatascience.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/towardsdatascience.com\/#organization\",\"name\":\"Towards Data Science\",\"alternateName\":\"TDS\",\"url\":\"https:\/\/towardsdatascience.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg\",\"contentUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg\",\"width\":696,\"height\":696,\"caption\":\"Towards Data Science\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/TDataScience\",\"https:\/\/www.youtube.com\/c\/TowardsDataScience\",\"https:\/\/www.linkedin.com\/company\/towards-data-science\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee\",\"name\":\"TDS Editors\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/image\/23494c9101089ad44ae88ce9d2f56aac\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g\",\"caption\":\"TDS 
Editors\"},\"description\":\"Building a vibrant data science and machine learning community. Share your insights and projects with our global audience: bit.ly\/write-for-tds\",\"url\":\"https:\/\/towardsdatascience.com\/author\/towardsdatascience\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Topic Modelling in Business Intelligence: FASTopic and BERTopic in Code | Towards Data Science","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/","og_locale":"en_US","og_type":"article","og_title":"Topic Modelling in Business Intelligence: FASTopic and BERTopic in Code | Towards Data Science","og_description":"A comparison of two cutting-edge dynamic topic models solving consumer complaints classification exercise","og_url":"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/","og_site_name":"Towards Data Science","article_published_time":"2025-01-22T18:02:13+00:00","article_modified_time":"2025-02-03T14:08:17+00:00","og_image":[{"width":2560,"height":1658,"url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1BMLekG4e8DwNWMfQpRy4ag-scaled.jpeg","type":"image\/jpeg"}],"author":"Petr Kor\u00e1b","twitter_card":"summary_large_image","twitter_creator":"@TDataScience","twitter_site":"@TDataScience","twitter_misc":{"Written by":"Petr Kor\u00e1b","Est. 
reading time":"14 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/#article","isPartOf":{"@id":"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/"},"author":{"name":"TDS Editors","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee"},"headline":"Topic Modelling in Business Intelligence: FASTopic and BERTopic in Code","datePublished":"2025-01-22T18:02:13+00:00","dateModified":"2025-02-03T14:08:17+00:00","mainEntityOfPage":{"@id":"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/"},"wordCount":1766,"publisher":{"@id":"https:\/\/towardsdatascience.com\/#organization"},"image":{"@id":"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/#primaryimage"},"thumbnailUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1BMLekG4e8DwNWMfQpRy4ag-scaled.jpeg","keywords":["Bertopic","Machine Learning","NLP","Text Mining","Topic Modelling"],"articleSection":["Machine Learning","Natural Language Processing"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/","url":"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/","name":"Topic Modelling in Business Intelligence: FASTopic and BERTopic in Code | Towards Data 
Science","isPartOf":{"@id":"https:\/\/towardsdatascience.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/#primaryimage"},"image":{"@id":"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/#primaryimage"},"thumbnailUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1BMLekG4e8DwNWMfQpRy4ag-scaled.jpeg","datePublished":"2025-01-22T18:02:13+00:00","dateModified":"2025-02-03T14:08:17+00:00","breadcrumb":{"@id":"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/#primaryimage","url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1BMLekG4e8DwNWMfQpRy4ag-scaled.jpeg","contentUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/01\/1BMLekG4e8DwNWMfQpRy4ag-scaled.jpeg","width":2560,"height":1658,"caption":"Source: Freepic, Image by rawpixel.com"},{"@type":"BreadcrumbList","@id":"https:\/\/towardsdatascience.com\/topic-modelling-in-business-intelligence-fastopic-and-bertopic-in-code-2d3949260a37\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/towardsdatascience.com\/"},{"@type":"ListItem","position":2,"name":"Topic Modelling in Business Intelligence: FASTopic and BERTopic in Code"}]},{"@type":"WebSite","@id":"https:\/\/towardsdatascience.com\/#website","url":"https:\/\/towardsdatascience.com\/","name":"Towards Data Science","description":"Publish AI, ML 
&amp; data-science insights to a global community of data professionals.","publisher":{"@id":"https:\/\/towardsdatascience.com\/#organization"},"alternateName":"TDS","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/towardsdatascience.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/towardsdatascience.com\/#organization","name":"Towards Data Science","alternateName":"TDS","url":"https:\/\/towardsdatascience.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/","url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg","contentUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg","width":696,"height":696,"caption":"Towards Data Science"},"image":{"@id":"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/TDataScience","https:\/\/www.youtube.com\/c\/TowardsDataScience","https:\/\/www.linkedin.com\/company\/towards-data-science\/"]},{"@type":"Person","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee","name":"TDS Editors","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/image\/23494c9101089ad44ae88ce9d2f56aac","url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","caption":"TDS Editors"},"description":"Building a vibrant data science and machine learning community. 
Share your insights and projects with our global audience: bit.ly\/write-for-tds","url":"https:\/\/towardsdatascience.com\/author\/towardsdatascience\/"}]}},"distributor_meta":false,"distributor_terms":false,"distributor_media":false,"distributor_original_site_name":"Towards Data Science","distributor_original_site_url":"https:\/\/towardsdatascience.com","push-errors":false,"_links":{"self":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts\/595815","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/users\/18"}],"replies":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/comments?post=595815"}],"version-history":[{"count":0,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts\/595815\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/media\/595816"}],"wp:attachment":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/media?parent=595815"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/categories?post=595815"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/tags?post=595815"},{"taxonomy":"sponsor","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/sponsor?post=595815"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/coauthors?post=595815"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}