{"id":130690,"date":"2024-04-01T12:25:04","date_gmt":"2024-04-01T12:25:04","guid":{"rendered":"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/"},"modified":"2025-03-05T07:19:20","modified_gmt":"2025-03-05T12:19:20","slug":"topic-modelling-with-berttopic-in-python-8a80d529de34","status":"publish","type":"post","link":"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/","title":{"rendered":"Topic Modelling with BERTtopic in Python"},"content":{"rendered":"<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"dee9f1\" data-has-transparency=\"false\" style=\"--dominant-color: #dee9f1;\" loading=\"lazy\" decoding=\"async\" width=\"2161\" height=\"1462\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/13Sqr6LH1agLPsrR4kOl5ZA.jpeg\" alt=\"Photo by Harryarts on Freepik\" class=\"wp-image-130691 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/13Sqr6LH1agLPsrR4kOl5ZA.jpeg 2161w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/13Sqr6LH1agLPsrR4kOl5ZA-300x203.jpeg 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/13Sqr6LH1agLPsrR4kOl5ZA-1024x693.jpeg 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/13Sqr6LH1agLPsrR4kOl5ZA-768x520.jpeg 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/13Sqr6LH1agLPsrR4kOl5ZA-1536x1039.jpeg 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/13Sqr6LH1agLPsrR4kOl5ZA-2048x1386.jpeg 2048w\" sizes=\"auto, (max-width: 2161px) 100vw, 2161px\" \/><figcaption class=\"wp-element-caption\">Photo by <a href=\"https:\/\/www.freepik.com\/author\/harryarts\">Harryarts<\/a> on <a href=\"https:\/\/www.freepik.com\/free-vector\/modern-molecules-background_1186441.htm#fromView=search&amp;page=2&amp;position=26&amp;uuid=5e3b0934-0d71-4782-bbf7-d1e739b45b27\">Freepik<\/a><\/figcaption><\/figure>\n<p 
class=\"wp-block-paragraph\"><em><strong>Topic modeling<\/strong><\/em> (i.e., topic identification in a corpus of text data) has developed quickly since the <em>Latent Dirichlet Allocation (LDA)<\/em> model <a href=\"https:\/\/www.jmlr.org\/papers\/volume3\/blei03a\/blei03a.pdf\">was published<\/a>. This classic topic model, however, does not capture the relationships between words well because it is based on the statistical concept of a <a href=\"https:\/\/miningthedetails.com\/LDA_Inference_Book\/word-representations.html\">bag of words<\/a>, which ignores word order and context. The more recent embedding-based <em><a href=\"https:\/\/github.com\/ddangelov\/Top2Vec\">Top2Vec<\/a><\/em> and <em>BERTopic<\/em> models address this drawback by using pre-trained language models to generate topics.<\/p>\n<p class=\"wp-block-paragraph\">In this article, we&#8217;ll use <a href=\"https:\/\/arxiv.org\/abs\/2203.05794\">Maarten Grootendorst&#8217;s (2022)<\/a> <strong>BERTopic<\/strong> to identify the terms representing topics in political speech transcripts. It outperforms most traditional and modern topic models on topic modeling metrics across various corpora and has been used in <a href=\"https:\/\/maartengr.github.io\/BERTopic\/usecases.html#intelligent-virtual-assistants\">companies<\/a>, academia (<a href=\"https:\/\/www.sciencedirect.com\/science\/article\/pii\/S2949719123000419\">Chagnon et al., 2024<\/a>), and the public sector. Using Python code, we&#8217;ll explore:<\/p>\n<ul class=\"wp-block-list\">\n<li>how to effectively preprocess data<\/li>\n<li>how to create a bigram topic model<\/li>\n<li>how to explore the most frequent terms over time.<\/li>\n<\/ul>\n<h2 class=\"wp-block-heading\">1. 
Example data<\/h2>\n<p class=\"wp-block-paragraph\">As an example dataset, we&#8217;ll use the <em><a href=\"https:\/\/www.kaggle.com\/datasets\/efatazher\/empoliticon-political-speeches-context-and-emotion\">Empoliticon: Political Speeches-Context &amp; Emotion dataset<\/a><\/em>, released under the <em>Attribution 4.0 International license,<\/em> as part of the <a href=\"https:\/\/ieeexplore.ieee.org\/stamp\/stamp.jsp?tp=&amp;arnumber=10141612\">Efat et al. (2023)<\/a> paper. It contains 2,010 transcripts of political speeches by the presidents and prime ministers of the USA, UK, China, and Russia. To keep the topic model focused, we use only the subset of 556 speeches by leaders from Russia:<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"ededed\" data-has-transparency=\"true\" style=\"--dominant-color: #ededed;\" loading=\"lazy\" decoding=\"async\" width=\"449\" height=\"380\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1LCFi3YGrb44iEOzOpsAJVQ.png\" alt=\"Source: Empoliticon: Political Speeches-Context &amp; Emotion dataset\" class=\"wp-image-523051 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1LCFi3YGrb44iEOzOpsAJVQ.png 449w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1LCFi3YGrb44iEOzOpsAJVQ-300x254.png 300w\" sizes=\"auto, (max-width: 449px) 100vw, 449px\" \/><figcaption class=\"wp-element-caption\">Source: Empoliticon: Political Speeches-Context &amp; Emotion dataset<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">2. Data pre-processing<\/h2>\n<p class=\"wp-block-paragraph\">Working with text datasets is complex. Cleaning alone involves several steps that should systematically remove all unnecessary information from the dataset. 
The full list of <a href=\"https:\/\/github.com\/PetrKorab\/Topic-Modelling-with-BERTtopic-in-Python\/blob\/main\/requirements.txt\">requirements<\/a> for this project is on GitHub.<\/p>\n<p class=\"wp-block-paragraph\"><em><strong>2.1. Fixing mojibake errors<\/strong><\/em><\/p>\n<p class=\"wp-block-paragraph\"><em>Mojibake<\/em> is a Japanese word for the garbled text that results from character-encoding errors. Here is an example:<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"ededed\" data-has-transparency=\"true\" style=\"--dominant-color: #ededed;\" loading=\"lazy\" decoding=\"async\" width=\"469\" height=\"98\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1E8ox-BHOf-_leq2FG9gCZQ.png\" alt=\"Mojibake example\" class=\"wp-image-523052 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1E8ox-BHOf-_leq2FG9gCZQ.png 469w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1E8ox-BHOf-_leq2FG9gCZQ-300x63.png 300w\" sizes=\"auto, (max-width: 469px) 100vw, 469px\" \/><figcaption class=\"wp-element-caption\">Mojibake example<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">It is best to fix these errors at the very beginning of the cleaning, before any other step touches the text. Correcting encoding-related errors is simple:<\/p>\n<div class=\"wp-block-tds-gist-embed\">\n\t<script src=\"https:\/\/gist.github.com\/PetrKorab\/5e42fc26362e392688263eb42eec4d2d.js\"><\/script>\n<\/div>\n\n<p class=\"wp-block-paragraph\"><em><strong>2.2. Cleaning special characters, punctuation, and numbers<\/strong><\/em><\/p>\n<p class=\"wp-block-paragraph\">This step should come right after fixing the encoding errors. The simplest way is to use the <a href=\"http:\/\/cleantext\">cleantext<\/a> library. Also, consider lower-casing. Does <em>&quot;labor&quot;<\/em> mean the same as <em>&quot;Labor&quot;<\/em> in the dataset? 
If it does, add a <code>lowercase<\/code> parameter and apply the cleaning function:<\/p>\n<div class=\"wp-block-tds-gist-embed\">\n\t<script src=\"https:\/\/gist.github.com\/PetrKorab\/385d1fc6362188bd342c5ae0ff0df57a.js\"><\/script>\n<\/div>\n\n<p class=\"wp-block-paragraph\"><em><strong>2.3. Define the stopwords removal strategy<\/strong><\/em><\/p>\n<p class=\"wp-block-paragraph\">Removing the standard list of stopwords is generally a necessary step. Depending on the project focus, it might also be useful to remove an additional list of stopwords that don&#8217;t bring any value. The timing matters, however. As <a href=\"https:\/\/maartengr.github.io\/BERTopic\/faq.html#how-do-i-reduce-topic-outliers\">BERTopic&#8217;s documentation<\/a> puts it:<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p><em>Removing stop words as a preprocessing step is not advised as the transformer-based embedding models that we use need the full context to create accurate embeddings.<\/em><\/p><\/blockquote>\n<p class=\"wp-block-paragraph\">Instead, we use <code>CountVectorizer<\/code> to remove stopwords from our documents <strong>after<\/strong> the embeddings have been generated, <strong>during<\/strong> the topic <strong>model generation.<\/strong><\/p>\n<h2 class=\"wp-block-heading\">3. Topic generation<\/h2>\n<p class=\"wp-block-paragraph\">With a cleaner dataset, we can now remove <em>English stopwords<\/em> along with a list of <em>additional stopwords<\/em>, generate a bigram topic model, and apply it to the data.<\/p>\n<div class=\"wp-block-tds-gist-embed\">\n\t<script src=\"https:\/\/gist.github.com\/PetrKorab\/7e6689c1fb2a16690f8cd70a95136ce0.js\"><\/script>\n<\/div>\n\n<p class=\"wp-block-paragraph\">Note that the <code>nr_topics<\/code> parameter is set to 7 to generate 6 topics. The remaining topic is used to hold the outliers.<\/p>\n<h2 class=\"wp-block-heading\">4. 
Topic visualization<\/h2>\n<p class=\"wp-block-paragraph\">In the next step, let&#8217;s visualize the topics in heatmaps to make the results easier to read. Here is the outcome:<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"d6e0e3\" data-has-transparency=\"true\" style=\"--dominant-color: #d6e0e3;\" loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"1300\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/14J5McRW5ivv0NvlzT5JFvg.png\" alt=\"Figure 1: Heatmaps with bigrams and their probabilities, Image by Author\" class=\"wp-image-523054 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/14J5McRW5ivv0NvlzT5JFvg.png 1400w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/14J5McRW5ivv0NvlzT5JFvg-300x279.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/14J5McRW5ivv0NvlzT5JFvg-1024x951.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/14J5McRW5ivv0NvlzT5JFvg-768x713.png 768w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><figcaption class=\"wp-element-caption\">Figure 1: Heatmaps with bigrams and their probabilities, Image by Author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">To produce it, we first extract the bigrams and their probabilities from the topic model and create a data frame for each of the 6 topics:<\/p>\n<div class=\"wp-block-tds-gist-embed\">\n\t<script src=\"https:\/\/gist.github.com\/PetrKorab\/59a9ee9fb6e9782a890617408cbd4a2f.js\"><\/script>\n<\/div>\n\n<p class=\"wp-block-paragraph\">Next, this code creates the heatmaps in Figure 1:<\/p>\n<div class=\"wp-block-tds-gist-embed\">\n\t<script src=\"https:\/\/gist.github.com\/PetrKorab\/90643177780797b1649a08ef3bc3566d.js\"><\/script>\n<\/div>\n\n<h2 class=\"wp-block-heading\">5. Token frequencies over time<\/h2>\n<p class=\"wp-block-paragraph\">Now, we&#8217;ll add a perspective on the development of bigrams over time. 
The goal is to see in which years each bigram appeared most often in the Russian leaders&#8217; speeches. The heatmap in Figure 2 displays the frequencies of the 5 most frequent bigrams for each year.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"f4f4f4\" data-has-transparency=\"true\" style=\"--dominant-color: #f4f4f4;\" loading=\"lazy\" decoding=\"async\" width=\"7918\" height=\"7194\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1ZfCNTRW7m3LbB9zaxiHvrQ.png\" alt=\"Figure 2: Heatmaps with bigrams and their frequencies by year, Image by Author\" class=\"wp-image-523056 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1ZfCNTRW7m3LbB9zaxiHvrQ.png 7918w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1ZfCNTRW7m3LbB9zaxiHvrQ-300x273.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1ZfCNTRW7m3LbB9zaxiHvrQ-1024x930.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1ZfCNTRW7m3LbB9zaxiHvrQ-768x698.png 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1ZfCNTRW7m3LbB9zaxiHvrQ-1536x1396.png 1536w\" sizes=\"auto, (max-width: 7918px) 100vw, 7918px\" \/><figcaption class=\"wp-element-caption\">Figure 2: Heatmaps with bigrams and their frequencies by year, Image by Author<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The <strong>Arabica<\/strong> library, which is forthcoming in the <strong><a href=\"https:\/\/joss.theoj.org\/\">Journal of Open Source Software<\/a><\/strong> (Kor\u00e1b &amp; Pom\u011bnkov\u00e1, 2024), was developed for this purpose.<\/p>\n<p class=\"wp-block-paragraph\"><em><strong>EDIT Jul 2024<\/strong>: Arabica has been updated. 
Check the <strong><a href=\"https:\/\/arabica.readthedocs.io\/en\/latest\/index.html\">documentation<\/a><\/strong> for the full list of parameters.<\/em><\/p>\n<p class=\"wp-block-paragraph\">Here is the code generating the heatmap in Figure 2:<\/p>\n<div class=\"wp-block-tds-gist-embed\">\n\t<script src=\"https:\/\/gist.github.com\/PetrKorab\/ba3e3d26532bfde675ce9ef7118c44de.js\"><\/script>\n<\/div>\n\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"b0aba4\" data-has-transparency=\"false\" style=\"--dominant-color: #b0aba4;\" loading=\"lazy\" decoding=\"async\" width=\"2560\" height=\"1709\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1XXD7PsD37A1OgZi5c0_ENA-scaled.jpeg\" alt=\"Image by rawpixel on Freepik\" class=\"wp-image-523058 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1XXD7PsD37A1OgZi5c0_ENA-scaled.jpeg 2560w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1XXD7PsD37A1OgZi5c0_ENA-300x200.jpeg 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1XXD7PsD37A1OgZi5c0_ENA-1024x683.jpeg 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1XXD7PsD37A1OgZi5c0_ENA-768x513.jpeg 768w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1XXD7PsD37A1OgZi5c0_ENA-1536x1025.jpeg 1536w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1XXD7PsD37A1OgZi5c0_ENA-2048x1367.jpeg 2048w\" sizes=\"auto, (max-width: 2560px) 100vw, 2560px\" \/><figcaption class=\"wp-element-caption\">Image by rawpixel on <a href=\"https:\/\/www.freepik.com\/free-photo\/improvement-summary-personal-development-workflow_18043786.htm#fromView=search&amp;page=1&amp;position=0&amp;uuid=532be1da-a5ce-4d13-8dcb-141609e40365\">Freepik<\/a><\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Conclusions<\/h2>\n<p class=\"wp-block-paragraph\">This article briefly introduced topic modeling with BERTopic. 
The model&#8217;s framework offers many extensions, fine-tuning options, and visualization methods (see <a href=\"https:\/\/maartengr.github.io\/BERTopic\/index.html\">the documentation<\/a>). Let&#8217;s summarize the key findings:<\/p>\n<ul class=\"wp-block-list\">\n<li>The topic model identifies 6 distinct topics: <em><strong>defense policy<\/strong><\/em> (topic 1), <em><strong>economic development<\/strong><\/em> (topic 2), <em><strong>WW2<\/strong><\/em> (topic 3), <em><strong>internal policies<\/strong><\/em> (topic 4), <em><strong>healthcare and demographics<\/strong><\/em> (topic 5), and <em><strong>education<\/strong><\/em> (topic 6).<\/li>\n<li>Combining BERTopic with Arabica, we can see that the <strong>foreign<\/strong> and <strong>defense policy<\/strong> topics (<em>&quot;armed forces&quot;, &quot;russian federation&quot;, &quot;law enforcement&quot;<\/em>) were discussed more frequently before 2012, while terms related to <strong>education<\/strong> and <strong>healthcare<\/strong> appear only rarely, especially after 2010.<\/li>\n<li>Arabica returns absolute frequencies, so the heatmap reflects the fact that the dataset contains more <strong>WW2, foreign,<\/strong> and <strong>defense policy<\/strong> terms. However, it&#8217;s difficult to interpret the results well without knowing the regional context.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/medium.com\/python-in-plain-english\/end-to-end-latent-dirichlet-allocation-in-python-ac7bf75cd9fc\">My previous article<\/a> briefly explains a simpler approach to topic modeling with <strong>LDA<\/strong>. The complete code for this tutorial is on my <a href=\"https:\/\/github.com\/PetrKorab\/Topic-Modelling-with-BERTtopic-in-Python\/blob\/main\/code.ipynb\">GitHub<\/a>.<\/p>\n<p class=\"wp-block-paragraph\"><em>If you enjoy my work, you can invite me <a href=\"https:\/\/www.buymeacoffee.com\/petrkorab\">for coffee<\/a> and support my writing. 
You can also subscribe to my <a href=\"https:\/\/medium.com\/subscribe\/@petrkorab\">email list<\/a> to get notified about my new articles. Thanks!<\/em><\/p>\n<h2 class=\"wp-block-heading\">References<\/h2>\n<p class=\"wp-block-paragraph\">[1] Blei, Ng, Jordan (2003). Latent Dirichlet Allocation. <em><a href=\"https:\/\/www.jmlr.org\/papers\/volume3\/blei03a\/blei03a.pdf\">Journal of Machine Learning Research<\/a><\/em> 3, pp. 993\u20131022.<\/p>\n<p class=\"wp-block-paragraph\">[2] Chagnon, Pandolfi, Donatelli, Ushizima (2024). Benchmarking topic models on scientific articles using BERTeley. <em><a href=\"https:\/\/www.sciencedirect.com\/science\/article\/pii\/S2949719123000419\">Natural Language Processing Journal 6.<\/a><\/em><\/p>\n<p class=\"wp-block-paragraph\">[3] Efat, Atiq, Abeed, Momin, Alam (2023). Empoliticon: NLP And ML Based Approach For Context And Emotion Classification Of Political Speeches From Transcripts. <em><a href=\"https:\/\/ieeexplore.ieee.org\/document\/10141612\">IEEE Access<\/a>,<\/em> vol. 11.<\/p>\n<p class=\"wp-block-paragraph\">[4] Grootendorst (2022). BERTopic: Neural Topic Modeling With A Class-Based TF-IDF Procedure. <em><a href=\"https:\/\/arxiv.org\/abs\/2203.05794\">arXiv:2203.05794.<\/a><\/em><\/p>\n<p class=\"wp-block-paragraph\">[5] Kor\u00e1b, Pom\u011bnkov\u00e1 (2024). Arabica: A Python package for exploratory analysis of text data. Journal of Open Source Software. 
<a href=\"https:\/\/doi.org\/10.5281\/zenodo.10866697\">https:\/\/doi.org\/10.5281\/zenodo.10866697<\/a>.<\/p>","protected":false},"excerpt":{"rendered":"<p>Hands-on tutorial on modeling political statements with a state-of-the-art transformer-based topic model<\/p>\n","protected":false},"author":18,"featured_media":130691,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"is_member_only":true,"sub_heading":"Hands-on tutorial on modeling political statements with a state-of-the-art transformer-based topic model","footnotes":""},"categories":[44],"tags":[11086,448,467,1604,512],"sponsor":[],"coauthors":[30697],"class_list":["post-130690","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science","tag-bertopic","tag-data-science","tag-python","tag-text-mining","tag-topic-modeling"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Topic Modelling with BERTtopic in Python | Towards Data Science<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Topic Modelling with BERTtopic in Python | Towards Data Science\" \/>\n<meta property=\"og:description\" content=\"Hands-on tutorial on modeling political statements with a state-of-the-art transformer-based topic model\" \/>\n<meta property=\"og:url\" content=\"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/\" \/>\n<meta property=\"og:site_name\" content=\"Towards Data Science\" \/>\n<meta property=\"article:published_time\" content=\"2024-04-01T12:25:04+00:00\" 
\/>\n<meta property=\"article:modified_time\" content=\"2025-03-05T12:19:20+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/13Sqr6LH1agLPsrR4kOl5ZA.jpeg\" \/>\n\t<meta property=\"og:image:width\" content=\"2161\" \/>\n\t<meta property=\"og:image:height\" content=\"1462\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Petr Kor\u00e1b\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@TDataScience\" \/>\n<meta name=\"twitter:site\" content=\"@TDataScience\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Petr Kor\u00e1b\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/\"},\"author\":{\"name\":\"TDS Editors\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee\"},\"headline\":\"Topic Modelling with BERTtopic in 
Python\",\"datePublished\":\"2024-04-01T12:25:04+00:00\",\"dateModified\":\"2025-03-05T12:19:20+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/\"},\"wordCount\":952,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/towardsdatascience.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/13Sqr6LH1agLPsrR4kOl5ZA.jpeg\",\"keywords\":[\"Bertopic\",\"Data Science\",\"Python\",\"Text Mining\",\"Topic Modeling\"],\"articleSection\":[\"Data Science\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/\",\"url\":\"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/\",\"name\":\"Topic Modelling with BERTtopic in Python | Towards Data 
Science\",\"isPartOf\":{\"@id\":\"https:\/\/towardsdatascience.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/13Sqr6LH1agLPsrR4kOl5ZA.jpeg\",\"datePublished\":\"2024-04-01T12:25:04+00:00\",\"dateModified\":\"2025-03-05T12:19:20+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/#primaryimage\",\"url\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/13Sqr6LH1agLPsrR4kOl5ZA.jpeg\",\"contentUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/13Sqr6LH1agLPsrR4kOl5ZA.jpeg\",\"width\":2161,\"height\":1462,\"caption\":\"Photo by Harryarts on Freepik\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/towardsdatascience.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Topic Modelling with BERTtopic in Python\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/towardsdatascience.com\/#website\",\"url\":\"https:\/\/towardsdatascience.com\/\",\"name\":\"Towards Data Science\",\"description\":\"Publish AI, ML &amp; data-science insights to a global community of data 
professionals.\",\"publisher\":{\"@id\":\"https:\/\/towardsdatascience.com\/#organization\"},\"alternateName\":\"TDS\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/towardsdatascience.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/towardsdatascience.com\/#organization\",\"name\":\"Towards Data Science\",\"alternateName\":\"TDS\",\"url\":\"https:\/\/towardsdatascience.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg\",\"contentUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg\",\"width\":696,\"height\":696,\"caption\":\"Towards Data Science\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/TDataScience\",\"https:\/\/www.youtube.com\/c\/TowardsDataScience\",\"https:\/\/www.linkedin.com\/company\/towards-data-science\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee\",\"name\":\"TDS Editors\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/image\/23494c9101089ad44ae88ce9d2f56aac\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g\",\"caption\":\"TDS Editors\"},\"description\":\"Building a vibrant data science and machine learning community. 
Share your insights and projects with our global audience: bit.ly\/write-for-tds\",\"url\":\"https:\/\/towardsdatascience.com\/author\/towardsdatascience\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Topic Modelling with BERTtopic in Python | Towards Data Science","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/","og_locale":"en_US","og_type":"article","og_title":"Topic Modelling with BERTtopic in Python | Towards Data Science","og_description":"Hands-on tutorial on modeling political statements with a state-of-the-art transformer-based topic model","og_url":"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/","og_site_name":"Towards Data Science","article_published_time":"2024-04-01T12:25:04+00:00","article_modified_time":"2025-03-05T12:19:20+00:00","og_image":[{"width":2161,"height":1462,"url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/13Sqr6LH1agLPsrR4kOl5ZA.jpeg","type":"image\/jpeg"}],"author":"Petr Kor\u00e1b","twitter_card":"summary_large_image","twitter_creator":"@TDataScience","twitter_site":"@TDataScience","twitter_misc":{"Written by":"Petr Kor\u00e1b","Est. 
reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/#article","isPartOf":{"@id":"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/"},"author":{"name":"TDS Editors","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee"},"headline":"Topic Modelling with BERTtopic in Python","datePublished":"2024-04-01T12:25:04+00:00","dateModified":"2025-03-05T12:19:20+00:00","mainEntityOfPage":{"@id":"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/"},"wordCount":952,"commentCount":0,"publisher":{"@id":"https:\/\/towardsdatascience.com\/#organization"},"image":{"@id":"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/#primaryimage"},"thumbnailUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/13Sqr6LH1agLPsrR4kOl5ZA.jpeg","keywords":["Bertopic","Data Science","Python","Text Mining","Topic Modeling"],"articleSection":["Data Science"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/","url":"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/","name":"Topic Modelling with BERTtopic in Python | Towards Data 
Science","isPartOf":{"@id":"https:\/\/towardsdatascience.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/#primaryimage"},"image":{"@id":"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/#primaryimage"},"thumbnailUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/13Sqr6LH1agLPsrR4kOl5ZA.jpeg","datePublished":"2024-04-01T12:25:04+00:00","dateModified":"2025-03-05T12:19:20+00:00","breadcrumb":{"@id":"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/#primaryimage","url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/13Sqr6LH1agLPsrR4kOl5ZA.jpeg","contentUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/13Sqr6LH1agLPsrR4kOl5ZA.jpeg","width":2161,"height":1462,"caption":"Photo by Harryarts on Freepik"},{"@type":"BreadcrumbList","@id":"https:\/\/towardsdatascience.com\/topic-modelling-with-berttopic-in-python-8a80d529de34\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/towardsdatascience.com\/"},{"@type":"ListItem","position":2,"name":"Topic Modelling with BERTtopic in Python"}]},{"@type":"WebSite","@id":"https:\/\/towardsdatascience.com\/#website","url":"https:\/\/towardsdatascience.com\/","name":"Towards Data Science","description":"Publish AI, ML &amp; data-science insights to a global community of data 
professionals.","publisher":{"@id":"https:\/\/towardsdatascience.com\/#organization"},"alternateName":"TDS","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/towardsdatascience.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/towardsdatascience.com\/#organization","name":"Towards Data Science","alternateName":"TDS","url":"https:\/\/towardsdatascience.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/","url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg","contentUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg","width":696,"height":696,"caption":"Towards Data Science"},"image":{"@id":"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/TDataScience","https:\/\/www.youtube.com\/c\/TowardsDataScience","https:\/\/www.linkedin.com\/company\/towards-data-science\/"]},{"@type":"Person","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee","name":"TDS Editors","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/image\/23494c9101089ad44ae88ce9d2f56aac","url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","caption":"TDS Editors"},"description":"Building a vibrant data science and machine learning community. 
Share your insights and projects with our global audience: bit.ly\/write-for-tds","url":"https:\/\/towardsdatascience.com\/author\/towardsdatascience\/"}]}},"distributor_meta":false,"distributor_terms":false,"distributor_media":false,"distributor_original_site_name":"Towards Data Science","distributor_original_site_url":"https:\/\/towardsdatascience.com","push-errors":false,"_links":{"self":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts\/130690","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/users\/18"}],"replies":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/comments?post=130690"}],"version-history":[{"count":0,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts\/130690\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/media\/130691"}],"wp:attachment":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/media?parent=130690"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/categories?post=130690"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/tags?post=130690"},{"taxonomy":"sponsor","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/sponsor?post=130690"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/coauthors?post=130690"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}