{"id":148646,"date":"2023-02-09T13:29:44","date_gmt":"2023-02-09T13:29:44","guid":{"rendered":"https:\/\/towardsdatascience.com\/text-data-pre-processing-for-time-series-models-162c0d01f5c5\/"},"modified":"2025-01-17T08:08:16","modified_gmt":"2025-01-17T08:08:16","slug":"text-data-pre-processing-for-time-series-models-162c0d01f5c5","status":"publish","type":"post","link":"https:\/\/towardsdatascience.com\/text-data-pre-processing-for-time-series-models-162c0d01f5c5\/","title":{"rendered":"Text Data Pre-processing for Time-Series Models"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Introduction<\/h2>\n<p class=\"wp-block-paragraph\">Text data offer qualitative information that can be quantified, aggregated, and used as a variable in time-series models. Simple methods of text data representation, such as <a href=\"https:\/\/machinelearningmastery.com\/how-to-one-hot-encode-sequence-data-in-python\/\">one-hot encoding<\/a> of categorical variables and word <a href=\"https:\/\/www.analyticsvidhya.com\/blog\/2021\/09\/what-are-n-grams-and-how-to-implement-them-in-python\/\">n-grams<\/a>, have been used since NLP&#8217;s early beginnings. Over time, more complex methods, including the <a href=\"https:\/\/machinelearningmastery.com\/gentle-introduction-bag-words-model\/\">Bag-of-words<\/a> model, found their way to represent text data for machine learning algorithms. Based on the distributional hypothesis formulated by Harris [1] and Firth [2], modern models such as Word-to-Vec [3] and [4], GloVe [5], and ELMo [6] use vector representation of words in their neural network architectures. 
Since computers process text as numerical vectors, vectorized text can be used as a variable in time-series econometric models.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>In this way, we can take qualitative information from text and use it to extend the possibilities of quantitative time-series models.<\/p><\/blockquote>\n<p class=\"wp-block-paragraph\">In this article, you&#8217;ll learn more about:<\/p>\n<ul class=\"wp-block-list\">\n<li>How to use qualitative information from text for quantitative modeling<\/li>\n<li>How to clean and represent text data for time-series models<\/li>\n<li>How to work efficiently with <strong>1 million rows of text data<\/strong><\/li>\n<li>An end-to-end coding example in Python.<\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">In our recent conference paper, we developed a structural plan for text-data pre-processing that can be applied in areas such as: (1) predicting exchange rates with sentiment from social networks, (2) predicting agricultural prices using public news data, and (3) demand prediction in various fields.<\/p>\n<h2 class=\"wp-block-heading\"><strong>1. Structural plan of text data representation<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">Let&#8217;s start with a plan. In the beginning, there is <strong>qualitative raw text<\/strong> data collected over time. In the end, we have empirical estimates with time-varying numerical vectors (= <strong>quantitative data<\/strong>). This diagram shows how we will proceed:<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"e6e7e9\" data-has-transparency=\"false\" style=\"--dominant-color: #e6e7e9;\" loading=\"lazy\" decoding=\"async\" width=\"659\" height=\"997\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/02\/1wD3IaezQHd8JarhLvPpUbQ.png\" alt=\"Figure 1. Structural plan of text data representation. 
Source: Pom\u011bnkov\u00e1 et al., submitted to MAREW 2023.\" class=\"wp-image-201928 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/02\/1wD3IaezQHd8JarhLvPpUbQ.png 659w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/02\/1wD3IaezQHd8JarhLvPpUbQ-198x300.png 198w\" sizes=\"auto, (max-width: 659px) 100vw, 659px\" \/><figcaption class=\"wp-element-caption\">Figure 1. Structural plan of text data representation. Source: Pom\u011bnkov\u00e1 et al., submitted to <a href=\"https:\/\/www.marew.cz\/\">MAREW 2023<\/a>.<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">2. Empirical example in Python<\/h2>\n<p class=\"wp-block-paragraph\">Let&#8217;s illustrate the coding on the <strong><a href=\"https:\/\/www.kaggle.com\/datasets\/rmisra\/news-category-dataset\">News Category Dataset<\/a><\/strong> compiled by Rishabh Misra [8], [9] and released under the <a href=\"https:\/\/creativecommons.org\/licenses\/by\/4.0\/\">Attribution 4.0 International<\/a> license. The data contains news headlines published between 2012 and 2022 on huffpost.com. It was replicated to reach a 1-million-row dataset.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>The primary aim is to construct a time series at monthly frequency from news headlines reflecting public sentiment.<\/p><\/blockquote>\n<p class=\"wp-block-paragraph\">The dataset contains 1 million headlines. Because of its size, I used the <a href=\"https:\/\/pypi.org\/project\/polars\/\">Polars<\/a> library, which makes dataframe operations much faster. Compared to the mainstream Pandas, it handles large data files highly efficiently. 
On top of that, the code was run in Google Colab with a GPU hardware accelerator.<\/p>\n<p class=\"wp-block-paragraph\">The Python code is <a href=\"https:\/\/github.com\/PetrKorab\/Text-Data-Pre-processing-for-Time-series-Models\">here<\/a>, and the data looks like this:<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"e9e9e9\" data-has-transparency=\"false\" style=\"--dominant-color: #e9e9e9;\" loading=\"lazy\" decoding=\"async\" width=\"956\" height=\"269\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/02\/14JRHzqIwgmuGH3dxoRIXqg.png\" alt=\"Figure 2. News Category Dataset\" class=\"wp-image-201929 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/02\/14JRHzqIwgmuGH3dxoRIXqg.png 956w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/02\/14JRHzqIwgmuGH3dxoRIXqg-300x84.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/02\/14JRHzqIwgmuGH3dxoRIXqg-768x216.png 768w\" sizes=\"auto, (max-width: 956px) 100vw, 956px\" \/><figcaption class=\"wp-element-caption\">Figure 2. News Category Dataset<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">2.1. Text data pre-processing<\/h2>\n<p class=\"wp-block-paragraph\">The purpose of text data pre-processing is to remove all redundant information that might bias the analysis or lead to an incorrect interpretation of the results. 
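<p class=\"wp-block-paragraph\">To make &#8220;redundant information&#8221; concrete, here is a standard-library sketch of the kind of cleaning applied below (the stopword set is a tiny illustrative subset, not a full English list):<\/p>

```python
import re
import string

# tiny illustrative subset of English stopwords, not a full list
STOPWORDS = {"the", "a", "an", "is", "on", "of", "and", "to", "by"}

def clean_text(text: str) -> str:
    text = text.lower()                                               # lowercase
    text = re.sub(r"\d+", "", text)                                   # drop numbers
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    words = [w for w in text.split() if w not in STOPWORDS]           # drop stopwords
    return " ".join(words)                                            # collapse extra spaces

print(clean_text("The Fed Raises Rates by 0.75% on Tuesday!"))
```

The cleantext library used below bundles these same steps behind a single function call.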
We&#8217;ll remove <strong>punctuation<\/strong>, <strong>numbers<\/strong>, <strong>extra spaces<\/strong>, and English <strong>stopwords<\/strong> (common words with little or no information value), and we&#8217;ll <strong>lowercase<\/strong> the text.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>Probably the simplest and most efficient way of cleaning text data in Python is with the <a href=\"https:\/\/pypi.org\/project\/cleantext\/\">cleantext<\/a> library.<\/p><\/blockquote>\n<p class=\"wp-block-paragraph\">First, define a cleaning function to perform the cleaning operations:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from cleantext import clean\n\ndef preprocess(text):\n    output = clean(str(text), punct=True,\n                              extra_spaces=True,\n                              stopwords=True,\n                              lowercase=True,\n                              numbers=True)\n    return output<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Next, we clean the 1-million-row dataset with Polars:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># apply the cleaning function to each headline\ndata_clean = data.with_columns([\n    pl.col(&quot;headline&quot;).apply(preprocess)\n])<\/code><\/pre>\n<p class=\"wp-block-paragraph\">The clean dataset contains text with maximum informational value for further steps. Any unnecessary strings and digits reduce the accuracy of the final empirical modeling.<\/p>\n<h2 class=\"wp-block-heading\">2.2. Text data representation<\/h2>\n<p class=\"wp-block-paragraph\"><strong>Data representation<\/strong> involves methods used to represent data in a computer. Since computers work with numbers, we select an appropriate model to vectorize the text dataset.<\/p>\n<p class=\"wp-block-paragraph\">In our project, we are constructing a time series of sentiment. 
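<p class=\"wp-block-paragraph\">The core idea of lexicon-based sentiment scoring can be sketched in a few lines of plain Python (the word scores below are invented for illustration and are not VADER&#8217;s actual lexicon values):<\/p>

```python
# toy sentiment lexicon; the scores are invented for illustration,
# NOT taken from VADER's real lexicon
LEXICON = {"rally": 0.6, "beats": 0.5, "floods": -0.4, "crisis": -0.7}

def toy_sentiment(sentence: str) -> float:
    words = sentence.lower().split()
    scores = [LEXICON[w] for w in words if w in LEXICON]
    # average the matched word scores; 0.0 when no word is in the lexicon
    return round(sum(scores) / len(scores), 2) if scores else 0.0

print(toy_sentiment("Stocks Rally After Crisis Talks"))
```

A real classifier refines this with valence weighting, negation, and intensity handling, which is what VADER adds out of the box.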
For this use case, the pre-trained sentiment classifier <strong>VADER<\/strong> (Valence Aware Dictionary and Sentiment Reasoner) is a good choice. Read my <a href=\"https:\/\/medium.com\/towards-data-science\/the-most-favorable-pre-trained-sentiment-classifiers-in-python-9107c06442c6\">previous article<\/a> to learn more about this classifier, along with some other alternatives.<\/p>\n<p class=\"wp-block-paragraph\">Classification with the <a href=\"https:\/\/pypi.org\/project\/vaderSentiment\/\">vaderSentiment<\/a> library looks as follows. First, create the function for classification:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer\n\n# create a SentimentIntensityAnalyzer object once, not per row\nsid_obj = SentimentIntensityAnalyzer()\n\n# calculate the compound score\ndef sentiment_vader(sentence):\n\n    sentiment_dict = sid_obj.polarity_scores(sentence)\n\n    # return the overall (compound) indicator\n    compound = sentiment_dict[&#039;compound&#039;]\n\n    return compound<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Next, apply the function to the time-series dataset:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># apply the function with Polars\n\nsentiment = data_clean.with_columns([\n    pl.col(&quot;headline&quot;).apply(sentiment_vader)\n])<\/code><\/pre>\n<p class=\"wp-block-paragraph\">Here is what the result looks like:<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"e8e8e8\" data-has-transparency=\"false\" style=\"--dominant-color: #e8e8e8;\" loading=\"lazy\" decoding=\"async\" width=\"232\" height=\"347\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/02\/1MPWnwTqjmHtsYcZjGtYsRw.png\" alt=\"Figure 3. 
Sentiment evaluation\" class=\"wp-image-201930 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/02\/1MPWnwTqjmHtsYcZjGtYsRw.png 232w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/02\/1MPWnwTqjmHtsYcZjGtYsRw-201x300.png 201w\" sizes=\"auto, (max-width: 232px) 100vw, 232px\" \/><figcaption class=\"wp-element-caption\">Figure 3. Sentiment evaluation<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The <em>headline<\/em> column now contains the compound sentiment score on a [-1, 1] scale, reflecting the prevalent emotional content of each headline.<\/p>\n<h2 class=\"wp-block-heading\">2.3. Time-series representation<\/h2>\n<p class=\"wp-block-paragraph\">The next step in time-series text data representation involves extending the data matrix with a time dimension. It can be achieved by (a) aggregating data along a time axis and (b) selecting a method implementing time-series text data representation. In the case of our data, we&#8217;ll do the former and aggregate sentiment from each row at monthly frequency.<\/p>\n<p class=\"wp-block-paragraph\">This code averages sentiment by month and prepares the monthly time series:<\/p>\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># aggregate over months\n\ntimeseries = (sentiment.lazy()\n    .groupby(&quot;date_monthly&quot;)\n    .agg(\n        [\n            pl.col(&quot;headline&quot;).mean()\n        ]\n    ).sort(&quot;date_monthly&quot;)\n).collect()<\/code><\/pre>\n<h2 class=\"wp-block-heading\">2.4. Quantitative modeling<\/h2>\n<p class=\"wp-block-paragraph\">The final step is to use the time series for modeling. To show an example, in our recent conference paper, we similarly extracted the sentiment from headlines of research articles published in the top 5 economic journals. 
Then, we used rolling correlations with a 5-year window and looked at how sentiment relates to GDP and other global economic indicators (see Figure 4).<\/p>\n<p class=\"wp-block-paragraph\">We hypothesized that sentiment correlates with the macroeconomic environment during periods of sharp recessions and inflation shocks. Except for one specific journal, the results support these considerations for the Oil Shocks of the 1970s, which led to a steep recession accompanied by a massive inflation spike.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"f3f0ee\" data-has-transparency=\"false\" style=\"--dominant-color: #f3f0ee;\" loading=\"lazy\" decoding=\"async\" width=\"918\" height=\"767\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/02\/1dEn5AmgP5iZiUgw8pdTdAg.png\" alt=\"Figure 4. Rolling correlations of sentiment and GDP. Source: Pom\u011bnkov\u00e1 et al., submitted to MAREW 2023.\" class=\"wp-image-201931 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/02\/1dEn5AmgP5iZiUgw8pdTdAg.png 918w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/02\/1dEn5AmgP5iZiUgw8pdTdAg-300x251.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/02\/1dEn5AmgP5iZiUgw8pdTdAg-768x642.png 768w\" sizes=\"auto, (max-width: 918px) 100vw, 918px\" \/><figcaption class=\"wp-element-caption\">Figure 4. Rolling correlations of sentiment and GDP. Source: Pom\u011bnkov\u00e1 et al., submitted to <a href=\"https:\/\/www.marew.cz\/\">MAREW 2023<\/a>.<\/figcaption><\/figure>\n<h2 class=\"wp-block-heading\">Conclusions<\/h2>\n<p class=\"wp-block-paragraph\">In this article, we have constructed a monthly time series of sentiment from 1 million rows of text data. 
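<p class=\"wp-block-paragraph\">As a postscript, the rolling-correlation step from section 2.4 can be reduced to a short standard-library sketch (toy series and a 3-point window instead of a 5-year window, purely for illustration):<\/p>

```python
# toy monthly sentiment and GDP-growth series (invented numbers)
sentiment = [0.1, 0.3, -0.2, -0.4, 0.0, 0.2]
gdp = [1.2, 1.5, 0.4, -0.1, 0.6, 1.0]

def pearson(xs, ys):
    # plain Pearson correlation coefficient
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

window = 3  # stands in for the paper's 5-year window
rolling = [
    round(pearson(sentiment[i:i + window], gdp[i:i + window]), 2)
    for i in range(len(sentiment) - window + 1)
]
print(rolling)
```

Each element of <code>rolling<\/code> is the correlation over one window, so the list itself is the time-varying correlation series plotted in Figure 4.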
The key points are:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Qualitative information can extend the capacities of quantitative time-series models<\/strong><\/li>\n<li><strong>The Polars library makes large-scale text-data pre-processing feasible in Python<\/strong><\/li>\n<li><strong>Cloud services such as Google Colab make the processing of extensive text datasets even faster.<\/strong><\/li>\n<\/ul>\n<p class=\"wp-block-paragraph\">The complete code in this tutorial is on my <a href=\"https:\/\/github.com\/PetrKorab\/Text-Data-Pre-processing-for-Time-series-Models\">GitHub<\/a>. The recommended reading is <em><a href=\"https:\/\/medium.com\/towards-data-science\/the-most-favorable-pre-trained-sentiment-classifiers-in-python-9107c06442c6\">The Most Favorable Pre-trained Sentiment Classifiers in Python<\/a>.<\/em><\/p>\n<p class=\"wp-block-paragraph\"><em>Did you like the article? You can invite me <a href=\"https:\/\/www.buymeacoffee.com\/petrkorab\">for coffee<\/a> and support my writing. You can also subscribe to my <a href=\"https:\/\/medium.com\/subscribe\/@petrkorab\">email list<\/a> to get notified about my new articles. Thanks!<\/em><\/p>\n<h2 class=\"wp-block-heading\"><strong>References<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">[1] Z. Harris. 1954. Distributional structure. <em>Word<\/em>, vol. 10, no. 23, pp. 146\u2013162.<\/p>\n<p class=\"wp-block-paragraph\">[2] J. R. Firth. 1957. A synopsis of linguistic theory 1930\u20131955. In Studies in Linguistic Analysis, pp. 1\u201332. Oxford: Philological Society. Reprinted in F.R. Palmer (ed.), Selected Papers of J.R. Firth 1952\u20131959, London: Longman 1968.<\/p>\n<p class=\"wp-block-paragraph\">[3] T. Mikolov, K. Chen, G. S. Corrado and J. Dean. 2013. Efficient estimation of word representations in vector space. International Conference on Learning Representations.<\/p>\n<p class=\"wp-block-paragraph\">[4] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado and J. 
Dean. 2013. Distributed representations of words and phrases and their compositionality. <em>Advances in Neural Information Processing Systems<\/em>, vol. 26 (NIPS 2013).<\/p>\n<p class=\"wp-block-paragraph\">[5] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee and L. Zettlemoyer. 2018. Deep contextualized word representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1.<\/p>\n<p class=\"wp-block-paragraph\">[6] J. Pennington, R. Socher and C. D. Manning. 2014. GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).<\/p>\n<p class=\"wp-block-paragraph\">[7] Pom\u011bnkov\u00e1, J., Kor\u00e1b, P., \u0160trba, D. Text Data Pre-processing for Time-series Modelling. Submitted to <a href=\"https:\/\/www.marew.cz\/\">MAREW 2023<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">[8] Misra, Rishabh. &quot;News Category Dataset.&quot; arXiv preprint arXiv:2209.11429 (2022).<\/p>\n<p class=\"wp-block-paragraph\">[9] Misra, Rishabh and Jigyasa Grover. 
&quot;Sculpting Data for ML: The first act of Machine Learning.&quot; ISBN 9798585463570 (2021).<\/p>","protected":false},"excerpt":{"rendered":"<p>Have you ever thought about how sentiment from text data can be used as a regressor in time-series models?<\/p>\n","protected":false},"author":18,"featured_media":148647,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"is_member_only":true,"sub_heading":"Have you ever thought about how sentiment from text data can be used as a regressor in time-series models?","footnotes":""},"categories":[],"tags":[1328,467,1604,461],"sponsor":[],"coauthors":[30697],"class_list":["post-148646","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","tag-data-processing","tag-python","tag-text-mining","tag-time-series-analysis"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Text Data Pre-processing for Time-Series Models | Towards Data Science<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/towardsdatascience.com\/text-data-pre-processing-for-time-series-models-162c0d01f5c5\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Text Data Pre-processing for Time-Series Models | Towards Data Science\" \/>\n<meta property=\"og:description\" content=\"Have you ever thought about how sentiment from text data can be used as a regressor in time-series models?\" \/>\n<meta property=\"og:url\" content=\"https:\/\/towardsdatascience.com\/text-data-pre-processing-for-time-series-models-162c0d01f5c5\/\" \/>\n<meta property=\"og:site_name\" content=\"Towards Data Science\" \/>\n<meta property=\"article:published_time\" content=\"2023-02-09T13:29:44+00:00\" \/>\n<meta 
property=\"article:modified_time\" content=\"2025-01-17T08:08:16+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/02\/03rWRnNWiXaycbaWy-scaled.jpg\" \/>\n\t<meta property=\"og:image:width\" content=\"2560\" \/>\n\t<meta property=\"og:image:height\" content=\"1707\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"Petr Kor\u00e1b\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@TDataScience\" \/>\n<meta name=\"twitter:site\" content=\"@TDataScience\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Petr Kor\u00e1b\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/towardsdatascience.com\/text-data-pre-processing-for-time-series-models-162c0d01f5c5\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/towardsdatascience.com\/text-data-pre-processing-for-time-series-models-162c0d01f5c5\/\"},\"author\":{\"name\":\"TDS Editors\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee\"},\"headline\":\"Text Data Pre-processing for Time-Series 
Models\",\"datePublished\":\"2023-02-09T13:29:44+00:00\",\"dateModified\":\"2025-01-17T08:08:16+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/towardsdatascience.com\/text-data-pre-processing-for-time-series-models-162c0d01f5c5\/\"},\"wordCount\":1156,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/towardsdatascience.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/text-data-pre-processing-for-time-series-models-162c0d01f5c5\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/02\/03rWRnNWiXaycbaWy-scaled.jpg\",\"keywords\":[\"Data Processing\",\"Python\",\"Text Mining\",\"Time Series Analysis\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/towardsdatascience.com\/text-data-pre-processing-for-time-series-models-162c0d01f5c5\/\",\"url\":\"https:\/\/towardsdatascience.com\/text-data-pre-processing-for-time-series-models-162c0d01f5c5\/\",\"name\":\"Text Data Pre-processing for Time-Series Models | Towards Data 
Science\",\"isPartOf\":{\"@id\":\"https:\/\/towardsdatascience.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/towardsdatascience.com\/text-data-pre-processing-for-time-series-models-162c0d01f5c5\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/text-data-pre-processing-for-time-series-models-162c0d01f5c5\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/02\/03rWRnNWiXaycbaWy-scaled.jpg\",\"datePublished\":\"2023-02-09T13:29:44+00:00\",\"dateModified\":\"2025-01-17T08:08:16+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/towardsdatascience.com\/text-data-pre-processing-for-time-series-models-162c0d01f5c5\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/towardsdatascience.com\/text-data-pre-processing-for-time-series-models-162c0d01f5c5\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/text-data-pre-processing-for-time-series-models-162c0d01f5c5\/#primaryimage\",\"url\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/02\/03rWRnNWiXaycbaWy-scaled.jpg\",\"contentUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2023\/02\/03rWRnNWiXaycbaWy-scaled.jpg\",\"width\":2560,\"height\":1707,\"caption\":\"Photo by Kaleidico on Unsplash\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/towardsdatascience.com\/text-data-pre-processing-for-time-series-models-162c0d01f5c5\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/towardsdatascience.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Text Data Pre-processing for Time-Series Models\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/towardsdatascience.com\/#website\",\"url\":\"https:\/\/towardsdatascience.com\/\",\"name\":\"Towards Data Science\",\"description\":\"Publish AI, ML &amp; data-science insights to a global community of data 
professionals.\",\"publisher\":{\"@id\":\"https:\/\/towardsdatascience.com\/#organization\"},\"alternateName\":\"TDS\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/towardsdatascience.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/towardsdatascience.com\/#organization\",\"name\":\"Towards Data Science\",\"alternateName\":\"TDS\",\"url\":\"https:\/\/towardsdatascience.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg\",\"contentUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg\",\"width\":696,\"height\":696,\"caption\":\"Towards Data Science\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/TDataScience\",\"https:\/\/www.youtube.com\/c\/TowardsDataScience\",\"https:\/\/www.linkedin.com\/company\/towards-data-science\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee\",\"name\":\"TDS Editors\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/image\/23494c9101089ad44ae88ce9d2f56aac\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g\",\"caption\":\"TDS Editors\"},\"description\":\"Building a vibrant data science and machine learning community. 
Share your insights and projects with our global audience: bit.ly\/write-for-tds\",\"url\":\"https:\/\/towardsdatascience.com\/author\/towardsdatascience\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","distributor_meta":false,"distributor_terms":false,"distributor_media":false,"distributor_original_site_name":"Towards Data Science","distributor_original_site_url":"https:\/\/towardsdatascience.com","push-errors":false}