
Text as Time Series: Arabica 1.0 Brings New Features for Exploratory Text Data Analysis

Arabica 1.0 improves time series text data analysis with an extended set of features

Photo by Sincerely Media on Unsplash

Introduction

In the real world, text data is frequently collected as time series. Companies collect product reviews while the quality of their products changes over time; politicians’ public statements vary over the political cycle; and announcements are nowadays one of the main channels through which central banks affect financial markets. For these reasons, text data often has a time dimension and is recorded with a date/time column.

Exploratory data analysis (EDA) of these datasets is not a trivial coding exercise. And here comes Arabica – to make things simpler. This article covers the following:

  • new aggregation and cleaning features in Arabica 1.0
  • real-world applications of time-series text data analysis

We will show what’s new in Arabica 1.0 using Twitter data of 2016 US presidential election tweets from data.world. The dataset is licensed under the Public Domain license.

EDIT Jan 2023: Arabica has been updated. Check the documentation for the full list of parameters.

What is Arabica?

Arabica is a Python library for exploratory data analysis specifically designed for time series text data. My previous Towards Data Science article provides an elementary introduction; I recommend reading it first as a refresher before proceeding with this article.

It can be installed with pip:

pip install arabica

Arabica takes text and time as the input, enables standard cleaning operations (numbers, punctuation, and stopword removal), and provides unigram, bigram, and trigram frequencies over a selected period.

Figure 1: Scheme of arabica_freq method

As input text, it works with languages based on the Latin alphabet and enables stopword removal for languages in the nltk corpus of stopwords.

For time specification, Arabica currently reads dates in standard date and datetime formats and provides aggregated n-grams in yearly, monthly, or daily frequencies. The standard formats are, e.g., 2013-12-31, 2013/12/31, 31-Dec-2013, 2013-12-31 11:46:17.

Be careful about the European vs. US date format difference. It is recommended to use the US-style dates (MM/DD/YYYY) rather than the European style (DD/MM/YYYY) since there might be a mismatch between months and days in small datasets.
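To see why this ambiguity matters, consider the same date string parsed under both conventions. This is a plain-Python illustration using the standard library, independent of Arabica’s internal parsing:

```python
from datetime import datetime

s = "03/04/2016"  # ambiguous: March 4 (US style) or April 3 (European style)?

us_style = datetime.strptime(s, "%m/%d/%Y")        # MM/DD/YYYY
european_style = datetime.strptime(s, "%d/%m/%Y")  # DD/MM/YYYY

print(us_style.date())        # 2016-03-04
print(european_style.date())  # 2016-04-03
```

The same string yields two different dates, which is why sticking to one convention across the dataset matters, especially when the data is too small for the mismatch to surface as a parsing error.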

Arabica now has a visualization module providing word clouds, heatmaps, and line plots for unigram, bigram, and trigram frequencies. See this tutorial for more examples.

Twitter data use case

Let’s see how it works in the code. We’ll first import arabica and pandas and read the data.

import pandas as pd
from arabica import arabica_freq

data = pd.read_csv('tweets.csv')  # illustrative path; use your local copy of the dataset

The subset of data we’ll be using contains President Donald Trump’s tweets during the last weeks of his presidential campaign and the first couple of weeks after he took office. Here is what the data looks like:

Figure 2: Twitter data used as an example dataset

Extended aggregation options

Time-series n-gram analysis

We can see the data is pretty raw. It contains lots of numbers, special characters, punctuation, and strings we don’t need. We will also skip variations of politicians’ names and other strings specific to this dataset. The text is also made lowercase so that capital letters don’t affect n-gram calculations (e.g., "Tree" is not treated differently from "tree").

In Arabica, it is all straightforward:

arabica_freq(text = data['tweet_text'],         # Text
             time = data['created_at'],         # Time
             time_freq = 'D',                   # Aggregation period
             max_words = 2,                     # Max number of n-grams to be displayed
             stopwords = ['english'],           # Language for stop words
             skip = ['realdonaldtrump','trump', # Remove additional strings
                     'makehillarybrokeagain',
                     'hillaryclinton',
                     'barackobama',
                     'hillary', 'clintons',
                     'wwmorgan','jordanla',
                     'whoisNeil','cattmaxx',
                     'vnkbgygv', 'httpstco',
                     'httpstcob', 'httpstcod',
                     'nrclwvhx','httpstcoj',
                     'httpstcossh'],
             lower_case = True,                 # Lowercase text before cleaning and frequency analysis
             numbers = True,                    # Remove all digits
             punct = True)                      # Remove punctuation
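For intuition, the cleaning steps described above (lowercasing, digit and punctuation removal, stopword and custom-string skipping) can be approximated with the standard library. This is an illustrative sketch, not Arabica’s actual implementation:

```python
import re
import string

def clean(text, stopwords, skip):
    """Rough approximation of the cleaning pipeline."""
    text = text.lower()                             # lower_case = True
    text = re.sub(r"\d+", "", text)                 # numbers = True
    text = text.translate(                          # punct = True
        str.maketrans("", "", string.punctuation))
    drop = set(stopwords) | {s.lower() for s in skip}
    return [t for t in text.split() if t not in drop]

tokens = clean("MAKE America Great Again! @realDonaldTrump 2016",
               stopwords={"a", "the"}, skip=["realdonaldtrump"])
print(tokens)  # ['make', 'america', 'great', 'again']
```

Arabica then counts n-grams over the cleaned tokens within each aggregation period.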

The n-gram frequencies, especially bigrams and trigrams, tell us more about President Trump’s public communication after he took over the presidency. It is better to increase the value of the max_words parameter (say, 4 or 5) to explore it more deeply.

Figure 3: Arabica_freq output – daily aggregation

Descriptive n-gram analysis

In certain situations, we need to make a simple EDA of the text dataset as the first step before developing more complex analyses. In the code, setting the parameter time_freq = 'ungroup' means that Arabica calculates the unigram, bigram, and trigram frequencies of the whole data, and no time aggregation is made.

Running this code, we create a result dataframe that can be later saved as a CSV file:

result = arabica_freq(text = data['tweet_text'],         # Text
                      time = data['created_at'],         # Time
                      time_freq = 'ungroup',             # No time aggregation made
                      max_words = 6,                     # Shows 6 most frequent n-grams
                      stopwords = ['english'],           # Language for stop words
                      skip = ['realdonaldtrump','trump', # Strings to remove
                              'makehillarybrokeagain',
                              'hillaryclinton',
                              'barackobama',
                              'hillary', 'clintons',
                              'wwmorgan','jordanla',
                              'whoisNeil','cattmaxx',
                              'vnkbgygv', 'httpstco',
                              'httpstcob', 'httpstcod',
                              'nrclwvhx','httpstcoj',
                              'httpstcossh'],
                      lower_case = True,                 # Lowercase text
                      numbers = True,                    # Remove all digits
                      punct = True)                      # Remove all punctuation

Here is the output table:

Figure 4: Arabica_freq output – no time aggregation

Extended cleaning options

In Arabica 1.0, we can remove multiple sets of stopwords at once. This is useful for mining text from countries with more than one official language (Canada, Switzerland, Belgium, Luxembourg, etc.) or from countries with large foreign communities.

To remove English, French, and German stopwords from the text, modify the code in this way:

arabica_freq(text = data['tweet_text'],
             time = data['created_at'],
             time_freq = 'D', 
             max_words = 2, 
             stopwords = ['english','french','german'],
             skip = None,
             lower_case = True,
             numbers = True,
             punct = True) 

Photo by Geralt on Pixabay

Real-life applications

Let’s mention several examples where Arabica might be helpful. In Marketing, businesses use market positioning methods to influence consumer perception regarding a brand or product relative to competitors. The objective here is to establish the image or identity of a company or product so that consumers perceive it in a certain way.

Product and site reviews provide excellent data to evaluate positioning strategies. Did we receive bigrams such as "fast delivery," "excellent service," "great service," "quick delivery," or "good prices" with high frequency? Does it align with how we want our customers to perceive our product? Does it change over time?
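As a minimal sketch of the idea, with made-up review snippets and plain Python instead of Arabica, counting bigrams across reviews looks like this:

```python
from collections import Counter

reviews = [  # hypothetical product reviews
    "fast delivery and excellent service",
    "excellent service and good prices",
    "fast delivery, good prices",
]

def bigrams(text):
    """Extract adjacent word pairs from one review."""
    words = text.lower().replace(",", "").split()
    return list(zip(words, words[1:]))

counts = Counter(bg for review in reviews for bg in bigrams(review))
print(counts.most_common(3))  # top bigrams, e.g. ('fast', 'delivery') with count 2
```

Tracking how such counts shift between periods, which is exactly what arabica_freq automates, shows whether customer perception drifts away from the intended positioning.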

In Political Science and Political Economy, analyzing public discourse with content analysis methods is widespread (e.g., Saraisky, 2015). The essential topics we might focus on here are, e.g., the consequences of fake news and populism, public attitudes towards specific problems (immigration, weapons possession, etc.), and case studies of individual politicians or political parties. Twitter tweets provide a good data source for these investigations.

These are not the only issues that can be studied on time-series text datasets. You may find many other use cases where text has a time-series character; some of them are developed in Arabica’s documentation.


EDIT: Arabica now has a visualization module to display text as a heatmap, word cloud, and line plot, and a sentiment and structural breaks analytical module. Read more in tutorials for visualization, sentiment analysis, meta-data analysis, and customer satisfaction measurement.

Did you like the article? You can invite me for coffee and support my writing. You can also subscribe to my email list to get notified about my new articles. Thanks!

References:

[1] Saraisky, N.G. 2015. Analyzing Public Discourse: Using Media Content Analysis to Understand the Policy Process. _Current Issues in Comparative Education_ 18(1), 26–41.

