
Introduction
In the real world, text data is frequently collected as time series. Some examples: companies collect product reviews over periods in which the quality of their products may change; politicians' public statements vary over the political cycle; and central bankers' announcements are one of the main channels through which central banks affect financial markets nowadays. For these reasons, text data often has a time dimension and is recorded with a date/time column.
Exploratory data analysis (EDA) of these datasets is not a trivial coding exercise. This is where Arabica comes in to make things simpler. This article covers the following:
- new aggregation and cleaning features in Arabica 1.0
- real-world applications of time-series text data analysis
We will show what’s new in Arabica 1.0 using a dataset of 2016 US presidential election tweets from data.world. The dataset is released under the Public Domain license.
EDIT Jan 2023: Arabica has been updated. Check the documentation for the full list of parameters.
What is Arabica?
Arabica is a Python library for exploratory data analysis specifically designed for time-series text data. My previous Towards Data Science article provides an elementary introduction; I recommend reading it as a refresher before proceeding.
It can be installed with pip:
pip install arabica
Arabica takes text and time as inputs, enables standard cleaning operations (numbers, punctuation, and stopwords removal), and provides unigram, bigram, and trigram frequencies over a selected period.

As input, it works with texts in languages based on the Latin alphabet and enables stopword removal for languages covered by the nltk corpus of stopwords.
For time specification, Arabica currently reads dates in standard date and datetime formats and provides aggregated n-grams in yearly, monthly, or daily frequencies. The standard formats are, e.g., 2013-12-31, 2013/12/31, 31-Dec-2013, 2013-12-31 11:46:17.
Be careful about the difference between European and US date formats. It is recommended to use US-style dates (MM/DD/YYYY) rather than European-style dates (DD/MM/YYYY), since months and days can be mismatched in small datasets.
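To see why the format matters, here is a small pandas sketch: the same ambiguous string parses to different months depending on whether the day or the month is read first (pandas defaults to the US-style, month-first reading).

```python
import pandas as pd

# "03/04/2013" is ambiguous: US style reads it as March 4,
# European style as April 3.
us = pd.to_datetime("03/04/2013")                 # month first (pandas default)
eu = pd.to_datetime("03/04/2013", dayfirst=True)  # day first (European style)

print(us.month, eu.month)  # -> 3 4
```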
Arabica now has a visualization module providing word clouds, heatmaps, and line plots for unigram, bigram, and trigram frequencies. See this tutorial for more examples.
Twitter data use case
Let’s see how it works in the code. We’ll first import arabica and pandas and read the data.
import pandas as pd
from arabica import arabica_freq

data = pd.read_csv("tweets.csv")  # illustrative path to the downloaded dataset
The subset of data we’ll be using contains President Donald Trump’s tweets during the last weeks of his presidential campaign and the first couple of weeks after he took office. Here is what the data looks like:

Extended aggregation options
Time-series n-gram analysis
We can see the data is pretty raw. It contains lots of numbers, special characters, punctuation, and strings we don’t need. We will also skip variations of politicians’ names and other strings specific to this dataset. The text is also made lowercase so that capital letters don’t affect n-gram calculations (e.g., "Tree" is not treated differently from "tree").
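Conceptually, these cleaning steps amount to something like the following hand-rolled sketch (illustrative only, not Arabica’s actual implementation):

```python
import re
import string

def clean_text(text: str) -> str:
    """Minimal sketch of the cleaning steps described above."""
    text = text.lower()                # lowercase so "Tree" == "tree"
    text = re.sub(r"\d+", "", text)    # remove all digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    return " ".join(text.split())      # normalize whitespace

print(clean_text("Great VICTORY in 3 states!! #MAGA"))
# -> "great victory in states maga"
```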
In Arabica, it is all straightforward:
arabica_freq(text = data['tweet_text'],       # Text
             time = data['created_at'],       # Time
             time_freq = 'D',                 # Aggregation period
             max_words = 2,                   # Max number of n-grams to display
             stopwords = ['english'],         # Language for stop words
             skip = ['realdonaldtrump', 'trump',   # Remove additional strings
                     'makehillarybrokeagain',
                     'hillaryclinton',
                     'barackobama',
                     'hillary', 'clintons',
                     'wwmorgan', 'jordanla',
                     'whoisNeil', 'cattmaxx',
                     'vnkbgygv', 'httpstco',
                     'httpstcob', 'httpstcod',
                     'nrclwvhx', 'httpstcoj',
                     'httpstcossh'],
             lower_case = True,               # Lowercase text before cleaning and frequency analysis
             numbers = True,                  # Remove all digits
             punct = True)                    # Remove punctuation
The n-gram frequencies, especially bigrams and trigrams, tell us more about President Trump’s public communication after he took over the presidency. It is better to increase the value of the max_words parameter (say, 4 or 5) to explore it more deeply.

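Under the hood, daily n-gram aggregation boils down to grouping texts by date and counting token sequences. The following toy sketch (hand-rolled, with made-up example tweets; not Arabica’s actual implementation) illustrates the idea for bigrams:

```python
from collections import Counter

import pandas as pd

def ngrams(tokens, n):
    """All consecutive n-token sequences in a list of tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy data standing in for the tweet dataset
df = pd.DataFrame({
    "created_at": pd.to_datetime(["2016-11-01", "2016-11-01", "2016-11-02"]),
    "tweet_text": ["crooked media crooked media",
                   "crooked media again",
                   "great rally tonight"],
})

# Count bigrams per day, keeping the 2 most frequent (cf. max_words = 2)
daily_top = {}
for day, group in df.groupby(df["created_at"].dt.date):
    counts = Counter()
    for text in group["tweet_text"]:
        counts.update(ngrams(text.split(), 2))
    daily_top[str(day)] = counts.most_common(2)

print(daily_top["2016-11-01"])  # -> [('crooked media', 3), ('media crooked', 1)]
```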
Descriptive n-gram analysis
In certain situations, we need to make a simple EDA of the text dataset as the first step before developing more complex analyses. In the code, setting the parameter time_freq = 'ungroup' means that Arabica calculates the unigram, bigram, and trigram frequencies of the whole data, and no time aggregation is made.
Running this code, we create a result dataframe that can be later saved as a CSV file:
result = arabica_freq(text = data['tweet_text'],       # Text
                      time = data['created_at'],       # Time
                      time_freq = 'ungroup',           # No time aggregation made
                      max_words = 6,                   # Shows 6 most frequent n-grams
                      stopwords = ['english'],         # Language for stop words
                      skip = ['realdonaldtrump', 'trump',   # Strings to remove
                              'makehillarybrokeagain',
                              'hillaryclinton',
                              'barackobama',
                              'hillary', 'clintons',
                              'wwmorgan', 'jordanla',
                              'whoisNeil', 'cattmaxx',
                              'vnkbgygv', 'httpstco',
                              'httpstcob', 'httpstcod',
                              'nrclwvhx', 'httpstcoj',
                              'httpstcossh'],
                      lower_case = True,               # Lowercase text
                      numbers = True,                  # Remove all digits
                      punct = True)                    # Remove all punctuation
Here is the output table:

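Saving the result dataframe is a one-liner with pandas. The sketch below uses a stand-in dataframe with illustrative column names, since the arabica_freq output itself is not reproduced here:

```python
import pandas as pd

# Stand-in for arabica_freq's output (column names are illustrative)
result = pd.DataFrame({
    "unigram": ["america", "great"],
    "unigram_freq": [120, 95],
})

result.to_csv("ngram_frequencies.csv", index=False)

# Reading it back confirms the round trip
check = pd.read_csv("ngram_frequencies.csv")
print(check.shape)  # -> (2, 2)
```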
Extended cleaning options
In Arabica 1.0, we can remove multiple sets of stopwords at once. This is useful for mining text from countries with more than one official language (Canada, Switzerland, Belgium, Luxembourg, etc.) or countries with large foreign communities.
To remove English, French, and German stopwords from the text, modify the code in this way:
arabica_freq(text = data['tweet_text'],
             time = data['created_at'],
             time_freq = 'D',
             max_words = 2,
             stopwords = ['english', 'french', 'german'],
             skip = None,
             lower_case = True,
             numbers = True,
             punct = True)
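Conceptually, multi-language stopword removal just filters tokens against the union of the selected stopword sets. The sketch below is hand-rolled with tiny illustrative word lists, not the full nltk corpora Arabica relies on:

```python
# Tiny illustrative samples, not the full nltk stopword corpora
STOPWORDS = {
    "english": {"the", "and", "is"},
    "french": {"le", "la", "et"},
    "german": {"der", "die", "und"},
}

def remove_stopwords(tokens, languages):
    """Drop tokens appearing in any of the selected stopword sets."""
    combined = set().union(*(STOPWORDS[lang] for lang in languages))
    return [t for t in tokens if t not in combined]

tokens = "the market und der trend et la tendance is up".split()
print(remove_stopwords(tokens, ["english", "french", "german"]))
# -> ['market', 'trend', 'tendance', 'up']
```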

Real-life applications
Let’s mention several examples where Arabica might be helpful. In marketing, businesses use market positioning methods to influence consumer perception of a brand or product relative to competitors. The objective is to establish the image or identity of a company or product so that consumers perceive it in a certain way.
Product and site reviews provide excellent data to evaluate positioning strategies. Did we receive bigrams such as "fast delivery," "excellent service," "great service," "quick delivery," or "good prices" with high frequency? Does it align with how we want our customers to perceive our product? Does it change over time?
In political science and political economy, analyzing public discourse with content analysis methods is widespread (e.g., Saraisky, 2015). The essential topics we might focus on here include the consequences of fake news and populism, public attitudes towards specific problems (immigration, weapons possession, etc.), and case studies of individual politicians or political parties. Twitter tweets provide a good data source for these investigations.
These are not the only issues that can be studied on time-series text datasets. You may find many other use cases where text has a time-series character; some of them are developed in Arabica’s documentation.
EDIT: Arabica now has a visualization module to display text as a heatmap, word cloud, and line plot, and a sentiment and structural breaks analytical module. Read more in tutorials for visualization, sentiment analysis, meta-data analysis, and customer satisfaction measurement.
Did you like the article? You can invite me for coffee and support my writing. You can also subscribe to my email list to get notified about my new articles. Thanks!
References:
[1] Saraisky, N.G. 2015. Analyzing Public Discourse: Using Media Content Analysis to Understand the Policy Process. _Current Issues in Comparative Education_ 18(1), 26–41.





