
Arabica: A Python Package for Exploratory Analysis of Text Data

Arabica provides unigram, bigram, and trigram frequencies by time period in a single line of code. Learn more in this tutorial.

Photo by Artem Sapegin on Unsplash

Text data is often recorded as a time series with significant variability over time. Some examples of time-series text data include Twitter tweets, product reviews, and newspaper headlines. Exploratory analysis of such datasets is a non-trivial coding task. Arabica makes it simple in a single line of Python code.

Arabica takes a data frame of text data as the input, enables standard cleaning operations (number, punctuation, and stopword removal), and provides unigram (e.g., dog), bigram (e.g., dog goes), and trigram (e.g., dog goes home) frequencies over a monthly or yearly period.

Figure 1: Scheme of arabica_freq method

It uses cleantext, an excellent Python library, for punctuation cleaning, and the NLTK corpus of stopwords for pre-processing. A list of the languages with available stopwords is printed with:

Example use case

Let’s illustrate Arabica’s usage on the IMDb 50K Movie Reviews dataset (see the data license). To add a time dimension, the time column contains synthetic dates in the ‘yyyy-mm-dd’ format. Here is what the subset of data looks like:

Figure 2: IMDb 50K Movie Reviews data subset

1. First look at the data

Let’s first look at the raw data in yearly frequency to find out more about the narrative of movie reviewers over time. We’ll read the data:
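In practice you would pd.read_csv the downloaded IMDb file; for a self-contained sketch, a few inline rows in the same shape (the column names review and time are my assumption, following the subset in Figure 2):

```python
import io
import pandas as pd

# a tiny inline sample standing in for the IMDb 50K reviews file
csv = io.StringIO(
    "review,time\n"
    "A wonderful little production.,2011-06-14\n"
    "The film was terrible and boring.,2012-01-03\n"
)
data = pd.read_csv(csv)
print(data.shape)  # → (2, 2)
```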

Then we call arabica_freq, specifying yearly aggregation and keeping numbers and punct as False and stopwords as None, to look at the raw data, including stopwords, digits, and special characters. max_words is set to 2 so that the output table is easy to read.

EDIT Jan 2023: Arabica has been updated. Check the documentation for the full list of parameters.

Here is the table of aggregated n-gram frequencies for the first six years:

Figure 3: arabica_freq output, yearly n-gram frequencies

We can see that the data contains lots of unnecessary prepositions and other stopwords that should be removed.


2. More detailed inspection of clean data

Next, we’ll remove numbers, punctuation, and English stopwords and display monthly n-gram frequencies to dig deeper into the cleaned data.

We can see significant variation in the text data over time (note that we are working with a synthetic example dataset). The first five rows of the table are:

Figure 4: arabica_freq output, monthly n-gram frequencies

Photo by River Fx on Unsplash

Conclusions

This package glorifies the best drink in the world, and hopefully it will save you some time doing exploratory analysis of text data. Arabica’s main benefits are:

  • coding efficiency – the EDA is done in one line of code
  • cleaning implementation – no need for prior text data pre-processing
  • solid performance – runs fast even with datasets of (tens of) thousands of rows.

Arabica is available from PyPI. For source files, go to my GitHub, and also read the documentation. Enjoy it, and please let me know how it worked on your projects!

EDIT: Arabica now has a visualization module to display text as a heatmap, word cloud, and line plot, and a sentiment and structural breaks analytical module. Read more in tutorials for visualization, sentiment analysis, meta-data analysis, and customer satisfaction measurement.


Did you like the article? You can invite me for coffee and support my writing. You can also subscribe to my email list to get notified about my new articles. Thanks!


Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.
