
Visualization Module in Arabica Speeds Up Text Data Exploration

Arabica now offers unigram, bigram, and trigram word clouds, heatmaps, and line charts to further accelerate time-series text data analysis

Figure 1. Bigram word cloud, image by author.

1. Introduction

Arabica is a Python library for exploratory text data analysis, focusing on text from a time-series perspective. It reflects the empirical reality that many text datasets are collected as repeated observations over time. Time-series text data include newspaper article headlines, research article abstracts and metadata, product reviews, social network communication, and many others. Arabica simplifies exploratory data analysis (EDA) of these datasets by providing these methods:

  • arabica_freq: descriptive and time-series n-gram analysis, for n-gram-based EDA of a text dataset
  • cappuccino: visual exploration of the data

This article provides an introduction to Cappuccino, Arabica’s visualization module for exploratory analysis of time-series text data. Read the documentation and a tutorial here for a general introduction to Arabica.

EDIT Jan 2023: Arabica has been updated. Check the documentation for the full list of parameters.

2. Cappuccino, visualization for exploratory text data analysis

The plots implemented are word cloud (unigram, bigram, and trigram versions), heatmap, and line plot. They help discover (1) the most frequent n-grams for the whole data reflecting its time-series character (word clouds) and (2) n-grams development over time (heatmap, line plot).

The graphs are designed for use in presentations, reports, and empirical studies. They are, therefore, rendered in high resolution (the pixel dimensions depend on the data range displayed in the graphs).

Cappuccino relies on matplotlib, wordcloud, and plotnine to create and display graphs, and on cleantext and the NLTK corpus of stopwords for pre-processing. Plotnine brings the popular and widely used ggplot2 library to Python. The requirements are here.

The method’s parameters are:

def cappuccino(text: str,                # Text
               time: str,                # Time
               plot: str = '',           # Chart type: 'wordcloud'/'heatmap'/'line'
               ngram: int = '',          # N-gram size, 1 = unigram, 2 = bigram, 3 = trigram
               time_freq: str = '',      # Aggregation period: 'Y'/'M', if no aggregation: 'ungroup'
               max_words: int = '',      # Max number of most frequent n-grams displayed for each period
               stopwords: list = [],     # Languages for stop words
               skip: list = [],          # Remove additional strings
               numbers: bool = False,    # Remove numbers
               punct: bool = False,      # Remove punctuation
               lower_case: bool = False  # Lowercase text
)

3. Descriptive n-gram visualization

Descriptive analysis in Arabica provides n-gram frequency calculations without aggregation over a specific period. In simple terms: first, n-gram frequencies are calculated for each text record; second, the frequencies are summed over the whole dataset; finally, the summed frequencies are visualized in a plot.
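The per-record counting and dataset-wide summing steps can be sketched with plain pandas and collections.Counter. This is an illustrative reimplementation of the idea, not Arabica's internal code:

```python
from collections import Counter

import pandas as pd

def ngram_totals(texts: pd.Series, n: int = 1) -> Counter:
    """Count n-grams per record, then sum the counts across the dataset."""
    total = Counter()
    for text in texts:
        tokens = text.lower().split()
        ngrams = zip(*(tokens[i:] for i in range(n)))  # sliding n-gram windows
        total.update(" ".join(gram) for gram in ngrams)
    return total

headlines = pd.Series(["police probe fire", "fire crews battle fire"])
print(ngram_totals(headlines, n=1).most_common(1))  # → [('fire', 3)]
```

Passing n=2 or n=3 would count bigrams or trigrams instead, mirroring the ngram parameter of cappuccino.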

Word cloud

Let’s illustrate the coding on the Million News Headlines dataset, which contains news headlines published daily from 2003-02-19 to 2016-09-18. The dataset is provided by the Australian Broadcasting Corporation under the CC0: Public Domain license. We’ll subset the data to the first 50 000 headlines.

First, install Arabica with pip install arabica, then import Cappuccino:

from arabica import cappuccino

After reading the data with pandas, the data looks like this:

Figure 2. Million News Headlines data
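Loading the data might look like the sketch below. The two-row CSV is a stand-in for the real file (the rows shown are made up), and the date/headline column names are chosen to match the columns used in the calls that follow:

```python
import io

import pandas as pd

# Small stand-in for the Million News Headlines CSV (illustrative rows only)
csv = io.StringIO(
    "date,headline\n"
    "2003-02-19,council to debate water restrictions\n"
    "2003-02-19,crews battle grass fire near highway\n"
)

data = pd.read_csv(csv, parse_dates=["date"])
data = data.head(50000)  # subset to the first 50 000 headlines
print(data.dtypes)
```

With the real dataset, you would point pd.read_csv at the downloaded file instead of the StringIO buffer.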

We lowercase the text, clean the data from punctuation and numbers, remove English stopwords and other unwanted strings ("g", "br"), and plot a word cloud with the 100 most frequent words:

cappuccino(text = data['headline'],
           time = data['date'],
           plot = 'wordcloud',
           ngram = 1,               # n-gram size, 1 = unigram, 2 = bigram, 3 = trigram
           time_freq = 'ungroup',   # no period aggregation
           max_words = 100,         # displays 100 most frequent words
           stopwords = ['english'], # remove English stopwords
           skip = ['g','br'],       # remove additional strings
           numbers = True,          # remove numbers
           punct = True,            # remove punctuation
           lower_case = True        # lowercase text
)

It returns the word cloud:

Figure 3. Word cloud, image by author.

After changing to ngram = 2, we receive a word cloud with the 100 most frequent bigrams (see the cover picture). Alternatively, ngram = 3 displays the most frequent trigrams:

Figure 4. Word cloud - trigram, image by author.

4. Time-series n-gram visualization

Time-series text data typically display variability over time. Political statements before elections and newspaper headlines during the Covid-19 pandemic are good examples. To display the n-grams over time, Arabica implements a heatmap and a line plot with monthly and yearly aggregation periods.
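The monthly aggregation behind these plots can be approximated with a pandas groupby on a monthly Period. This is a rough sketch of the idea, not Arabica's internal code:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2003-02-19", "2003-02-20", "2003-03-01"]),
    "headline": ["police probe fire", "fire crews respond", "budget talks begin"],
})

# Explode headlines into one token per row, then count tokens per month
tokens = (
    df.assign(token=df["headline"].str.split())
      .explode("token")
      .assign(month=lambda d: d["date"].dt.to_period("M"))
)
monthly = tokens.groupby(["month", "token"]).size().rename("freq").reset_index()

# Keep the 10 most frequent tokens per month, as the heatmap does
top = monthly.sort_values("freq", ascending=False).groupby("month").head(10)
```

The resulting month-by-token frequency table is exactly the shape a heatmap or line plot needs: one axis for periods, one for n-grams, and a value per cell.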

Image by author, source: Draw.io

Heatmap

A heatmap with the ten most frequent words in each month is displayed with the following code:

cappuccino(text = data['headline'],
           time = data['date'],
           plot = 'heatmap',
           ngram = 1,               # n-gram size, 1 = unigram, 2 = bigram
           time_freq = 'M',         # monthly aggregation
           max_words = 10,          # displays 10 most frequent words for each period
           stopwords = ['english'], # remove English stopwords
           skip = ['g', 'br'],      # remove additional strings
           numbers = True,          # remove numbers
           punct = True,            # remove punctuation
           lower_case = True        # lowercase text
)

The unigram heatmap is the output:

Figure 5. Heatmap - unigram, image by author.

The unigram heatmap gives us a first look at the variability of the data over time. We can clearly identify the important patterns in the data:

  • most frequent n-grams: "us", "police", "new", "man"
  • outliers (terms appearing in only one period): "war", "wa", "rain", "killed", "iraqi", "concerns", "budget", "bali"

We might consider removing the outliers at a later stage of the analysis. Alternatively, we can create a bigram heatmap by changing to ngram = 2 and max_words = 5, which displays the five most frequent bigrams in each period.
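One way to find such single-period outliers programmatically is to count how many periods each term occurs in; terms seen in only one month are candidates for the skip list. A sketch on a toy month-by-term frequency table (the numbers are made up for illustration):

```python
import pandas as pd

# Toy month-by-term frequency table (month, term, freq)
monthly = pd.DataFrame({
    "month": ["2003-02", "2003-02", "2003-03", "2003-03", "2003-04"],
    "term":  ["police",  "bali",    "police",  "war",     "police"],
    "freq":  [12, 9, 15, 7, 11],
})

# Terms appearing in exactly one period are outlier candidates
periods_per_term = monthly.groupby("term")["month"].nunique()
outliers = sorted(periods_per_term[periods_per_term == 1].index)
print(outliers)  # → ['bali', 'war']
```

The resulting list could then be passed to the skip parameter of cappuccino in a follow-up run.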

Figure 6. Heatmap - bigram, image by author.

Line plot

A line plot of n-grams is displayed by setting plot = 'line'. With the ngram parameter set to 1 and max_words = 5, we create a line plot of the five most frequent words in each period:

Figure 7. Line plot - unigram, image by author.

Similarly, with ngram = 2 and max_words = 3, the bigram line plot looks like this:

Figure 8. Line plot - bigram, image by author.

Final remarks

Cappuccino greatly helps with the visual exploration of text data that has a time-series character. With a single line of code, we pre-process the data and get a first exploratory glimpse of the dataset. Here are several tips to follow:

  • The best aggregation frequency depends on the length of the time dimension in the data. For long time series, a monthly plot will not display the data clearly, while for short time series (less than a year), a yearly plot will not show any variability over time.
  • Select a suitable form of visualization based on the dataset in your project. A line plot is not a good choice for datasets with high n-gram variability over time (see Figure 8). In such cases, a heatmap gives a clearer picture even with many n-grams per period.
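A simple heuristic for the first tip is to compute the span of the time column and pick 'M' for short series and 'Y' for long ones. The 36-month cutoff below is an arbitrary assumption, purely for illustration:

```python
import pandas as pd

def suggest_time_freq(dates: pd.Series, yearly_after_months: int = 36) -> str:
    """Suggest 'M' for shorter series and 'Y' for longer ones (arbitrary cutoff)."""
    dates = pd.to_datetime(dates)
    start, end = dates.min(), dates.max()
    span_months = (end.year - start.year) * 12 + (end.month - start.month)
    return "Y" if span_months > yearly_after_months else "M"

print(suggest_time_freq(pd.Series(["2003-02-19", "2016-09-18"])))  # prints Y
```

For the Million News Headlines data, spanning more than 13 years, this suggests yearly aggregation; a few months of tweets would get 'M' instead.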

Some questions we can answer with Arabica: (1) how the concepts in a specific domain (economics, biology, etc.) evolved over time, using research article metadata; (2) which key topics were emphasized during a presidential campaign, using tweets; (3) which parts of its brand and communication a company should improve, using customer product reviews.

The complete code in this tutorial is on my GitHub. For more examples, read the documentation and a tutorial on the arabica_freq method.

EDIT: Arabica now has a sentiment and structural breaks analytical module. Read more and check the practical applications in the tutorials in the documentation.


Photo by Kanwardeep Kaur on Unsplash
