
The data science workflow generally involves collecting, pre-processing, and cleaning data in the first phase. Exploratory data analysis (EDA) comes next and should never be skipped. By looking at the data, displaying summary statistics, and plotting charts, we can understand the dataset’s structure, find the outliers, observe the distribution of variables, and make initial hypotheses. Text data is no exception.
Performing an elementary EDA of text data in Python, we can describe the data with pandas, plot histograms and a heatmap with seaborn or plotly, write some code to compute term frequencies, and use NLTK to extract bigrams and trigrams. Finally, term frequencies can be displayed in bar charts with matplotlib or plotly. There are many other ways to do it, and although it is fun, it takes quite some coding time.
TextData is a Python library designed to explore and analyze text data. It provides the primary methods for text data exploration and aims to perform the essential EDA tasks efficiently, with as little coding as possible.
In this article, you’ll go over the essential EDA methods with TextData:
- a data summary of the text corpus
- bigram and trigram calculations
- data visualization (histograms and bar charts).
The classic IMDb 50K Movie Reviews dataset fits our purpose well (the data license is here). A subset of 5,000 reviews was cleaned in advance (numbers, stopwords, and special characters removed) and used as the corpus.
1. Prepare the corpus
Here is what the data looks like:

We create the corpus from the text of reviews as follows:
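The original code cell is not reproduced here. As a plain-Python sketch of the same step (the sample reviews are made up, and the corpus is represented simply as a list of tokenized documents rather than TextData's own corpus object):

```python
# Made-up cleaned reviews standing in for the IMDb subset.
reviews = [
    "excellent movie great acting",
    "terrible plot weak acting",
]

# Treat the corpus as a list of tokenized documents:
# one list of words per review.
corpus = [review.split() for review in reviews]
print(corpus[0])  # ['excellent', 'movie', 'great', 'acting']
```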
Let’s also split the corpus into positive and negative sentiment subsets and set their indexing:
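A minimal pandas sketch of this split (the column names `review` and `sentiment` and the sample rows are assumptions, not the dataset's actual schema):

```python
import pandas as pd

# Made-up labeled reviews; column names are assumptions.
df = pd.DataFrame({
    "review": ["excellent movie", "weak film", "great story"],
    "sentiment": ["positive", "negative", "positive"],
})

# Split by sentiment label and reset the indexing of each subset.
positive = df[df["sentiment"] == "positive"].reset_index(drop=True)
negative = df[df["sentiment"] == "negative"].reset_index(drop=True)

positive_corpus = [r.split() for r in positive["review"]]
negative_corpus = [r.split() for r in negative["review"]]
print(len(positive_corpus), len(negative_corpus))  # 2 1
```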
Once we have created the corpus, we can use several functions to search for specific words, phrases, or documents that include them. Here, we query the corpus for the number and percentage of reviews that contain a bigram "excellent movie":
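TextData exposes such queries through its corpus object; since its exact API is not reproduced here, the following plain-Python sketch (with made-up reviews) illustrates the same computation:

```python
# Count reviews that contain the bigram "excellent movie".
reviews = [
    "an excellent movie with great acting",
    "boring plot and weak acting",
    "truly excellent movie",
]

def contains_bigram(text, bigram=("excellent", "movie")):
    words = text.split()
    # Check every adjacent word pair against the target bigram.
    return any(pair == bigram for pair in zip(words, words[1:]))

matches = sum(contains_bigram(r) for r in reviews)
share = 100 * matches / len(reviews)
print(matches, f"{share:.1f} %")  # 2 66.7 %
```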
The IMDb dataset includes 195 reviews containing this phrase, which amounts to 3.9 % of all the reviews in our subset.

2. Summarize the data
We’ll now look at the corpus structure more closely, summarize the data and do some elementary frequency analysis.
There are 49 539 unique words in the corpus, out of 611 098 words overall.
The most frequent words in reviews with positive and negative sentiment labels are printed this way:
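Both the summary counts and the frequency lists can be sketched with the standard library alone (the reviews below are made up for illustration):

```python
from collections import Counter

# Made-up positive-sentiment reviews.
positive_reviews = ["great movie great fun", "great story good acting"]
tokens = [word for review in positive_reviews for word in review.split()]

# Vocabulary size vs. total word count.
print(len(set(tokens)), len(tokens))  # 6 8

# The most frequent words in the subset.
print(Counter(tokens).most_common(10))
```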

Let’s now print the 10 most frequent bigrams from positive reviews. Correspondingly, we can calculate the most frequent trigrams.
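As a plain-Python illustration of the underlying computation (NLTK's `nltk.bigrams` and `nltk.trigrams` produce the same word tuples; the sample text is made up):

```python
from collections import Counter

words = "great movie great movie really great movie".split()

# Adjacent word pairs and triples via zip over shifted views.
bigrams = Counter(zip(words, words[1:]))
trigrams = Counter(zip(words, words[1:], words[2:]))

print(bigrams.most_common(3))
```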

With a bit of help from NumPy, we’ll render a bar chart with the most frequent bigrams in this subset:
Quite strangely, the bar chart plots the phrases unordered.
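If needed, this can be worked around by sorting the frequencies yourself before handing them to the plotting library. A minimal sketch with made-up counts:

```python
# Illustrative bigram counts (made up); sorting descending
# before plotting keeps the bars in order.
freqs = {"excellent story": 210, "good film": 320, "weak plot": 150}
ordered = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)

labels = [phrase for phrase, _ in ordered]
counts = [count for _, count in ordered]
print(labels)  # ['good film', 'excellent story', 'weak plot']
```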

3. Check the distribution
Checking the data distribution is a necessary part of EDA. TextData offers easy-to-implement histograms with a simple line of code.
Here, we plot the distributions of the length of reviews by sentiment label:
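The raw data behind such histograms is simply the word count of each review, grouped by sentiment label; a plain-Python sketch with made-up reviews:

```python
# Made-up labeled reviews.
reviews = [
    ("positive", "short and sweet"),
    ("negative", "a long rant about everything wrong with this film"),
    ("positive", "fun watch"),
]

# Word counts per review, grouped by sentiment label.
lengths = {"positive": [], "negative": []}
for label, text in reviews:
    lengths[label].append(len(text.split()))

print(lengths)  # {'positive': [3, 2], 'negative': [9]}
```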
We can see that unhappy reviewers (reviews labeled with negative sentiment) tend to write their hearts out, while satisfied reviewers write slightly shorter comments.

_TextData also offers some other practical methods (heatmap, frequency map, log-odds ratio, etc.). Don’t hesitate to explore them yourself here._
Conclusion
TextData improves the efficiency of anyone working with text data. A great advantage is its integration with Altair, a powerful library for high-quality data visualization. It further speeds up EDA investigations and shortens the code.
On the downside, TextData currently does not provide a couple of graphics (e.g., word cloud) that one would expect from a complete text analytical tool. The library still seems to be under development, and some minor issues (e.g., the unordered bar chart in Fig. 4) should be resolved in later releases.
The complete Python code is available on my GitHub, so feel free to use it.
Did you like the article? You can invite me for coffee and support my writing. You can also subscribe to my email list to get notified about my new articles. Thanks!





