
Fixing Google Trends Data Limitations

Google Trends data suffer from several drawbacks. TrendEcon, a marvelous R package, tackles them and helps create consistent long-run time…

Photo by Claudio Schwarz on Unsplash

Google Trends (GT) is a publicly available database that provides aggregated data on Google search queries across various regions and languages. Besides the web page, where users can download data manually, and the Google Trends Datastore, which provides pre-processed datasets on multiple topics, there are APIs for Python (PyTrends) and R that allow accessing the data automatically. A free Data Studio connector also enables real-time reporting with GT data.

Although companies and researchers use GT heavily, the peculiarities and limitations of this dataset are discussed far less often. Google does not provide search volumes but aggregates them into an index that reflects the variation across time and relative differences to other search terms. Issues such as frequency inconsistency, seasonality, and random sampling can introduce significant bias when analyzing the time series.

In this article, you’ll go over:

  • Limitations of GT data
  • The TrendEcon R package, which fixes many of the potential problems resulting from the data normalization

Google Trends limitations

Let’s go a bit deeper into the literature on the limitations of GT data. Eichenauer et al. (2021) point out several. The first comes from the construction of the data itself: for privacy reasons, the raw volumes are transformed into an index.

How does Google transform the raw data?

  1. Each data point is divided by the total searches of the geography and time range it represents to compare relative popularity.
  2. The resulting numbers are then scaled on a range of 0 to 100 based on a topic’s proportion to all searches on all topics (Google, 2022).
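The two steps can be sketched in a few lines of Python. This is a minimal illustration, not Google's actual procedure: the counts are hypothetical, and the assumption that the largest share in the requested window is mapped to 100 is a simplification of step 2.

```python
def to_index(keyword_counts, total_counts):
    """Turn raw counts into a 0-100 index, following the two steps above."""
    # Step 1: relative popularity = keyword searches / all searches in the period.
    shares = [k / t for k, t in zip(keyword_counts, total_counts)]
    # Step 2 (assumed): rescale so the largest share in the window equals 100.
    peak = max(shares)
    return [round(100 * s / peak) for s in shares]


# Hypothetical monthly counts for one keyword and for all searches.
keyword = [120, 150, 300, 600, 450]
total = [100_000, 100_000, 120_000, 120_000, 150_000]

full_window = to_index(keyword, total)           # all five months
short_window = to_index(keyword[:3], total[:3])  # first three months only

print(full_window)
print(short_window)
```

Under these assumptions, the same three months receive different index values depending on the window they are requested in, because the peak used for rescaling changes.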

As a result, the long-run consistency of the time series depends strongly on the time frame and frequency of the data. In principle, we can rely on these rules offered by Eichenauer et al. (2021):

  1. Monthly data captures the long‐term trend in search activity in the most accurate way.
  2. Weekly data is best to analyze the searches over a few weeks.
  3. Daily data is best to analyze short-term behavior over several days.

The aggregation of daily, weekly, and monthly series to lower frequency would lead to different results for any keyword. This frequency inconsistency means that daily data fail to capture long-run trends. There is, therefore, a trade‐off between using high‐frequency daily series neglecting long‐run trends versus time‐consistent monthly series, where search volumes are comparable at different and distant points in time.
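A toy Python example may make the frequency inconsistency concrete. It assumes, for illustration only, that daily data arrive in separate short windows, each rescaled to its own maximum, while one monthly download covers the whole period. All numbers are hypothetical.

```python
def rescale(values):
    """Rescale a window so that its own maximum equals 100 (assumed behavior)."""
    peak = max(values)
    return [100 * v / peak for v in values]


def mean(values):
    return sum(values) / len(values)


# Hypothetical true search shares: interest is ten times higher in month 2.
month_1 = [1.0, 2.0, 1.0, 2.0]
month_2 = [10.0, 20.0, 10.0, 20.0]

# Each daily window is rescaled on its own.
daily_1 = rescale(month_1)
daily_2 = rescale(month_2)

# A single monthly download covers both months in one window.
monthly = rescale([mean(month_1), mean(month_2)])

# Aggregating the daily windows to monthly means hides the tenfold increase.
aggregated_daily = [mean(daily_1), mean(daily_2)]

print(aggregated_daily)
print(monthly)
```

The aggregated daily series is flat across the two months, whereas the monthly series preserves the difference in level. The daily series, in turn, keeps the within-month pattern that the monthly series cannot show.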

Since GT data are time series, they contain all time-series components, including seasonality. Ignoring seasonality (e.g., by restricting the window to a period with a strong seasonal effect) can also bias the interpretation of results.

Another limitation of the data arises from random sampling by Google. They explain the reason quite pragmatically:

"While only a sample of Google searches are used in Google Trends, this is sufficient because we handle billions of searches per day. Providing access to the entire data set would be too large to process quickly. By sampling data, we can look at a dataset representative of all Google searches, while finding insights that can be processed within minutes of an event happening in the real world." Google (2022).

However, in small countries or regions, the sampling variation of the returned series might be high. There can be substantial sampling noise for keywords with limited search volume, even in large administrative units (Eichenauer et al., 2021). This means that by collecting time series repeatedly for different periods, we obtain, to some extent, different datasets.
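The effect of sampling noise, and why drawing several samples helps, can be simulated with the Python standard library. The sketch below is an assumption-laden illustration: the true share, the sample size, and the number of draws are hypothetical and say nothing about how Google actually samples.

```python
import random
import statistics


def sampled_share(rng, true_share, sample_size):
    """Share of keyword searches in one random sample of all searches."""
    hits = sum(rng.random() < true_share for _ in range(sample_size))
    return hits / sample_size


def averaged_share(rng, true_share, sample_size, draws):
    """Average the share over several independent samples."""
    return statistics.mean(
        sampled_share(rng, true_share, sample_size) for _ in range(draws)
    )


rng = random.Random(42)  # fixed seed so the sketch is reproducible
TRUE_SHARE = 0.01        # hypothetical low-volume keyword
SAMPLE_SIZE = 1_000      # hypothetical number of searches per sample

# Fifty single downloads versus fifty averages of eight downloads each.
single = [sampled_share(rng, TRUE_SHARE, SAMPLE_SIZE) for _ in range(50)]
averaged = [
    averaged_share(rng, TRUE_SHARE, SAMPLE_SIZE, draws=8) for _ in range(50)
]

noise_single = statistics.stdev(single)
noise_averaged = statistics.stdev(averaged)
print(noise_single, noise_averaged)
```

Averaging over repeated draws shrinks the spread of the estimates, which is the intuition behind collecting the same series several times.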

Photo by Tamara Gak on Unsplash

How can we deal with these shortcomings?

Fortunately, there is a fix for these problems. TrendEcon interacts with the Google Trends API and provides tools for constructing frequency-consistent, long-run time series with just one line of open-source R code.

It deals with the limitations of the GT data. More specifically, it:

  • provides sampling adjustment by drawing multiple samples for each keyword and frequency
  • uses Chow and Lin’s (1971) disaggregation routine, principal component analysis (PCA), and other econometric methods to obtain time and frequency-consistent time series.
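To give a feel for what making a daily series consistent with a monthly one means, here is a deliberately simple Python sketch. It is not Chow and Lin's regression-based routine and not TrendEcon's implementation. It uses plain proportional benchmarking as a stand-in: each month's daily values are rescaled so that their mean matches the monthly level. The function name and all numbers are hypothetical.

```python
def benchmark_to_monthly(daily_by_month, monthly_levels):
    """Rescale each month's daily values so their mean equals the monthly level.

    Simple pro-rata benchmarking, used here only as a stand-in for a proper
    temporal disaggregation method such as Chow-Lin.
    """
    adjusted = []
    for days, level in zip(daily_by_month, monthly_levels):
        factor = level / (sum(days) / len(days))
        adjusted.append([d * factor for d in days])
    return adjusted


# Hypothetical daily indices, each month rescaled to its own peak of 100 ...
daily_by_month = [[50.0, 100.0, 50.0, 100.0], [50.0, 100.0, 50.0, 100.0]]
# ... and a hypothetical monthly index that carries the long-run level.
monthly_levels = [10.0, 100.0]

consistent = benchmark_to_monthly(daily_by_month, monthly_levels)
for month in consistent:
    print([round(v, 1) for v in month])
```

The adjusted daily values keep their within-month pattern but now average to the monthly levels, so days in different months become comparable.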

Let’s illustrate its application with several examples:

1. Downloading Google Trends data with R

First, to install TrendEcon, type:

To download a series for the keywords "ukraine", "putin", and "invasion" in the US, run:

Here is what the data look like:

Image by author, using ggplot2

2. Construct robust and consistent time series

To construct a robust and consistent daily time series, daily, weekly, and monthly data are downloaded and consistently aggregated using the Chow-Lin methodology.

Here is the code for "ukraine" searches in the US. It is very simple because TrendEcon does all the work for us:

The data show expected behavior:

Image by author, using ggplot2

3. Combine multiple keywords into a single time series

Sometimes, we need to monitor several keywords related to a specific topic. Say, for example, that we need to measure the brand awareness of a company called "Apollo-Optic" over time in Austria. Customers might search for all variations of this brand, e.g., "apollo glasses", "apollo contact lenses", "apollo optik", etc.

Instead of monitoring the words separately, we can create a composite index from all keywords and use it in our reporting. TrendEcon performs a principal component analysis on the normalized series and extracts the first principal component as the common signal in the supplied keyword series (Eichenauer et al., 2021). This tutorial provides a step-by-step procedure for combining keywords into composite indicators.
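The idea of extracting a common signal can be sketched in Python without any external library. The example below is an illustration, not TrendEcon's code: it assumes "normalized" means centering each series and scaling it to unit variance, it finds the first principal component by power iteration, and the three keyword series are hypothetical.

```python
import statistics


def standardize(series):
    """Center a series and scale it to unit standard deviation."""
    mu, sd = statistics.mean(series), statistics.pstdev(series)
    return [(x - mu) / sd for x in series]


def first_component(columns, iterations=200):
    """Scores on the first principal component, found by power iteration."""
    z = [standardize(c) for c in columns]
    k, n = len(z), len(z[0])
    # Correlation matrix of the standardized series.
    cov = [[sum(a * b for a, b in zip(z[i], z[j])) / n for j in range(k)]
           for i in range(k)]
    # Repeatedly multiply by the matrix to approach the leading eigenvector.
    w = [1.0] * k
    for _ in range(iterations):
        w = [sum(cov[i][j] * w[j] for j in range(k)) for i in range(k)]
        norm = sum(x * x for x in w) ** 0.5
        w = [x / norm for x in w]
    # Project every time point onto the leading eigenvector.
    return [sum(w[i] * z[i][t] for i in range(k)) for t in range(n)]


# Hypothetical index values for three spellings of one brand.
series = [
    [10, 20, 30, 40, 50, 45],
    [12, 18, 33, 41, 48, 47],
    [9, 22, 28, 39, 52, 44],
]
composite = first_component(series)
print([round(v, 2) for v in composite])
```

The resulting composite rises and falls with the movement the three series share, so one index can be tracked in place of three.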

Conclusion

TrendEcon expands what researchers performing time-series investigations, and analysts who set up reporting of important metrics for their companies, can do with GT data. Here is the list of all functions for processing the data it provides. For further reading, you might check my article on building GT datasets with Python.


References

[1] Chow, G.C., Lin, A. (1971). Best linear unbiased interpolation, distribution, and extrapolation of time series by related series. The Review of Economics and Statistics, 53(4), 372–375.

[2] Eichenauer, V. Z., Indergand, R., Martínez, I. Z., Sax, C. (2021). Obtaining consistent time series from Google Trends. Economic Inquiry, 60, 694–705.

[3] Google (2022). FAQ about Google Trends data. Retrieved from: https://support.google.com/trends/answer/4365533?hl=en.

