
Visualizing Graph Embeddings with t-SNE in Python

How to qualitatively evaluate Neo4j graph embeddings

Hands-on Tutorials

Image by Martin Grandjean, licensed under the Creative Commons Attribution-Share Alike 4.0 International license. No changes were made to the original image.

Introduction

In my previous post we discussed the purpose and nature of graph embeddings. The main idea is that to do machine learning on a graph we need to convert the graph into a series of vectors (embeddings) that we can then use to train our machine learning (ML) models.

The catch is that graph embeddings can be difficult to tune. Similar to other ways to create embeddings and the models they are used for, there are a lot of hyperparameters that we need to consider and optimizing them to the specific application takes time. The subject of tuning the embeddings is something I will save for a future post.

This post is about developing an intuition for the hyperparameters of the embeddings. More specifically, I am going to show you how to create graph embeddings with the Graph Data Science (GDS) library in Neo4j and then visualize them in a Streamlit dashboard. Tomaz Bratanic developed the capability to visualize embeddings in Neo4j's NEuler tool, but I am going to demonstrate how to come at this in a purely Pythonic way.

You can find all of the code for this tutorial in this GitHub repo.

Getting started

The first thing we need to do is to create a graph to use for embedding creation using a free Neo4j Sandbox instance. For this demonstration we are going to use a graph that is pre-built into one of the Sandboxes, namely the Game of Thrones graph that has been the source of a lot of graph examples. (This graph goes through the first 5 books and I will warn you that Game of Thrones spoilers are coming up!)

Neo4j Sandbox

When you get to the main Sandbox page, you will want to select the Graph Data Science type with pre-built data and launch the project:

Select the Graph Data Science image with pre-built data. (Image by author.)

You will see the instance be created. Then you want to grab the connection details using the drop down to the right:

The connection details of the instance. (Image by author.)

Cool. Be sure to grab the Bolt URL and password, since those will be used to make our connection. If you click on Open > Open with Browser and then click on the database icon at the upper left, you should see that you have a pre-populated graph of 2,642 nodes representing people, places, etc. and 16,474 relationships of a variety of types.

Nodes and relationships for the Game of Thrones graph. (Image by author.)

At this point, you will want to go into the repo (you cloned it, right?) and adjust the Dockerfile with this information. I like to use Docker so the results are reproducible. Based on the image above, you will edit the last line of the Dockerfile to read
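The exact line depends on how the repo's Dockerfile is set up, but a hypothetical sketch (the script name, flag names, and credentials here are all placeholders; substitute your own Bolt URL and password) might look like:

```dockerfile
# Hypothetical example — substitute your own Sandbox Bolt URL and password
CMD ["streamlit", "run", "embedding_explorer.py", "--", \
     "--uri", "bolt://<your-sandbox-ip>:7687", \
     "--user", "neo4j", \
     "--password", "<your-sandbox-password>"]
```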

(This Sandbox instance will be taken down by the time this article is published.)

Excellent! Now we can build the Streamlit container that will be used to make the connection. I really like Streamlit because it allows you to quickly create dashboards, which can be exceptionally sophisticated, with minimal coding overhead. We will now build the container via the command line with the standard command:

docker build -t neo_stream .

and then we will fire it up with

docker run -p 8501:8501 -v $PWD/src:/examples neo_stream

(Note that you will need to adjust these commands if you are on Windows.)

Visualizing embeddings in the dashboard

We now have our Streamlit container up and running and connected to our Neo4j Sandbox instance. The running container will print the URL you should navigate to in your browser; it will look something like http://172.17.0.2:8501. When you do so, you should see something that looks like this:

Screenshot of Streamlit dashboard. (Image by author.)

Nice. Now let’s see what we have here. The first thing we can do is hit "Get graph list." This will do two things. First, if you get some text back, you know that you have correctly made the connection to the Sandbox. Second, if there are any in-memory graphs created by GDS (see the API docs and this post to learn about those), then they will be listed here. Since we are just getting started, there shouldn’t be any.
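Under the hood, that button presumably calls the GDS graph catalog listing procedure, which you could also run yourself in the Neo4j Browser (a rough sketch; the exact yielded fields vary by GDS version):

```cypher
CALL gds.graph.list()
YIELD graphName, nodeCount, relationshipCount
RETURN graphName, nodeCount, relationshipCount
```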

But we are going to create one now since they are the backbone of all of GDS. Give the graph a name and click "Create in-memory graph" (no spaces!). This is going to create a monopartite, undirected graph looking at which people in Game of Thrones interact with which other people, or (person1:Person)-[:INTERACTS]-(person2:Person) . When we do this, we will wind up with a graph that is 2166 nodes and 7814 relationships. Note that an undirected graph will double the number of relationships since it considers both orientations from person1 to person2 and person2 to person1. Spoiler alert: your embeddings will look different if you go with the natural orientation of the graph: (person1:Person)-[:INTERACTS]->(person2:Person) .
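For reference, the Cypher behind a button like this would look roughly like the following sketch (this uses the GDS 1.x `gds.graph.create` syntax; newer releases renamed the procedure `gds.graph.project`, and the graph name here is a placeholder):

```cypher
CALL gds.graph.create(
  'people',                      // name you entered in the dashboard
  'Person',                      // node projection
  {
    INTERACTS: {
      orientation: 'UNDIRECTED'  // doubles the relationship count
    }
  }
)
YIELD graphName, nodeCount, relationshipCount
```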

Alright, now it is time to get to work and create some embeddings. As of the writing of this post, I have implemented two of the easier embeddings built into GDS, namely FastRP and node2vec. If you read the API docs, there are a lot of hyperparameters that you can and should play with because, as we know, defaults tend to not give the best results. I have included only a subset for each approach, but will add more in the future. For FastRP, I have the following:
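As a rough sketch of what the dashboard is doing, streaming FastRP embeddings from the in-memory graph looks something like this (the graph name and hyperparameter values are arbitrary placeholders, not tuned choices):

```cypher
CALL gds.fastRP.stream('people', {
  embeddingDimension: 128,
  iterationWeights: [0.0, 1.0, 1.0],
  normalizationStrength: 0.05
})
YIELD nodeId, embedding
RETURN gds.util.asNode(nodeId).name AS name, embedding
```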

FastRP hyperparameters included in the dashboard. (Image by author.)

You can also click on the drop-down for node2vec and see what is tunable in there. I highly recommend consulting the API docs for each embedding method to get more information on what each hyperparameter means, as describing each of them in detail is beyond the scope of this post (although in future posts on embedding tuning we will get into the weeds on that!).

So you can create both FastRP and node2vec embeddings. Now we want to visualize them. But toward what goal? Let’s see if we can predict which characters are alive or dead at this point in the story. This is a very basic node classification problem, and it is a great starting point since it is a supervised learning problem. In this data, I have labeled each Person node as 1 if they are alive and 0 if they are dead.

We will use the t-Distributed Stochastic Neighbor Embedding (t-SNE) available in [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) to perform a dimensionality reduction for the purposes of visualization in 2-dimensional space. (You could use any dimensionality reduction approach here, such as PCA. My choice of t-SNE is arbitrary.)
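As a minimal sketch of the reduction step, with random vectors standing in for the real embeddings you would pull from Neo4j:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the (n_nodes, d) embedding matrix pulled from Neo4j;
# here we fake it with random data purely for illustration.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(100, 64))  # e.g. 100 nodes, 64-dim vectors

# Reduce to 2 dimensions for plotting.
# Note: perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=15, random_state=42)
coords = tsne.fit_transform(embeddings)

print(coords.shape)  # (100, 2) — one (x, y) point per node, ready to scatter-plot
```

Each row of `coords` can then be colored by the node's alive/dead label to produce the scatter plots shown below.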

I am going to pick some random values and generate some node2vec embeddings as shown here:

Demonstration hyperparameters for my node2vec embeddings. (Image by author.)

Next, I am going to visualize these embeddings with the t-SNE tab. When I do that, I get this:

2D t-SNE vectors based on above node2vec embeddings. (Image by author.)

Oh no! This is horrible! The red data points are the dead people and the blue ones are the living. We would hope that our red and blue points would cluster much better than this! I leave it as an exercise to the reader to tinker with these values and see if they can do better. (Trust me, you can!)

Next steps

There are some things to consider here. First, this is a very small graph, so truly optimizing the embeddings is going to be hard in general. Second, if we really want to optimize them, we need to do more than look at a pretty 2D picture of embeddings that are of much higher dimension. We might use a tool like this to gain intuition about which hyperparameters matter most for our graph and then use those in a grid search within an ML model, optimizing for our relevant metrics. So I encourage you to tinker with these hyperparameters and see what happens.

In future posts I plan to use graphs that are much larger where we can hopefully get some better embedding results and put them through their paces in ML models.

