Visualising Embeddings with t-SNE

Nadine Amersi-Belton
Published in Analytics Vidhya · 4 min read · Nov 2, 2020
In this blog post, we will visualise embeddings of video games, based on data from Steam.


We will start with the following DataFrame relating to Steam computer games. Here uid represents a unique user id and id represents a unique game id. We have over 4 million rows, each representing a uid/id relationship (namely, user uid owning game id). To see the preprocessing steps which led to this stage, please see the full project on my GitHub.

We then transform this DataFrame to have each game id as a column and the users as rows, creating an interactions matrix which we can then feed into the model.
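As a sketch of this transformation (on hypothetical toy data, since the real DataFrame has over 4 million rows), one way to build such an interactions matrix is with pandas' crosstab:

```python
import pandas as pd

# Hypothetical toy ownership data in the same shape as the Steam DataFrame:
# one row per (uid, id) pair, meaning user uid owns game id.
df = pd.DataFrame({
    'uid': [1, 1, 2, 3, 3, 3],
    'id':  [10, 20, 10, 10, 20, 30],
})

# Pivot so each game id becomes a column and each user a row;
# an entry is 1 if that user owns that game, 0 otherwise.
interactions = pd.crosstab(df['uid'], df['id'])
print(interactions)
```

This is just one way to do the pivot; the actual preprocessing in the project may differ.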

We will use the LightFM library to generate recommendations and retrieve embeddings. The documentation goes into detail as to how the model works. We can choose the number of components that the model should learn (i.e. the dimensionality of the latent space). We varied this hyperparameter and found it had little effect on the overall model performance and so left it at 30.

We then extract the embeddings, as follows.

# Get embeddings
embeddings = mf_model.item_embeddings

Let’s investigate a sample embedding.

To retrieve the name of the game, we first look up the game id using our interactions matrix and then obtain the name using a dictionary we previously created, which maps id to title name.

firstgameid = interactions.columns[0]
games_dict[firstgameid]

We will now use the t-SNE algorithm to visualise the embeddings, going from a 30-dimensional space (our number of components) down to a 2-dimensional space. t-SNE is a dimensionality-reduction technique: it learns a mapping from a high-dimensional vector space to a lower-dimensional one such that if two vectors u and v are close to each other in the original space, their images in the lower-dimensional space will also be close to each other.

We import t-SNE and instantiate it.

from sklearn.manifold import TSNE

# Instantiate t-SNE, specify cosine metric
tsne = TSNE(random_state=0, n_iter=1000, metric='cosine')

We then call fit and transform on the embeddings matrix.

# Fit and transform
embeddings2d = tsne.fit_transform(embeddings)

Finally, to make plotting easier, we build a pandas DataFrame with the results.

# Create DF
embeddingsdf = pd.DataFrame()
# Add game names
embeddingsdf['game'] = gameslist
# Add x coordinate
embeddingsdf['x'] = embeddings2d[:,0]
# Add y coordinate
embeddingsdf['y'] = embeddings2d[:,1]
# Check
embeddingsdf.head()

Finally, we plot our 2-dimensional game space.

# Set figsize
fig, ax = plt.subplots(figsize=(10,8))
# Scatter points, set alpha low to make points translucent
ax.scatter(embeddingsdf.x, embeddingsdf.y, alpha=.1)
plt.title('Scatter plot of games using t-SNE')
plt.show()

We can check that the plot ‘worked’ by seeing where games we believe to be similar end up. Let’s see where the RollerCoaster Tycoon games are on the plot.

match = embeddingsdf[embeddingsdf.game.str.contains('RollerCoaster')]

As we can see, RollerCoaster Tycoon World is ‘different’ from the others, in the sense that users reacted differently to it.

Here they are on the plot:
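A sketch of how the matched titles might be highlighted on the scatter plot (the toy coordinates below are hypothetical stand-ins for the real embeddingsdf):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy 2-D embedding table in the same shape as embeddingsdf.
embeddingsdf = pd.DataFrame({
    'game': ['RollerCoaster Tycoon', 'RollerCoaster Tycoon World', 'Dota 2'],
    'x': [1.0, -5.0, 3.0],
    'y': [1.2, -4.0, 0.5],
})

# Filter the RollerCoaster Tycoon titles, as in the post.
match = embeddingsdf[embeddingsdf.game.str.contains('RollerCoaster')]

# Plot all games faintly, then overlay and label the matches.
fig, ax = plt.subplots(figsize=(10, 8))
ax.scatter(embeddingsdf.x, embeddingsdf.y, alpha=.1)
ax.scatter(match.x, match.y, color='red')
for _, row in match.iterrows():
    ax.annotate(row.game, (row.x, row.y))
```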

This insight is useful for stakeholders, as without the data and the embeddings one might assume that all RollerCoaster Tycoon games are similar.

From additional research, it is apparent that RollerCoaster Tycoon World flopped, with poor reviews from both general users and critics, who complained of the graphics and bugs.

With this knowledge, when recommending similar items to one RollerCoaster Tycoon game, we would advise avoiding RollerCoaster Tycoon World.

As a final thought, we notice a pocket of games in the top middle, which is interesting. What makes these games different to others? To be continued…
