Self-Organizing Maps - 90s Fad or Useful Tool? (Part 1)

In this two-part post, I'll explain how self-organizing maps (SOMs) work.  This first part covers the ideas behind the technique.  If you're impatient and just want to get to the implementation, skip to part 2.

A few years ago I was having a discussion with a computational chemistry colleague and the topic of self-organizing maps (SOMs) came up.  My colleague remarked, "Weren't SOMs one of those 90s fads, kind of like Furbys?"  While there were a lot of publications on SOMs in the early 1990s, I would argue that SOMs continue to be a useful and somewhat underappreciated technique.

What Problem Are We Trying to Solve?

In many situations in drug discovery, we want to be able to arrange a set of molecules in some sort of logical order.  This can be useful in a number of cases.
  1. Clustering.  Sometimes we want to be able to put a set of molecules into groups and select representatives from each group.  This may be the case when we only have resources to screen a subset of the compounds we have available.  In this case, the SOM performs a function similar to what we get with k-means clustering.  One advantage of the SOM is that it can provide a visual representation of the relationships between clusters. 
  2. Understanding the results of high-throughput (or other) screens.  When we run a screen, one of our first objectives is to examine the results and identify potential SAR.   One way to do this is to create a projection where similar compounds are in similar parts of space.  As we will see in subsequent sections, SOMs can provide an overview of how hits are distributed in chemical space. 
  3. Comparing compound sets.  Let's say that we have a database of compounds for fragment screening.  If we have the opportunity to add new compounds to our fragment database, we would first like to see how similar the new compounds are to our current collection.  The SOM provides a convenient method of visualizing which parts of chemical space are covered by each collection. 
How Does the SOM Work?

In order to explain SOMs we need to take a step back and take a look at what we can do with a set of colors.  Let's say that we have a random set of colors like the one shown on the left and we want to arrange these colors logically like the ones on the right.



As we all probably know, colors can be represented as a vector of three integers between 0 and 255, with the values representing the red, green, and blue components of the color.


Since we have colors represented as vectors, we can calculate the Euclidean distance between these vectors to identify which colors are most similar.   In the example below, we calculate the distance between a vector representing the color red and vectors representing other colors.   As we can see in the figure below, red is closest to purple and farthest from green. 
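To make this concrete, here is a minimal sketch of the distance calculation with NumPy.  The specific RGB values chosen for red, purple, and green are illustrative, not canonical:

```python
import numpy as np

# Illustrative RGB vectors (each component in 0-255).
red = np.array([255, 0, 0])
purple = np.array([128, 0, 128])
green = np.array([0, 255, 0])

# Euclidean distance between color vectors.
dist_purple = np.linalg.norm(red - purple)  # ~180.3
dist_green = np.linalg.norm(red - green)    # ~360.6
```

With these values, red is indeed closer to purple than to green.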


Let’s say that we want to map colors into a 10x10 grid.  We can start by creating a 10x10 array of vectors.  Each vector consists of 3 elements, each of which is assigned a random value between 0 and 255.  
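A sketch of this initialization step with NumPy; the random seed is an arbitrary choice made for reproducibility:

```python
import numpy as np

rng = np.random.default_rng(0)
# 10x10 grid of cells, each holding a random 3-element (RGB) vector.
grid = rng.integers(0, 256, size=(10, 10, 3)).astype(float)
```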


Now that we have a grid initialized with random vectors,  we will select a random color.  In this case, we'll select blue. 




We then calculate the Euclidean distance between our selected vector (0,0,255) and each of the random vectors we created above, and find the cell whose vector is closest to blue.  We refer to this cell as the best matching unit (BMU).
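The BMU search can be sketched as follows, with the grid initialized as above:

```python
import numpy as np

rng = np.random.default_rng(0)
grid = rng.integers(0, 256, size=(10, 10, 3)).astype(float)
blue = np.array([0.0, 0.0, 255.0])

# Euclidean distance from blue to every cell's vector...
dists = np.linalg.norm(grid - blue, axis=2)            # shape (10, 10)
# ...and the grid coordinates of the closest cell: the BMU.
bmu = np.unravel_index(np.argmin(dists), dists.shape)
```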



We then add some "blue" (the vector (0,0,255)) to the BMU and the cells surrounding it.  An exponentially decaying function adjusts the amount of "blue" added: cells adjacent to the BMU receive a large amount, and the amount decreases as we get farther from the BMU.
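A sketch of this neighborhood update.  Moving each cell's vector part of the way toward blue is the "adding blue" step; the Gaussian form of the decay and the `lr` and `sigma` values are common choices, not the only ones:

```python
import numpy as np

rng = np.random.default_rng(0)
grid = rng.integers(0, 256, size=(10, 10, 3)).astype(float)
blue = np.array([0.0, 0.0, 255.0])

# Find the BMU as before.
dists = np.linalg.norm(grid - blue, axis=2)
bmu = np.unravel_index(np.argmin(dists), dists.shape)

# Distance of every cell from the BMU on the grid (not in color space).
rows, cols = np.indices((10, 10))
gd2 = (rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2

# Exponentially decaying (Gaussian) neighborhood: the BMU gets the full
# update, and cells farther away on the grid get progressively less.
lr, sigma = 0.5, 2.0                       # arbitrary illustrative values
influence = np.exp(-gd2 / (2 * sigma ** 2))
before = grid.copy()                       # kept only to inspect the change
grid += lr * influence[:, :, None] * (blue - grid)
```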

We then repeat this process of selecting a random color, identifying the closest cell to that color, and adding the selected color to the adjacent cells.  Repeating this process over thousands of cycles allows us to arrange a set of vectors so that similar vectors are close together and different vectors are far apart.  The video below shows how the SOM moves from a random set of colors to a set that is organized so that similar colors are close together. 
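Putting the pieces together, a minimal training loop might look like the sketch below.  The number of cycles and the decay schedules for the learning rate and neighborhood width are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
grid = rng.integers(0, 256, size=(10, 10, 3)).astype(float)
rows, cols = np.indices((10, 10))
n_steps = 5000

for step in range(n_steps):
    frac = 1 - step / n_steps                # decays from 1 toward 0
    lr = 0.5 * frac                          # learning-rate schedule
    sigma = 2.0 * frac + 0.1                 # neighborhood-width schedule

    # Select a random color and find its BMU.
    color = rng.integers(0, 256, size=3).astype(float)
    d = np.linalg.norm(grid - color, axis=2)
    bmu = np.unravel_index(np.argmin(d), d.shape)

    # Pull the BMU and its neighbors toward the selected color.
    gd2 = (rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2
    influence = np.exp(-gd2 / (2 * sigma ** 2))
    grid += lr * influence[:, :, None] * (color - grid)
```

After training, adjacent cells hold similar vectors, which is exactly the organized color map shown in the video.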



Remember that we are not really dealing with colors; we are simply adjusting a set of vectors to create a set of basis vectors.  Any new vector can then be placed into the appropriate cluster by calculating the Euclidean distance from that vector to each basis vector.  The cell with the smallest Euclidean distance is the BMU.

Applying the SOM to Molecules

At this point, you have probably figured out that we can use the same techniques with molecules that we just showed with colors. We can use any of a variety of methods to represent a molecule as a bit vector, where each bit represents the presence or absence of a particular substructure. 
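As a toy illustration of the idea, the sketch below hashes short substrings of a SMILES string into a fixed-length presence/absence bit vector.  This `toy_fingerprint` helper is hypothetical; real work would use an established fingerprint such as RDKit's Morgan fingerprints:

```python
# Toy stand-in for a structural fingerprint (not a real cheminformatics
# method): hash short substrings of a SMILES string into bit positions.
def toy_fingerprint(smiles, n_bits=64):
    bits = [0] * n_bits
    for size in (1, 2, 3):                        # substring lengths to hash
        for i in range(len(smiles) - size + 1):
            bits[hash(smiles[i:i + size]) % n_bits] = 1   # presence/absence
    return bits

fp = toy_fingerprint("CCO")   # ethanol, written as SMILES
```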


If we're building a SOM for molecules, we begin by initializing the SOM with random vectors.  We then select a random molecule from our dataset, calculate its fingerprint, and treat this fingerprint as a vector.  Once the fingerprint has been generated, we calculate the Euclidean distance from this vector to each of our random vectors.  The random vector with the smallest Euclidean distance is selected as the BMU, and the weights of the surrounding cells are adjusted as shown above.  This process is repeated several thousand times to train the SOM.  After training, we have a set of basis vectors that can be used to cluster new molecules.  One of the great aspects of this technique is that we can train the SOM on a subset of the data.  If we have a set of 1 million molecules, we can typically train the SOM on a smaller subset (perhaps 100K) of molecules.  We can then use the basis vectors generated during training to cluster the entire 1 million molecule database.
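The train-on-a-subset workflow can be sketched end to end.  The random binary "fingerprints", grid size, and decay schedules below are placeholder choices standing in for real molecules and tuned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: 1,000 "molecules" as random 64-bit fingerprints.
fps = (rng.random((1000, 64)) < 0.2).astype(float)

side, n_bits, n_steps = 5, 64, 2000
grid = rng.random((side, side, n_bits))            # random basis vectors
rows, cols = np.indices((side, side))

# Train on the first 200 "molecules" only.
for step in range(n_steps):
    frac = 1 - step / n_steps
    lr, sigma = 0.5 * frac, 1.5 * frac + 0.1
    x = fps[rng.integers(0, 200)]
    d = np.linalg.norm(grid - x, axis=2)
    bmu = np.unravel_index(np.argmin(d), d.shape)
    gd2 = (rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2
    grid += lr * np.exp(-gd2 / (2 * sigma ** 2))[:, :, None] * (x - grid)

# Use the trained basis vectors to assign every molecule, including the
# 800 never seen during training, to its closest cell.
flat = grid.reshape(-1, n_bits)                    # (25, 64)
labels = np.linalg.norm(fps[:, None, :] - flat[None, :, :], axis=2).argmin(axis=1)
```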

That's it for the preliminaries.  If you'd like to see how this works in practice, please check out part 2 of this post.


Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.
