Self-Organizing Maps - The Code (Part 2)

In this post, we will look at examples of how two different open source Python libraries can be used to generate self-organizing maps (SOMs). The MiniSom library is great for building SOMs for smaller sets of roughly 10K molecules or fewer. The Somoclu library can use either a GPU or multiple CPU cores to generate a SOM, so it's well suited to larger libraries. While Somoclu is a lot faster than MiniSom, installation on non-Linux platforms can require a bit of extra work.

I've provided example use cases for both libraries as Jupyter notebooks.  Hopefully, this will make it easier for readers to experiment with these methods.

MiniSom

The MiniSom library is great for generating SOMs for smaller datasets consisting of thousands to tens of thousands of molecules. I found it easy to install on both Mac and Linux. The MiniSom example notebook can be found here on GitHub.

Here's some benchmarking data using MiniSom. In the plot below we compare the time required to generate a SOM for 1,000, 5,000, and 10,000 molecules using 166-bit MACCS keys and 1024-bit Morgan fingerprints. We set the number of cells in the SOM using the heuristic defined in the MiniSom docs.

number of cells = 5 * sqrt(number of molecules)

In addition, we set the maximum number of training cycles to 10*number of molecules.  The benchmarks were all run on my 2017 MacBook Pro. 
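As a quick worked example, the two settings above can be computed as follows. The function name `som_dimensions` is made up for this sketch, and laying the cells out on a square grid with the side length rounded up is my assumption. The heuristic itself only gives the total number of cells.

```python
import math


def som_dimensions(num_molecules):
    """Return (grid_side, target_cells, max_iterations) for a dataset.

    target_cells follows the heuristic 5 * sqrt(number of molecules).
    grid_side assumes a square grid, rounded up so that
    grid_side * grid_side >= target_cells (an assumption for this sketch).
    max_iterations is 10 * number of molecules.
    """
    target_cells = 5 * math.sqrt(num_molecules)
    grid_side = math.ceil(math.sqrt(target_cells))
    max_iterations = 10 * num_molecules
    return grid_side, target_cells, max_iterations


for n in (1000, 5000, 10000):
    side, target, iterations = som_dimensions(n)
    print(f"{n} molecules: {side} x {side} grid "
          f"(target {target:.0f} cells), {iterations} iterations")
```

Rounding up means the grid has slightly more cells than the heuristic asks for, which seems a safer default than ending up with fewer.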

Somoclu

The Somoclu library can use a GPU or multiple CPU cores to accelerate the generation of the SOM. This library is a great choice for Linux systems and can be installed with a simple pip install. Unfortunately, since the default compilers on Macs don't support OpenMP, installation on non-Linux platforms can be more difficult. The Somoclu example notebook can be found here on GitHub.

Here's some additional benchmarking data showing the time required to generate a SOM with Somoclu. Since the Somoclu library can take advantage of multiple cores or a GPU, we benchmark on a larger dataset with 100,000 molecules. All of the benchmarks were run on a p2.xlarge instance on AWS (1 NVIDIA K80, 4 vCPUs). Note that, at the time of writing, an instance like this costs only $0.90/hr. It's interesting that the GPU implementation is consistently slower than the multicore CPU implementation.

A Couple of Useful Functions

The Jupyter notebooks linked above include a function which plots the SOM as a series of pie charts showing the distribution of active and inactive compounds in each cell (see below).  I've found plots like this useful when examining the results of HTS and other screens.  These plots provide a quick overview of how the hits are distributed over the chemical space of the compounds that were screened.


In the plot above, the hits are shown in blue, while the inactive compounds are shown in orange.  We can see that the majority of the hits are in three cells, while a few other cells contain smaller fractions of hits.  The SOM not only enables us to quickly locate clusters of hits but also lets us find cells which contain a mixture of active and inactive compounds.  These cells can be the most interesting since they often point out small changes in chemical structure that are responsible for larger changes in biological activity. 
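The data behind a plot like this is just a count of active and inactive compounds per cell. Here's a minimal sketch of that bookkeeping. The function names and the toy `cells` and `is_active` lists are made up for illustration, and this is not the plotting function from the notebooks.

```python
from collections import Counter, defaultdict


def activity_by_cell(cells, is_active):
    """Count active and inactive compounds in each SOM cell."""
    counts = defaultdict(Counter)
    for cell, active in zip(cells, is_active):
        counts[cell]["active" if active else "inactive"] += 1
    return counts


def mixed_cells(counts):
    """Return the cells holding both active and inactive compounds."""
    return sorted(cell for cell, c in counts.items()
                  if c["active"] > 0 and c["inactive"] > 0)


# Toy data (assumption): one (x, y) cell and one activity flag per molecule.
cells = [(0, 0), (0, 0), (0, 1), (2, 3), (2, 3), (2, 3)]
is_active = [True, False, False, True, True, True]

counts = activity_by_cell(cells, is_active)
print(mixed_cells(counts))
```

The two counts for a cell are exactly what a pie chart call such as matplotlib's `Axes.pie` would need, and `mixed_cells` picks out the cells worth a closer look.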

I've also included a function that shows how to examine the contents of a particular SOM cell by generating a grid showing the structures of the molecules contained in that cell. Ideally, we'd like to have a function that will allow us to click on a cell and see a grid with the molecules in that cell. I typically do this with Vortex from Dotmatics. It shouldn't be that hard to put together a web app to do this, but that's something we'll leave for another day.
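The lookup step is simple enough to sketch without any chemistry toolkit. In the example below, the `group_by_cell` helper and the toy SMILES strings are assumptions for illustration, and this is not the function from the notebooks. The list returned for a cell could then be converted to molecules and drawn as a grid, for instance with RDKit's `Draw.MolsToGridImage`.

```python
from collections import defaultdict


def group_by_cell(cells, smiles_list):
    """Map each SOM cell to the SMILES of the molecules assigned to it."""
    members = defaultdict(list)
    for cell, smiles in zip(cells, smiles_list):
        members[cell].append(smiles)
    return dict(members)


# Toy data (assumption): one (x, y) cell per molecule, in the same order
# as the SMILES list.
cells = [(0, 0), (1, 2), (0, 0), (1, 2), (3, 1)]
smiles_list = ["CCO", "c1ccccc1", "CCN", "c1ccccc1O", "CC(=O)O"]

members = group_by_cell(cells, smiles_list)

# Molecules in one cell of interest; an empty list if the cell is unoccupied.
selected = members.get((1, 2), [])
print(selected)
```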


Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.
