What Do Molecules That Look Like This Tend To Do?

In this post, we'll take a look at how we can use an open-source Python library to search the ChEMBL database and investigate the biology associated with compounds similar to a screening hit.  The approach we'll discuss is easy to set up and doesn't require any database installation or configuration.  As usual, the code associated with this post is available on GitHub.


Introduction
A question that invariably comes up when examining screening hits is "what do molecules that look like this tend to do?".  This question can come up in a couple of contexts.

  • You've run a target-based screen, found a compound that's active in a functional assay, and you'd like to identify other targets that the compound could hit.  This might provide a pointer to selectivity assays that could/should be run. 
  • You've run a phenotypic screen, and you'd like some hints on targets that could be responsible for the observed activity. 
One approach to answering these questions is to perform a similarity search in a database containing chemical structures and associated biological activities.  One such collection is the ChEMBL database maintained by the European Bioinformatics Institute (EBI).  ChEMBL currently contains almost 2 million compounds, with more than 16 million associated biological activities.  By searching ChEMBL, we can find the activities of compounds similar to our compound of interest. 

However, it is important to remember that there are caveats associated with this approach.  
  • The search will only be as good as the data in the database.  If the database doesn't contain molecules similar to our compound of interest, or relevant data on those compounds, this approach won't help. 
  • Chemical similarity is an imperfect concept; similar compounds don't guarantee similar biological activity. 

Searching ChEMBL
There are several ways of performing similarity searches. One of the easiest and fastest with a database like ChEMBL is to use the FPSim2 package, developed by the group at the EBI that maintains the ChEMBL database.  FPSim2 is well documented and can be installed with a simple "conda install" command.   Once installed, FPSim2 provides the ability to create and search a compressed fingerprint database.  

Of course, we don't simply want to search ChEMBL for similar compounds. We also want to see biological activities associated with these compounds.  To do this, we'll write a Python script that first uses FPSim2 to find similar compounds, then uses this SQL query to pull out the associated data. 

select canonical_smiles, cs.molregno, standard_type, standard_value, standard_units, doi, a.description 
from compound_structures cs
join activities act on  cs.molregno = act.molregno
join docs d on act.doc_id = d.doc_id
join assays a on act.assay_id = a.assay_id
where cs.molregno = {molregno}
and act.standard_relation = '='
and act.standard_type in ('IC50', 'Ki', 'EC50')

This query is pretty simple; it joins the tables with chemical structures, biological activities, and assay descriptions.   We only look for IC50, Ki, and EC50 data, and we don't consider data with an associated operator (e.g. >30).  This may not exactly meet your requirements, but it should be easy enough to change.  
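As a rough sketch of how a query like this could be run from Python, the snippet below uses the standard-library sqlite3 module.  The helper names (fetch_activities, build_toy_db) and the toy in-memory tables are assumptions made for illustration; the tables hold only the columns the query touches and a few made-up rows, not the real ChEMBL schema.  The molregno is passed as a bound parameter (?) instead of being formatted into the string.

```python
import sqlite3

# Same query as above, with a bound parameter in place of {molregno}
ACTIVITY_SQL = """
select canonical_smiles, cs.molregno, standard_type, standard_value,
       standard_units, doi, a.description
from compound_structures cs
join activities act on cs.molregno = act.molregno
join docs d on act.doc_id = d.doc_id
join assays a on act.assay_id = a.assay_id
where cs.molregno = ?
and act.standard_relation = '='
and act.standard_type in ('IC50', 'Ki', 'EC50')
"""


def fetch_activities(conn, molregno):
    """Return the activity rows for one molregno (hypothetical helper)."""
    return conn.execute(ACTIVITY_SQL, (molregno,)).fetchall()


def build_toy_db():
    """Build a tiny in-memory stand-in for ChEMBL with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table compound_structures (molregno integer, canonical_smiles text);
        create table docs (doc_id integer, doi text);
        create table assays (assay_id integer, description text);
        create table activities (molregno integer, doc_id integer, assay_id integer,
                                 standard_relation text, standard_type text,
                                 standard_value real, standard_units text);
        insert into compound_structures values (1, 'CCO');
        insert into docs values (10, '10.0000/example');
        insert into assays values (100, 'Toy inhibition assay');
        insert into activities values (1, 10, 100, '=', 'IC50', 25.0, 'nM');
        insert into activities values (1, 10, 100, '>', 'IC50', 30000.0, 'nM');
        insert into activities values (1, 10, 100, '=', 'Kd', 5.0, 'nM');
    """)
    return conn


# Only the first activity passes both filters: the second has a '>' operator,
# and the third has a result type (Kd) that is not in the list.
rows = fetch_activities(build_toy_db(), 1)
print(rows)
```

Against a real ChEMBL file, the same fetch_activities call should work with a connection opened on the downloaded database instead of the toy one.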

Databases For People Who Aren't Database Geeks
One way to use the ChEMBL database is to install it into a relational database like MySQL or PostgreSQL.  These databases are powerful tools, but they can be intimidating for those who are not experienced database geeks.  To use these databases, one has to go through a relatively complex installation procedure that involves setting up users, permissions, and security.  It's not that hard, but it's a lot of work if you just want a database for your own use.  An alternative is to use SQLite, a very fast, file-based database that's incredibly easy to use.   As an added benefit, the SQLite database interface is part of the standard Python distribution, so there's nothing extra to install.  Fortunately for us, the ChEMBL group has an SQLite version of the database available, so all we have to do is download it and use it.  
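To give a feel for how little code the built-in sqlite3 module needs, here is a minimal sketch that opens an SQLite file and reports which of the tables used by the query above are missing.  The function name missing_tables and the REQUIRED_TABLES set are assumptions for illustration, not part of the scripts in the repo.

```python
import sqlite3

# Tables referenced by the SQL query in this post
REQUIRED_TABLES = {"compound_structures", "activities", "docs", "assays"}


def missing_tables(db_path, required=REQUIRED_TABLES):
    """Return the required table names that are absent from an SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute("select name from sqlite_master where type = 'table'")
        found = {row[0] for row in cursor}
    finally:
        conn.close()
    return set(required) - found


# A fresh in-memory database has no tables, so everything is reported missing
print(sorted(missing_tables(":memory:")))
```

Calling missing_tables("chembl_27.db") on a downloaded file should return an empty set if the download is intact.  Note that sqlite3.connect will create an empty file if the path doesn't exist, so a typo in the filename shows up as all four tables missing.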

Searching ChEMBL For Similar Compounds
The GitHub repo associated with this post contains two scripts that enable similarity searching in ChEMBL.  Once you have downloaded the SQLite version of ChEMBL, you need to do two things.  

1. Generate fingerprints for the database.  This only needs to be done once.  We do this using the script create_fpsim2_db.py.  To use this script, we simply supply the name of the ChEMBL database file and the name of the fingerprint file that we're creating.  The fingerprints are created as an HDF5 file, so I usually use an h5 extension.  This isn't necessary, but it helps me remember what the file is.

create_fpsim2_db.py chembl_27.db chembl_27.h5

2. Search the database with the script chembl_sim_search.py.   This script takes a SMILES file with the query molecules as input.  It expects each line to contain two tokens, the SMILES, and a molecule name. 

Nc1nccc(n1)-c1c(ncn1CC1CC1)-c1ccc(F)cc1 1bmk
O=C\2c1c(cccc1)N/C=C/2C(=O)Nc3cc(O)c(cc3C(C)(C)C)C(C)(C)C ivacaftor
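Reading a file in this format takes only a few lines.  The sketch below is one way to do it and isn't necessarily how chembl_sim_search.py handles its input; the function name parse_smiles_lines and the strict two-token check are assumptions.  Note the raw strings in the example: SMILES can contain backslashes, which Python would otherwise treat as escape characters.

```python
def parse_smiles_lines(lines):
    """Yield (smiles, name) pairs from lines of the form 'SMILES name'."""
    for line_number, line in enumerate(lines, start=1):
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        if len(tokens) != 2:
            raise ValueError(
                f"line {line_number}: expected 'SMILES name', got {line!r}"
            )
        yield tokens[0], tokens[1]


# The two query molecules shown above
example = [
    r"Nc1nccc(n1)-c1c(ncn1CC1CC1)-c1ccc(F)cc1 1bmk",
    r"O=C\2c1c(cccc1)N/C=C/2C(=O)Nc3cc(O)c(cc3C(C)(C)C)C(C)(C)C ivacaftor",
]
queries = list(parse_smiles_lines(example))
for smiles, name in queries:
    print(name, smiles)
```

Because an open file is an iterable of lines, the same function can be called as parse_smiles_lines(open("query.smi")).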

The script takes two required and one optional argument. 

Usage: chembl_sim.py --query QUERY_SMI --out OUT_CSV [--sim SIM_CUTOFF]

--help print this help message
--query QUERY_SMI query SMILES file (takes the form SMILES name)
--out OUT_CSV output csv file
--sim SIM_CUTOFF similarity cutoff, default=0.7
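For readers who want to build a similar command-line interface, the options above could be reproduced with the standard-library argparse module.  This is only a sketch under that assumption; the script in the repo may parse its arguments differently, and build_parser is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser with the same three options as the usage message."""
    parser = argparse.ArgumentParser(
        description="Similarity search with activity lookup (sketch)"
    )
    parser.add_argument("--query", required=True, metavar="QUERY_SMI",
                        help="query SMILES file (takes the form SMILES name)")
    parser.add_argument("--out", required=True, metavar="OUT_CSV",
                        help="output csv file")
    parser.add_argument("--sim", type=float, default=0.7, metavar="SIM_CUTOFF",
                        help="similarity cutoff, default=0.7")
    return parser


# Parse an explicit argument list instead of sys.argv to keep the demo self-contained
args = build_parser().parse_args(["--query", "query.smi", "--out", "out.csv"])
print(args.query, args.out, args.sim)
```

argparse supplies --help automatically, and declaring --sim with type=float means a non-numeric cutoff is rejected before any searching starts.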

As an example, we can perform a similarity search using the default cutoff with this command. 

chembl_sim_search.py --query query.smi --out out.csv

The searches are pretty quick.  I was able to search the two query SMILES above in 5.5 seconds. 

If we want to change the default Tanimoto cutoff, we can do this. 

chembl_sim_search.py --query query.smi --out out.csv --sim 0.6
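To make the cutoff concrete, here is a small sketch of the Tanimoto coefficient for two fingerprints represented as sets of "on" bit positions: the number of shared bits divided by the number of bits set in either fingerprint.  The bit sets below are made up for illustration, and a library like FPSim2 works on packed fingerprints far more efficiently than this.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two collections of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


# Made-up fingerprints: 4 bits in common, 6 bits set in total
query_bits = {1, 4, 7, 9, 12}
hit_bits = {1, 4, 7, 9, 15}
sim = tanimoto(query_bits, hit_bits)
print(f"{sim:.3f}")
print("passes 0.7 cutoff:", sim >= 0.7)
print("passes 0.6 cutoff:", sim >= 0.6)
```

In this toy case the pair falls just below the default cutoff of 0.7 but would be picked up at 0.6, which is the kind of hit that lowering the cutoff brings in.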

The output of this search is a csv file containing these columns. 
  • query_smiles - SMILES for the query
  • query_name - name for the query
  • query_sim - Tanimoto similarity of the query to the molecule in the database
  • canonical_smiles - SMILES for ChEMBL hit
  • molregno - ChEMBL registry number (internal compound identifier) for the hit
  • standard_type - assay result type (e.g. IC50, EC50, Ki)
  • standard_value - assay value
  • standard_units - assay unit
  • doi - DOI for the paper containing the assay result
  • description - assay description 
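
As a quick way to get an overview of a result file without extra software, the sketch below counts result rows per query and result type using only the standard library.  It assumes the csv file starts with a header row using the column names listed above; the function name count_hits_by_query and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_hits_by_query(csv_file):
    """Count result rows per (query_name, standard_type) pair."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[(row["query_name"], row["standard_type"])] += 1
    return counts


# Made-up rows with a subset of the columns listed above
sample = io.StringIO(
    "query_name,query_sim,molregno,standard_type,standard_value,standard_units\n"
    "1bmk,0.82,1,IC50,25.0,nM\n"
    "1bmk,0.75,2,IC50,140.0,nM\n"
    "1bmk,0.75,2,Ki,12.0,nM\n"
)
counts = count_hits_by_query(sample)
for (query_name, standard_type), n in sorted(counts.items()):
    print(query_name, standard_type, n)
```

With a real result file, the same function can be called on a file opened with open("out.csv", newline="").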

Conclusions
Searching ChEMBL can provide a quick means of investigating the biology associated with a set of screening hits.  While the ChEMBL database provides an interface for single searches, batch searching for several query molecules can be cumbersome, and could be better performed using a script like the one discussed here.  The FPSim2 library is only one means of performing this type of search.  Another approach, discussed in a recent post from Iwatobipen, is to use ChemicaLite, a chemical database cartridge for SQLite.  This appears to be more complicated to set up but allows both substructure and similarity searching.

One potential downside to the script described in this post is that it outputs a csv file that isn't easy to view or use without additional software.  In a future post, we'll build a viewer to look at our similarity search results. 






