Adding Chemical Structures to a Recent COVID-19 Drug Repurposing Dataset

In a preprint posted on bioRxiv on April 5, 2020, Franck Touret and coworkers from Aix Marseille Université published the results from a SARS-CoV-2 cellular assay of 1520 compounds from the Prestwick Chemical Library, a collection of off-patent marketed drugs. Unfortunately, the authors published the preprint without including the chemical structures of the compounds. Fortunately, Brian Cole used a couple of databases to associate structures with this screening data and posted the revised data, as well as the scripts he used to do the annotation, on GitHub.

This short post serves two purposes. 
  • It highlights data that may be useful to those working on treatments for SARS-CoV-2.  It may be possible to associate this screening data with recently released biophysical data and obtain mechanistic insights. 
  • The solution that Brian came up with is generally applicable and can be used in the many cases where data is published without associated chemical structures.
Brian's approach uses ChEMBL, a database of more than 2 million molecules and associated biological data, curated from the medicinal chemistry literature. Each of the drug names supplied in the supporting information for the preprint is used to look up the corresponding chemical structure in the ChEMBL database. ChEMBL can be downloaded in a variety of formats, or accessed through an application programming interface (API). For tasks like this, which require more than a thousand database queries, querying a locally downloaded copy of the database directly is preferable to making repeated calls to the API.

Brian's method begins with SQL to join the necessary database tables. 

SELECT
compound_structures.canonical_smiles AS compound_smiles,
molecule_dictionary.chembl_id AS compound_chembl_id,
molecule_dictionary.pref_name AS compound_name,
molecule_dictionary.max_phase AS drug_development_phase
FROM compound_structures
JOIN molecule_dictionary
ON compound_structures.molregno = molecule_dictionary.molregno

In this query, we are accessing two ChEMBL tables.  
  • compound_structures - contains the chemical structures of the database molecules in SMILES, molfile, and InChI formats.
  • molecule_dictionary - maps molecule identifiers between different tables
The function search_chembl_for_compound then extends the SQL above with the following clause:
WHERE molecule_dictionary.pref_name LIKE ? COLLATE NOCASE

This combined SQL then performs a case-insensitive search (specified by COLLATE NOCASE) on the joined tables to find drugs with a specific name (e.g. Desonide). The ? is a placeholder that is filled in with the drug name when the query is executed.
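To see how the pieces fit together, here is a minimal sketch using Python's built-in sqlite3 module. It is not Brian's actual code. The names search_by_name and make_toy_db, and the single-row toy tables, are assumptions made for illustration. A real run would connect to a downloaded ChEMBL SQLite file rather than an in-memory database.

```python
import sqlite3

# The join from the post, followed by the name filter that is appended to it.
BASE_QUERY = """
SELECT
    compound_structures.canonical_smiles AS compound_smiles,
    molecule_dictionary.chembl_id AS compound_chembl_id,
    molecule_dictionary.pref_name AS compound_name,
    molecule_dictionary.max_phase AS drug_development_phase
FROM compound_structures
JOIN molecule_dictionary
    ON compound_structures.molregno = molecule_dictionary.molregno
"""
NAME_FILTER = "WHERE molecule_dictionary.pref_name LIKE ? COLLATE NOCASE"


def search_by_name(conn, name):
    """Return every joined row whose preferred name matches `name`, ignoring case."""
    # The drug name is bound to the ? placeholder rather than pasted into the SQL.
    cursor = conn.execute(BASE_QUERY + NAME_FILTER, (name,))
    return cursor.fetchall()


def make_toy_db():
    """Build a tiny in-memory stand-in for ChEMBL (illustrative assumption only).

    Only the columns used by the query are created; the real tables have many more.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE molecule_dictionary (
            molregno INTEGER PRIMARY KEY,
            chembl_id TEXT,
            pref_name TEXT,
            max_phase INTEGER
        );
        CREATE TABLE compound_structures (
            molregno INTEGER PRIMARY KEY,
            canonical_smiles TEXT
        );
        """
    )
    conn.execute(
        "INSERT INTO molecule_dictionary VALUES (1, 'CHEMBL25', 'ASPIRIN', 4)"
    )
    conn.execute(
        "INSERT INTO compound_structures VALUES (1, 'CC(=O)Oc1ccccc1C(=O)O')"
    )
    return conn


if __name__ == "__main__":
    connection = make_toy_db()
    # Lowercase input still matches the uppercase preferred name.
    for row in search_by_name(connection, "aspirin"):
        print(row)
```

Binding the name through the placeholder avoids quoting problems with drug names that contain apostrophes or other punctuation. A name with no match simply returns an empty list, which makes it easy to collect the names that need a second lookup elsewhere.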

As Brian points out in his README file, ChEMBL was able to associate structures with 1504 of the 1520 compounds. To locate the chemical structures for the remaining 16 molecules, Brian used the NIH Chemical Identifier Resolver, a web service for converting names to chemical structures. Using this approach, he was able to assign structures to 14 of the 16 names not listed in the ChEMBL database. Brian's README describes the additional detective work necessary to complete the dataset.
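A fallback lookup of this kind can be sketched with nothing but the Python standard library. This is an illustration rather than Brian's implementation. The function names cir_url and name_to_smiles are hypothetical, and the URL pattern reflects the resolver's public name-to-SMILES endpoint as I understand it, so check it against the service's documentation before relying on it.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed URL pattern for the NIH Chemical Identifier Resolver (name -> SMILES).
CIR_URL = "https://cactus.nci.nih.gov/chemical/structure/{name}/smiles"


def cir_url(name):
    """Build the resolver URL for a compound name, percent-encoding the name."""
    return CIR_URL.format(name=quote(name, safe=""))


def name_to_smiles(name, timeout=10):
    """Return a SMILES string for `name`, or None if the lookup fails.

    Requires network access. Any HTTP error (for example, an unknown name),
    connection failure, or timeout is treated as "not found".
    """
    try:
        with urlopen(cir_url(name), timeout=timeout) as response:
            text = response.read().decode("utf-8").strip()
    except OSError:  # urllib's HTTPError and URLError both derive from OSError
        return None
    # Keep only the first line in case the service returns more than one result.
    return text.splitlines()[0] if text else None


if __name__ == "__main__":
    # Hypothetical usage: names that were not found in the local ChEMBL copy.
    for missing_name in ["Desonide"]:
        print(missing_name, name_to_smiles(missing_name))
```

Returning None rather than raising keeps the calling loop simple: names that still have no structure after this step are the ones that need manual detective work.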

In addition to the valuable public service provided by the data, Brian's code presents a very useful approach to associating chemical structures with the names of drugs or other molecules.  Thanks, Brian! 


