Showing posts from September, 2018

Assigning Bond Orders to PDB Ligands - The Easy Way

In this post, I'll walk through how we can combine a couple of Open Source software tools to easily and reliably assign bond orders to ligands from protein-ligand complexes in the PDB. As usual, the associated code is on GitHub. One of the many things that frustrate me about the PDB file format is the absence of bond order information (please don't talk to me about double CONECT records). Since the bond order information is missing, we typically have to assign the bond orders, either manually or algorithmically. Anyone who has tried to implement bond order perception from PDB files will tell you that it's a difficult problem. For a detailed explanation of what's necessary, take a look at this 2001 talk from Roger Sayle. The problem is compounded by the fact that the geometry of many of the ligands in the PDB is less than ideal. For a detailed explanation of the many issues with PDB ligand geometries, take a look at Greg Warren's work on creating t

Some Notes From the 2018 RDKit UGM

Last week I had the pleasure of attending the RDKit User Group meeting in Cambridge, UK. This was my first RDKit UGM, and it was great. I had the opportunity to catch up with a lot of people I hadn’t seen for a while and learned about a lot of exciting Open Source Cheminformatics. In this post, I’ve tried to summarize some of what took place and to present some links to relevant software and literature. This won’t be a complete recitation of everything that took place, but hopefully, it will provide an overview for those who’d like to dig deeper. I’ll link the slide decks as they become available. Please let me know if I’ve missed or misinterpreted anything.

Slides from the meeting are available on GitHub.

Wednesday, September 19th

Greg Landrum, KNIME/T5 Informatics, Welcome and Intro (slides)

Greg provided a bit of history of the RDKit as well as an intro to some of the newer features. C++ code has been modernized to C++14, greatly s

A Few Updates to Free-Wilson

Note: This post will probably only be interesting to those who are using the Free-Wilson package I released on GitHub.

I made a few updates to the Free-Wilson package.

1. Dramatically reduced the memory usage for the enumeration phase. Originally I was just saving the enumerated products to memory and writing them out at the end. Who knew that someone would want to enumerate 14 million products? The script now writes every 1000 structures to disk. There shouldn't be any memory issues, even with the largest enumerations.

2. The script now properly handles cases where the same substituent connects to multiple R-groups. This will be the case when a cycle connects two R-group positions. This isn't really valid for Free-Wilson, so I set the script up to simply skip cases like this. The script reports molecules that are skipped, so at least you'll have an indication of what happened.

3. The script has a new flag "--smarts" in the "rgroup" an
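The memory fix described in item 1 (writing every 1000 structures to disk instead of holding all products in memory) can be sketched roughly as below. This is a minimal illustration and not the code from the Free-Wilson package: `chunked`, `write_in_chunks`, and `enumerate_products` are hypothetical names, and the "products" here are just joined labels standing in for enumerated structures.

```python
import itertools
from typing import Iterable, Iterator, List


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(items)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk


def write_in_chunks(products: Iterable[str], path: str, chunk_size: int = 1000) -> int:
    """Write products to `path` one chunk at a time; return how many were written.

    Only `chunk_size` products are held in memory at any moment.
    """
    total = 0
    with open(path, "w") as handle:
        for chunk in chunked(products, chunk_size):
            handle.write("\n".join(chunk) + "\n")
            handle.flush()
            total += len(chunk)
    return total


def enumerate_products(r1: Iterable[str], r2: Iterable[str]) -> Iterator[str]:
    """Hypothetical stand-in for the enumeration: lazily yield one label per R-group pair."""
    for a, b in itertools.product(r1, r2):
        yield f"{a}.{b}"
```

Because `enumerate_products` is a generator, nothing is materialized until `write_in_chunks` asks for the next chunk, so peak memory depends on the chunk size and not on the total number of products.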

Predicting Aqueous Solubility - It's Harder Than It Looks

In this post, we're going to talk about aqueous solubility and how we can predict it. We'll explore a few predictive models and talk about the best ways to compare these models. We will compare ESOL, a simple empirical method that's been in existence for almost 15 years, with a few machine learning models generated with DeepChem. As usual, all of the code associated with this post is on GitHub.

There was a recent post on the RDKit mailing list where someone was looking for an implementation of the ESOL solubility prediction method, originally published by John Delaney in 2004. I've used ESOL in the past and found it somewhat useful. I didn't have an implementation handy, so I threw some code together and put an RDKit implementation on my GitHub site. Now that I have this code in hand, it gives me the opportunity to talk about a couple of topics that I think are underappreciated.

Selecting an appropriate test set

Comparing regression methods

A Bit o
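For readers who haven't seen ESOL, it is a four-term linear model, and the sketch below shows its general shape. The coefficients are quoted from my recollection of Delaney's 2004 paper and should be checked against it; they are not taken from the GitHub implementation mentioned above, which may use different values. The function name `esol_log_s` and its parameter names are hypothetical, and the descriptors (calculated logP, molecular weight, rotatable bond count, aromatic proportion) are passed in as plain numbers to keep the example free of dependencies. In practice they would be computed with a toolkit such as the RDKit.

```python
def esol_log_s(clogp: float, mol_wt: float, rotatable_bonds: int,
               aromatic_proportion: float) -> float:
    """Estimate log10 aqueous solubility (mol/L) with an ESOL-style linear model.

    Coefficients are assumed from Delaney (2004); verify before relying on them.
    aromatic_proportion is the fraction of heavy atoms that are aromatic (0 to 1).
    """
    return (0.16
            - 0.63 * clogp
            - 0.0062 * mol_wt
            + 0.066 * rotatable_bonds
            - 0.74 * aromatic_proportion)


# Illustrative descriptor values only, not a real molecule.
print(esol_log_s(clogp=2.0, mol_wt=100.0, rotatable_bonds=0, aromatic_proportion=1.0))
```

The signs carry the chemistry: under this model, higher lipophilicity, higher molecular weight, and a larger aromatic fraction all lower the predicted solubility, while additional rotatable bonds raise it slightly.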