Building on the Fragments From the Diamond/XChem SARS-CoV-2 Main Protease (MPro) Fragment Screen (Part II) Structure-Base Evaluation of Expanded Fragments

In my previous post, I showed how we could generate a bunch of ideas for expanding on one of the fragment hits from the screen of the SARS-CoV-2 Main Protease (MPro) conducted at the Diamond Light Source. In this post, we'll go through one method of rapidly evaluating these ideas.  In my previous posts, I've tried to use Open Source software wherever possible.  Here I'm going to make an exception and write all of the code using toolkits from OpenEye Scientific Software.  I'm doing this for a couple of reasons.

  • I've been using some of these toolkits for 20 years.  I know them well, and it makes it easy for me to write and test the methodology. 
  • I don't think that some of what I'm doing is possible, or at least easy to do, with Open Source software.  If someone has a similar Open Source workflow, please let me know.  I'd love to see it and try it. 
  • A lot of people in the community are using the OpenEye toolkits and can hopefully use, and improve upon, the workflow I've developed. 
I'll start out by explaining the workflow, then we'll go through an example using the scripts I put together.  This is by no means a finished product.  As I'll point out along the way, there are multiple areas where this workflow could be improved.  I'm hoping that this post will inspire others to build on the workflow and make it better.  

As I mentioned in my previous post, I'm using this workflow to generate ideas for how a fragment hit could be expanded.  I'm not expecting the software to generate the perfect molecule.  I just want a tool that will help me to identify interactions and ways that I can modify a molecule to make those interactions.  These ideas can then prompt me to do a number of things, including,
  • Designing a different, often simpler, molecule that will make the same interaction. 
  • Performing substructure or similarity searches for molecules that will make the same interactions.
  • Doing a 3D similarity search with ROCS or a similar tool to identify purchasable molecules that make the same interactions. 
I know people who can just look at a fragment bound in a pocket and immediately know how to modify it to make it more potent.  Unfortunately, I'm not one of those people.  I like to use approaches like the one described here to give me ideas for how to improve a molecule. 

The workflow here is pretty simple. We begin by generating a set of expanded fragments using an approach like the one outlined in my previous post.  To evaluate these expanded fragments, we follow a short series of steps for each molecule. 

As a first step, we generate a set of conformers for each expanded fragment (EF).  In this step, the base fragment (BF) is fixed and a conformational search is performed on the remainder of the molecule.  For example, let's assume that we are starting with BF below. 
Base Fragment (BF)
Now let's assume that we want to generate conformers for EF below holding BF fixed. 
Expanded Fragment (EF)

If we use the Omega toolkit and specify BF as the fixed portion BF is held fixed and conformations for the remainder of the molecule are explored.   In the figure below, we see the conformational ensemble generated in the first step of our workflow. 

The cool part of this is that, since we kept BF fixed, the conformers are already aligned in the binding site. 

Of course, as we can see above, most of these conformers don't actually fit in the binding site.  Many ligand atoms clash with, or penetrate, the protein surface. We need a way of rapidly determining which of these conformers actually fit. Our structure triage employs a multi-step approach, which starts with a rapid bump check.  In this step, we check each conformer to see how many ligand atoms are within 2A of a protein atom. This check is very fast, especially if we use the OENearestNbrs method which indexes the protein atoms and enables rapid searching. 

By default, we eliminate any conformer with a heavy atom that is within 2A of any protein heavy atom.  The conformers that pass this bump filter are then subjected to a rigid body minimization.  During this minimization, the protein and ligand conformations are held fixed and the ligand position is minimized.  After minimization, we calculate the root mean squared deviation (RMSD) between the crystallographic position of BF and the position of the corresponding atoms in EF. If the BF portion of EF moves more than a predefined cutoff (default 1A) as a result of the minimization, the structure is rejected.  Finally, the conformers that survive the bump check and RMSD check are scored and the conformer, score, and RMSD are stored.  

An Example 
Let's take a look at how this works in practice.  All of the code and data files I used in this example are available on GitHub. We begin with the script  This script generates conformers while holding a base fragment fixed. 

./ -h
Usage: --smi SMILES_FILE --fix FIX_FILE --out OUTPUT_FILE

--smi SMILES_FILE input SMILES file name
--fix FIX_FILE file_with_fixed piece of the molecule
--out OUTPUT_FILE output file name

The script takes three arguments, an input SMILES file, a file containing the fragment to be fixed, and an output file.  Using the example input in the data directory from the repo on GitHub we can run this command. 

./ --smi ../data/x0161.smi --fix ../data/x0161_lig.sdf --out ../data/x0161_confs.sdf

It takes about 20 seconds to run this command on my laptop.  You'll see a couple of warning messages because appropriate protonation states can't be assigned.  Don't worry about these, the script that generates the expanded fragments can sometimes output molecules that can't be processed by the OpenEye Quacpac toolkit.   These molecules are skipped.  The output of the command is an SD file with 34,072 confomers. 

Now we need to evaluate the conformers that we just generated.  In order to do this, we use the script This script evaluates each conformer in the context of the protein binding site.  

/ -h

--prot PROTEIN_FILE protein file
--lig LIGAND_FILE ligand file to be positioned in the active site
--ref REFERENCE_FILE reference ligand file for comparison
--out OUTPUT_FILE output file name
--bump BUMP_CUTOFF Number of allowed protein-ligand contacts within 2A (default 0)
--rms RMS_CUTOFF only keep structures with with RMS < CUTOFF after minimization (default 1.0 A)
--top TOP_OUT number of structures to write (default 100)

The first four arguments are the protein file, the ligand conformer file generated by, the reference file (used to cacluate RMS), and the output file.  The script also takes three optional arguments that allow the user to specify the bump cutoff, rms_cutoff, and number of structures to output.  As an example, we can run this command. 

./ --prot ../data/x0161_prot.pdb --lig ../data/x0161_confs.sdf --ref ../data/x0161_lig.sdf --out ../data/x0161_out.sdf

This command takes about 17 minutes to run on my laptop.  It looks like there were a few compounds containing Boron that caused the charge assignment routines to throw some warning messages.  Looks like I need to add a filtering rule to catch non-standard atom types. 

Results and Discussion
Now that we've evaluated the molecules in the binding site, we can take a look at some of  the top 100 results. Let's examine some of the structures and see what we can learn. 

In the figure above, our posed structure is shown in green and the original crystal structure (BF) of the fragment is shown in grey.  If we look at the file "scores.csv", we can see that the RMS indicates that the movement was slightly less than 1A. As we can see, BF hasn't moved a lot and is still maintaining the hydrogen bond to GLN 189.  We've also made an interesting cation-pi interaction with HIS 41.  The PDB file from the Diamond Light Source shows HIS 41 as being delta protonated (HID), but the software used to do the protein preparation assigned the residue as protonated (HIP).  I think I'd look at this one more carefully before I took any action.  


In this case, the basic nitrogen in the pyrrolidine is forming a salt bridge with GLU 166. This interaction is somewhat close to solvent so we might want to take that one with a grain of salt (pun intended).  All of these molecules make me want to restrict the number of rotatable bonds in the expanded fragments.  The entropic cost of these new interactions is probably going to be pretty high. 


This is another very flexible molecule, but it provides an idea that I wouldn't have thought of otherwise.  We have a pi stacking interaction between the acetylene in the ligand and HIS 41.  This is somewhat unusual, so I'd probably look through the PDB to see if I could find some precedent for this type of interaction.  

There are a number of ways that we could further refine some of these ideas.  
  • Minimize the system allowing protein and ligand flexibility
  • Carry out a molecular dynamics simulation to look at pose stability
  • Look for overlap between the generated structures and other fragments from the Diamond set
  • Look for opportunities to introduce rings and reduce the flexibility of the system
What I've demonstrated here isn't a magic approach to a drug.  It's just one of many approaches I use to generate ideas for how a structure can be modified.  I hope that someone finds this description and the accompanying code helpful.  Please don't hesitate to get in touch if you have questions or suggestions for how to improve this workflow. 


  1. Nice! In the last example, MOL0109_1, I found that His is double-protonated, implying a very acidic local environment. It won't affect the pi stacking, but I would have thought this Xtal structure is from PH=7.

  2. Hi Pat, is this a typo? "the protein and ligand are held fixed and the ligand is minimized" Or do you mean something subtle here with how you're doing the minimization?

    1. Not a typo, but poorly worded. I changed this to read "the protein and ligand conformations are held fixed and the ligand position is minimized"

    2. Ah, a rigid body minimization of the ligand!

  3. Thank you for the second part of the story!
    I do not see much obstacles to implement this workflow with open-source/free tools. Which part of the workflow do you consider hard to implement? Conformers can be generated with RDKit. Rigid body minimization? Of course, this will take longer time and will require more lines of code. OE is a really nice ecosystem.
    And another question. Why do you think that software will unlikely generate a perfect molecule? Theoretically, there are no restrictions on that from the structure generation point of view. There are big issues related to proper scoring of this molecule to recognize it, but this is another story.

  4. And a short comment how we do such expansion-evaluation. We dock expanded molecules and calculate RMSD for the core part to select those ones which top poses have low RMSD. This can be done with any docking tool.

    1. It's true, you can dock the expanded fragments and check the RMS of the core. I've run into a couple of issues with that approach.
      1. It's kind of slow.
      2. Docking programs sometimes have a hard time coming up with correct poses for fragment sized molecules.

      I like the approach I outlined because it's fast and doesn't depend on the accuracy of the docking engine.

  5. Hey Pat, thanks for another great post. Thought it might be helpful to contribute a (tiny) function for constrained conformer generation using RDKit (

    This is heavily stolen from Tim Dudgeon's implementation in Squonk:


Post a Comment

Popular posts from this blog

How to (Not) Get a Job in Science

Dissecting the Hype With Cheminformatics

How (Not) to Get a Job in Science - Part 2 - The Interview