Exploring the SARS-CoV-2 Main Protease (Mpro) Structures

Like many other groups around the world, we've been doing some virtual screening on the SARS-CoV-2 main protease (Mpro). Many crystal structures of Mpro are now available, so how do we decide which ones to use for docking? We'd like to select a diverse set of protein conformations that will enable us to explore multiple binding site interactions. In this post, I'll provide an overview of a Jupyter notebook that my colleague Nic Pabon put together to explore the Mpro crystal structures. While this notebook is oriented toward Mpro, the techniques discussed here can be applied to any protein.

In this notebook, Nic used the open-source ProDy toolkit to perform a number of analyses on the Mpro fragment structures recently released by the team at the Diamond Light Source. The notebook begins with an overview of some of the basic capabilities available in ProDy.

  • Reading PDB files
  • Properties (number of atoms, residues, chains)
  • Per atom properties
  • Hierarchical indexing
  • Extracting coordinates
  • Extracting backbone coordinates
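
The capabilities listed above can be sketched in a few lines. This is a minimal sketch rather than the code from Nic's notebook: the function name `summarize_structure` and the dictionary keys are hypothetical, and ProDy is imported inside the function so the snippet still loads on a machine where ProDy is not installed.

```python
def summarize_structure(pdb_path):
    """Collect a few basic properties of a PDB file using ProDy.

    `pdb_path` is assumed to point to a local PDB file.
    """
    from prody import parsePDB  # imported lazily; requires ProDy

    structure = parsePDB(pdb_path)           # reading a PDB file
    hier_view = structure.getHierView()      # chain -> residue -> atom hierarchy

    # Hierarchical indexing: walk down to the first chain and its first residue.
    first_chain = next(hier_view.iterChains())
    first_residue = next(first_chain.iterResidues())

    # Selections return None when nothing matches.
    backbone = structure.select("backbone")

    return {
        "n_atoms": structure.numAtoms(),
        "n_residues": hier_view.numResidues(),
        "n_chains": hier_view.numChains(),
        "atom_names": structure.getNames(),    # per-atom property
        "b_factors": structure.getBetas(),     # per-atom property
        "first_chain_id": first_chain.getChid(),
        "first_residue": (first_residue.getResname(), first_residue.getResnum()),
        "coords": structure.getCoords(),       # (n_atoms, 3) array
        "backbone_coords": None if backbone is None else backbone.getCoords(),
    }
```

Calling `summarize_structure` on a downloaded Mpro structure returns plain Python and NumPy objects that can be inspected directly in a notebook cell.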

In the next part of the notebook, we explore the flexibility of Mpro and use the root mean square fluctuation (RMSF) of each residue across the aligned structures to identify flexible residues. We can then compare the RMSF for structures with covalent and non-covalent ligands bound.
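
The notebook relies on ProDy for this step; the NumPy sketch below only illustrates the underlying calculation. It assumes the structures have already been superposed and reduced to the same atoms in the same order (for example, one Cα per residue). The names `coords` and `is_covalent` are hypothetical.

```python
import numpy as np


def rmsf_per_atom(coords):
    """RMSF of each atom across a set of superposed structures.

    coords: array of shape (n_structures, n_atoms, 3).
    """
    coords = np.asarray(coords, dtype=float)
    if coords.ndim != 3 or coords.shape[2] != 3:
        raise ValueError("coords must have shape (n_structures, n_atoms, 3)")
    mean_structure = coords.mean(axis=0)                    # (n_atoms, 3)
    sq_dev = ((coords - mean_structure) ** 2).sum(axis=2)   # (n_structures, n_atoms)
    return np.sqrt(sq_dev.mean(axis=0))                     # (n_atoms,)


def compare_rmsf(coords, is_covalent):
    """Return (RMSF of covalent complexes, RMSF of non-covalent complexes).

    is_covalent: one boolean per structure (a hypothetical annotation).
    """
    coords = np.asarray(coords, dtype=float)
    mask = np.asarray(is_covalent, dtype=bool)
    return rmsf_per_atom(coords[mask]), rmsf_per_atom(coords[~mask])
```

Each subset is averaged around its own mean structure, so the two profiles describe the spread within each group rather than the shift between groups.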



We can also use ProDy to write the RMSF values into each atom's temperature factor (B-factor) field. We can then color the structure by temperature factor to highlight the more flexible regions of the protein.
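
A minimal sketch of this step follows, assuming one RMSF value per protein residue, ordered as the residues appear in the file. The helper names `expand_to_atoms` and `write_rmsf_as_bfactor` are hypothetical; `parsePDB`, `select`, `getResindices`, `setBetas`, and `writePDB` are ProDy calls, imported lazily so the helper can be used without ProDy.

```python
import numpy as np


def expand_to_atoms(per_residue_values, atom_resindices):
    """Repeat one value per residue so that every atom gets its residue's value."""
    values = np.asarray(per_residue_values, dtype=float)
    # Map arbitrary residue indices onto 0..n_residues-1 in sorted order.
    unique_res, inverse = np.unique(np.asarray(atom_resindices), return_inverse=True)
    if len(values) != len(unique_res):
        raise ValueError("need exactly one value per residue")
    return values[inverse]


def write_rmsf_as_bfactor(pdb_in, pdb_out, per_residue_rmsf):
    """Write the protein atoms of pdb_in to pdb_out with RMSF in the B-factor column."""
    from prody import parsePDB, writePDB  # imported lazily; requires ProDy

    structure = parsePDB(pdb_in)
    protein = structure.select("protein")
    if protein is None:
        raise ValueError("no protein atoms found in " + str(pdb_in))
    protein.setBetas(expand_to_atoms(per_residue_rmsf, protein.getResindices()))
    return writePDB(pdb_out, protein)
```

The output file can then be opened in any molecular viewer that supports coloring by B-factor.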


One common task we carry out prior to performing a virtual screen is identifying a representative set of binding sites for docking. One way to do this is to cluster the structures and select a representative from each cluster. In this example, we first calculate the root mean square deviation (RMSD) between all pairs of structures. This creates a vector for each protein that contains its RMSD to each of the other proteins. We then use the k-means algorithm, as implemented in scikit-learn, to cluster these vectors into a predefined number of clusters.
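
The clustering procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the binding-site coordinates are assumed to be superposed already and stored as an array of shape (n_structures, n_atoms, 3), and picking the member closest to each k-means center as the representative is one reasonable choice among several. The function names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def pairwise_rmsd(coords):
    """All-against-all RMSD for superposed structures; returns an (n, n) matrix."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :, :] - coords[None, :, :, :]     # (n, n, n_atoms, 3)
    return np.sqrt((diff ** 2).sum(axis=3).mean(axis=2))


def cluster_structures(coords, n_clusters=3, random_state=0):
    """Cluster structures on their RMSD vectors and pick one representative each."""
    rmsd = pairwise_rmsd(coords)   # row i is the RMSD vector of structure i
    kmeans = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit(rmsd)
    representatives = []
    for k in range(n_clusters):
        members = np.flatnonzero(kmeans.labels_ == k)
        # Representative: the member whose RMSD vector is nearest the cluster center.
        dist = np.linalg.norm(rmsd[members] - kmeans.cluster_centers_[k], axis=1)
        representatives.append(int(members[np.argmin(dist)]))
    return kmeans.labels_, representatives, rmsd
```

The number of clusters is a free parameter here, so it is worth trying a few values and inspecting the resulting representatives before committing to a docking set.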

One interesting aspect of this notebook is an interactive viewer that enables the user to examine and compare the clusters.  The animation below shows how this works.


In addition to clustering, we can also use Principal Component Analysis (PCA) to compare the structures.  In PCA, we project the binding site coordinates into two dimensions.  In this 2D plot, related structures will be close together, while binding site conformations that are different will be farther apart.  In the plot below, each point representing a protein structure is colored by cluster id, with the cluster centers represented by "X"s. 
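
As a rough illustration of the projection, the sketch below performs PCA with a plain NumPy singular value decomposition. The notebook may well use a library routine instead, so treat this as an assumption; the input is again a superposed coordinate array of shape (n_structures, n_atoms, 3), and `pca_project` is a hypothetical name.

```python
import numpy as np


def pca_project(coords, n_components=2):
    """Project flattened, mean-centered coordinates onto the top principal components.

    Returns (scores, explained_variance_ratio), where scores has shape
    (n_structures, n_components).
    """
    coords = np.asarray(coords, dtype=float)
    flat = coords.reshape(len(coords), -1)     # one row of x, y, z values per structure
    centered = flat - flat.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    variance = singular_values ** 2
    total = variance.sum()
    ratio = variance / total if total > 0 else np.zeros_like(variance)
    return scores, ratio[:n_components]
```

The two columns of `scores` are the x and y values for a scatter plot, and cluster labels from a separate clustering step can be passed as the point colors.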



Finally, as I mentioned above, new structures of Mpro are being released every day. One way to find these new structures is to perform a BLAST search. In this notebook, Nic shows how you can use ProDy to convert an Mpro structure to a sequence. This sequence can then be used to perform a BLAST search for homologous structures. The homolog structures can then be aligned and used for subsequent analysis and virtual screening.
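
A minimal sketch of the sequence and search steps is shown below. The function names and the 90% identity cutoff are hypothetical choices, not values taken from the notebook. `blastPDB` submits the query to a remote BLAST service, so it needs a network connection and can take a while or time out; its exact behavior may differ between ProDy versions.

```python
def chain_sequence(pdb_path, chain_id="A"):
    """Return the one-letter amino acid sequence of one chain in a PDB file."""
    from prody import parsePDB  # imported lazily; requires ProDy

    structure = parsePDB(pdb_path)
    chain = structure.getHierView()[chain_id]
    if chain is None:
        raise ValueError("chain " + chain_id + " not found in " + str(pdb_path))
    return chain.getSequence()


def find_related_structures(sequence, min_identity=90.0):
    """BLAST a sequence against the PDB and return the matching PDB IDs.

    Requires network access; min_identity is an illustrative cutoff.
    """
    from prody import blastPDB  # imported lazily; requires ProDy

    record = blastPDB(sequence)
    hits = record.getHits(percent_identity=min_identity)  # dict keyed by PDB ID
    return sorted(hits)
```

The returned PDB IDs can then be downloaded, superposed onto a reference structure, and passed through the same RMSF, clustering, and PCA steps described earlier.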

As usual, the code for this post is available on GitHub.  Please leave a comment if you have questions or improvements to the code. 

