Assigning Bond Orders to PDB Ligands

In this post, I'll walk through how we can combine a couple of Open Source software tools to easily and reliably assign bond orders to ligands from protein-ligand complexes from the PDB. As usual, the associated code is in GitHub.

One of the many things that frustrate me about the PDB file format is the absence of bond order information (please don't talk to me about double CONECT records). Since the bond order information is missing, we typically have to assign the bond orders, either manually or algorithmically. Anyone who has tried to implement bond order perception from PDB files will tell you that it's a difficult problem. For a detailed explanation of what's necessary, take a look at this 2001 talk from Roger Sayle. The problem is confounded by the fact that the geometry of many of the ligands in the PDB is less than ideal. For a detailed explanation of the many issues with PDB ligand geometries, take a look a Greg Warren's work on creating the IRIDIUM database.

Fortunately, there is a relatively easy solution to assigning bond orders for any ligand in the PDB. We already know the answer so we can cheat. How, you ask, do we know the answer? The wonderful people at the PDB maintain something called Ligand Expo, which is a dictionary of SMILES strings for all of the small molecules in the PDB. Since someone (perhaps magical elves) already assigned the bond orders to the ligand, it's just a matter of matching the SMILES string for the ligand to the ligand molecule extracted from the PDB and copying the bond orders. Extracting the ligand from the PDB file can be done with the Prody Python Library and matching the SMILES string to PDB ligand molecule can be done with the RDKit.

All we have to do is write a bit of code to stitch this together. In the rest of this post, I'm going to walk through the code. If you want to skip that and just grab the code, the gist is here.

Before we get started, we need to install a few necessary prerequisites, the RDKit, and the ProDy library, and the requests library. The installation instructions for the RDKit can be found here. Prody requires BioPython, but you can install both of these, as well as the requests library with pip.

pip install biopython prody requests

Once everything is installed, we just need to import a few libraries.

import sys
from prody import *
import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from io import StringIO
import requests

The ProDy library allows us to process proteins. We will use Pandas to read a tab separated file and convert it to a Python dictionary. Most of the small molecule processing will be done with the RDKit, and we will use the requests library to handle a file download.

Now that the imports are taken care of, the first thing we need to do is grab the Component Dictionary from the PDB. This file contains, among other things, the identifier and SMILES strings for all of the non-protein components in the PDB. We can find the identifier for our ligand(s), then look up the SMILES string. We don't want to be jerks and hit the PDB servers every time, so we'll check to see if we already have the file. If we don't have the file, we'll download it from the Ligand Expo site. Once we have the file, we'll use Pandas to read the file from disk and convert it to a Python dictionary.

def read_ligand_expo():
    """
    Read Ligand Expo data, try to find a file called
    Components-smiles-stereo-oe.smi in the current directory.
    If you can't find the file, grab it from the RCSB
    :return: Ligand Expo as a dictionary with ligand id as the key
    """
    file_name = "Components-smiles-stereo-oe.smi"
    try:
        df = pd.read_csv(file_name, sep="\t",
                         header=None, 
                         names=["SMILES", "ID", "Name"])
    except FileNotFoundError:
        url = f"http://ligand-expo.rcsb.org/dictionaries/{file_name}"
        print(url)
        r = requests.get(url, allow_redirects=True)
        open('Components-smiles-stereo-oe.smi', 'wb').write(r.content)
        df = pd.read_csv(file_name, sep="\t",
                         header=None, 
                         names=["SMILES", "ID", "Name"])
    df.set_index("ID", inplace=True)
    return df.to_dict()

Now that we have the Component Dictionary, we can read the PDB file from disc and find the ligand. All of our PDB file reading and processing will be done with the awesome ProDy library. One of the many useful things in ProDy is a set of routines for fetching PDB files. If you request a PDB file with the parsePDB function, ProDy will first check in the current working directory to see if the file exists. If the file is there, it will be loaded. For instance, if I call parsePDB("1bmk")the script will first check to see if "1bmk.pdb" exists in the current directory. If the file exists, it will be loaded. However, if the file doesn't exist in the current directory, ProDy will download a copy from the PDB to your working directory and print a message to tell you what it did. Once ProDy has read the PDB file, we can use its query language to extract the ligands. The ligand variable below is actually an object that we can query to get each of the ligands, more on that in a second.

def get_pdb_components(pdb_id):
    """
    Split a protein-ligand pdb into protein and ligand components
    :param pdb_id:
    :return:
    """
    pdb = parsePDB(pdb_id)
    protein = pdb.select('protein')
    ligand = pdb.select('not protein and not water')
    return protein, ligand

Ok, here's the best part. Now that we have the ligand from ProDy and the SMILES for that ligand from Ligand Expo, we can use ProDy to extract a ligand component and assign bond orders with the RDKit. There is one sort of tricky bit here where I use StringIO to avoid writing a file to disc. We could get the ligand from ProDy, write it to disc, then have the RDKit read the PDB file, but that would be inelegant. The StringIO library allows us to write the PDB to a stream with ProDy's writePDBStream, then have the RDKit read that stream with AllChem.MolFromPDBBlock.

def process_ligand(ligand, res_name, expo_dict):
    """
    Add bond orders to a pdb ligand
    1. Select the ligand component with name "res_name"
    2. Get the corresponding SMILES from the Ligand Expo dictionary
    3. Create a template molecule from the SMILES in step 2
    4. Write the PDB file to a stream
    5. Read the stream into an RDKit molecule
    6. Assign the bond orders from the template from step 3
    :param ligand: ligand as generated by prody
    :param res_name: residue name of ligand to extract
    :param expo_dict: dictionary with LigandExpo
    :return: molecule with bond orders assigned
    """
    output = StringIO()
    sub_mol = ligand.select(f"resname {res_name}")
    sub_smiles = expo_dict['SMILES'][res_name]
    template = AllChem.MolFromSmiles(sub_smiles)
    writePDBStream(output, sub_mol)
    pdb_string = output.getvalue()
    rd_mol = AllChem.MolFromPDBBlock(pdb_string)
    new_mol = AllChem.AssignBondOrdersFromTemplate(template, rd_mol)
    return new_mol

At this point, we just need a couple of functions to handle the writing of files,

def write_pdb(protein, pdb_name):
    """
    Write a prody protein to a pdb file
    :param protein: protein object from prody
    :param pdb_name: base name for the pdb file
    :return: None
    """
    output_pdb_name = f"{pdb_name}_protein.pdb"
    writePDB(f"{output_pdb_name}", protein)
    print(f"wrote {output_pdb_name}")


def write_sdf(new_mol, pdb_name, res_name):
    """
    Write an RDKit molecule to an SD file
    :param new_mol:
    :param pdb_name:
    :param res_name:
    :return:
    """
    outfile_name = f"{pdb_name}_{res_name}_ligand.sdf"
    writer = Chem.SDWriter(f"{outfile_name}")
    writer.write(new_mol)
    print(f"wrote {outfile_name}")

and the main program to tie everything together.

def main(pdb_name):
    """
    Read Ligand Expo data, split pdb into protein and ligands,
    write protein pdb, write ligand sdf files
    :param pdb_name: id from the pdb, doesn't need to have an extension
    :return:
    """
    df_dict = read_ligand_expo()
    protein, ligand = get_pdb_components(pdb_name)
    write_pdb(protein, pdb_name)

    res_name_list = list(set(ligand.getResnames()))
    for res in res_name_list:
        new_mol = process_ligand(ligand, res, df_dict)
        write_sdf(new_mol, pdb_name, res)


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("Usage: {sys.argv[1]} pdb_id", file=sys.stderr)

If we save the code from the gist as split_complex.py and run it with the command below, this is what we'll see.

split_complex.py 1bmk
@> Connecting wwPDB FTP server RCSB PDB (USA).
@> 1bmk downloaded (1bmk.pdb.gz)
@> PDB download via FTP completed (1 downloaded, 0 failed).
@> 2950 atoms and 1 coordinate set(s) were parsed in 0.03s.
wrote 1bmk_protein.pdb
[09:23:27] WARNING: More than one matching pattern found - picking one
wrote 1bmk_SB5_ligand.sdf

The message
WARNING: More than one matching pattern found - picking one
Appears because the fluorophenyl group is symmetric and can match multiple ways. This isn't something we have to worry about.

That's it, simple, easy, hope it helps you!

Note: Shortly after publishing this post I got a comment from Sung-Hun Bae that pointed out how I could use the very cool pypdb library to get the SMILES for the ligand. This simplifies the code quite a bit. I put a gist with the updated code here.

This work is licensed under a Creative Commons Attribution 4.0 International License.

Search This Blog

Practical Cheminformatics

Assigning Bond Orders to PDB Ligands - The Easy Way

Comments

Post a Comment

Popular posts from this blog

We Need Better Benchmarks for Machine Learning in Drug Discovery

Silly Things Large Language Models Do With Molecules

Generative Molecular Design Isn't As Easy As People Make It Look