A Few Updates to Free-Wilson
Note: This post will probably only be interesting to those who are using the Free-Wilson package I released on GitHub.
I made a few updates to the Free Wilson package.
1. Dramatically reduced the memory usage for the enumeration phase. Originally I was just saving the enumerated products to memory and writing them out at the end. Who knew that someone would want to enumerate 14 million products? The script now writes every 1000 structures to disk. There shouldn't be any memory issues, even with the largest enumerations.
2. The script now properly handles cases where the same substituent connects to multiple R-groups. This will be the case when a cycle connects two R-group positions. This isn't really valid for Free-Wilson, so I set the script up to simply skip cases like this. The script reports molecules that are skipped, so at least you'll have an indication of what happened.
3. The script has a new flag "--smarts" in the "rgroup" and "all" modes that enable proper handling of symmetric R-groups. As an example, let's look at the dataset in CHEMBL3638592. The molecules look like this.
If we define a scaffold like the one below, we have two symmetrically equivalent R-groups.
Note that sometimes R1 contains the alkyl substituent and sometimes it contains the aryl substituent. Fortunately, with this dataset, we have an easy way to distinguish our desired R1 and R3. If you look through the structures, you'll notice that there is a small alkyl substituent and an aryl substituent. We could use a very simple SMARTS pattern "c" to distinguish R1 from R3. Let's say that we want to use R1 for the alkyl substituent and R3 for the aryl substituent. We could just run the Free-Wilson script in the "rgroup" mode like this.
free_wilson.py rgroup --scaffold CHEMBL3638592_scaffold.mol --in CHEMBL3638592.smi --prefix CHEMBL3638592 --smarts "3|c"
The command is the same as before, the only addition is the "--smarts" flag. The "--smarts" flag is followed by a specification, which should be in quotes. The specification consists of two parts separated by "|". The first part specifies the R-group number, 3 is R3. The second part is the SMARTS, you should be able to make this as elaborate as you want.
Of course, we aren't always going to have a simple situation where an aromatic carbon will distinguish one of two symmetric substituents. What if we want to select the aliphatic substituent? This is doable, but it's a bit more tricky. In this case, we're going to have to use a slightly more sophisticated SMARTS. If we look at the molecules, we can see that the alkyl substituent is either a methyl group or an ethyl group. Once we've done the R-group decomposition, this methyl or ethyl group will be attached to a dummy atom designated by a "*". Let's look at the SMARTS pattern that we can use to define "methyl or ethyl attached to a dummy atom", then we can break it down and understand it. The pattern is shown below.
4. Support for a restricted enumeration. Of course, in most cases you probably don't want to enumerate 14 million molecules, you just want to enumerate the "best" molecules. The "all" and "enumeration" modes now support a new "--max" flag. If you specify this flag, the enumeration will only be carried out with a user-specified number of substitutients. This is best illustrated through an example. The example file CHEMBL3638592.smi in the data directory contains 2 R1, 5 R2, and 503 R3 substituents. Let's say that we only want to enumerate using the 10 highest scoring R3 substituents. We can then use this command line.
free_wilson.py all --scaffold CHEMBL3638592_scaffold.mol --in CHEMBL3638592.smi --prefix chembl --act CHEMBL3638592_act.csv --smarts "3|c" --log --max "a|2,3,10"
The key here is the last argument on the command line --max "a|2,3,10". This specifies that the enumeration should use 2 R1, 3 R2, and 10 R3 substituents. The best substituents are chosen based on the regression coefficients, as specified in the file prefix_coeffients.csv. The "a" before the "|" specifies that the script should sort the R-group contributions in ascending order. Since the data in CHEMBL3638592_act.csv consists of IC50 values, and lower values are better, we sort the data in ascending order. If we had something like pIC50 data, where larger values are better, we would use "d" to indicate that the data should be sorted in descending order and the command line would look like this. Note that if you are using the --log flag that converts data to a log scale you should also use the "d" option.
free_wilson.py all --scaffold CHEMBL3638592_scaffold.mol --in CHEMBL3638592.smi --prefix chembl --act CHEMBL3638592_act.csv --smarts "3|c" --log --max "d|2,3,10"
Note that the numbers following "|" specify the maximum number to be enumerated. So, in this case, since we only have 2 R1 and 5 R2, the flags --max "a|2,3,10" and --max "a|10,10,10" would have the same effect.
My thanks to Emanuele Perola for useful suggestions and debugging assistance.
This work is licensed under a Creative Commons Attribution 4.0 International License.
I made a few updates to the Free Wilson package.
1. Dramatically reduced the memory usage for the enumeration phase. Originally I was just saving the enumerated products to memory and writing them out at the end. Who knew that someone would want to enumerate 14 million products? The script now writes every 1000 structures to disk. There shouldn't be any memory issues, even with the largest enumerations.
2. The script now properly handles cases where the same substituent connects to multiple R-groups. This will be the case when a cycle connects two R-group positions. This isn't really valid for Free-Wilson, so I set the script up to simply skip cases like this. The script reports molecules that are skipped, so at least you'll have an indication of what happened.
3. The script has a new flag "--smarts" in the "rgroup" and "all" modes that enable proper handling of symmetric R-groups. As an example, let's look at the dataset in CHEMBL3638592. The molecules look like this.
If we define a scaffold like the one below, we have two symmetrically equivalent R-groups.
So what's going to happen when we do an R-group decomposition on the molecules in CHEMBL3638592? Will the alkyl substituent end up at R1 or R3? Unfortunately, it's going to be completely random, which is not good. Ok, technically it's not going to be random, it's going to depend on the atom ordering in the input, but you get what I mean. We are going to end up in a situation like the one shown below.
free_wilson.py rgroup --scaffold CHEMBL3638592_scaffold.mol --in CHEMBL3638592.smi --prefix CHEMBL3638592 --smarts "3|c"
The command is the same as before, the only addition is the "--smarts" flag. The "--smarts" flag is followed by a specification, which should be in quotes. The specification consists of two parts separated by "|". The first part specifies the R-group number, 3 is R3. The second part is the SMARTS, you should be able to make this as elaborate as you want.
Of course, we aren't always going to have a simple situation where an aromatic carbon will distinguish one of two symmetric substituents. What if we want to select the aliphatic substituent? This is doable, but it's a bit more tricky. In this case, we're going to have to use a slightly more sophisticated SMARTS. If we look at the molecules, we can see that the alkyl substituent is either a methyl group or an ethyl group. Once we've done the R-group decomposition, this methyl or ethyl group will be attached to a dummy atom designated by a "*". Let's look at the SMARTS pattern that we can use to define "methyl or ethyl attached to a dummy atom", then we can break it down and understand it. The pattern is shown below.
[#0;$([#0][CH3]),$([#0][CH2][CH3])]
The SMARTS pattern above is a recursive SMARTS. In a recursive SMARTS, the "anchor" atom of the pattern is defined first and followed by ";", the remainder of the pattern, enclosed in "$()" defines specific atom environments for the anchor atom. The comma between the two relationships defined inside "$()" indicates an "or" relationship. If we had wanted an "and" relationship, we could have used ";". We can break down this pattern as shown below. For more information on the wonders of SMARTS patterns, interested readers are urged to consult the Daylight Theory Manual.
If we run with the command shown below, all of the alkyl substituents will end up in R3.
free_wilson.py rgroup --scaffold CHEMBL3638592_scaffold.mol --in CHEMBL3638592.smi --prefix CHEMBL3638592 --smarts "3|[#0;$([#0][CH3]),$([#0][CH2][CH3])]"
If we run with the command shown below, all of the alkyl substituents will end up in R3.
free_wilson.py rgroup --scaffold CHEMBL3638592_scaffold.mol --in CHEMBL3638592.smi --prefix CHEMBL3638592 --smarts "3|[#0;$([#0][CH3]),$([#0][CH2][CH3])]"
4. Support for a restricted enumeration. Of course, in most cases you probably don't want to enumerate 14 million molecules, you just want to enumerate the "best" molecules. The "all" and "enumeration" modes now support a new "--max" flag. If you specify this flag, the enumeration will only be carried out with a user-specified number of substitutients. This is best illustrated through an example. The example file CHEMBL3638592.smi in the data directory contains 2 R1, 5 R2, and 503 R3 substituents. Let's say that we only want to enumerate using the 10 highest scoring R3 substituents. We can then use this command line.
free_wilson.py all --scaffold CHEMBL3638592_scaffold.mol --in CHEMBL3638592.smi --prefix chembl --act CHEMBL3638592_act.csv --smarts "3|c" --log --max "a|2,3,10"
The key here is the last argument on the command line --max "a|2,3,10". This specifies that the enumeration should use 2 R1, 3 R2, and 10 R3 substituents. The best substituents are chosen based on the regression coefficients, as specified in the file prefix_coeffients.csv. The "a" before the "|" specifies that the script should sort the R-group contributions in ascending order. Since the data in CHEMBL3638592_act.csv consists of IC50 values, and lower values are better, we sort the data in ascending order. If we had something like pIC50 data, where larger values are better, we would use "d" to indicate that the data should be sorted in descending order and the command line would look like this. Note that if you are using the --log flag that converts data to a log scale you should also use the "d" option.
free_wilson.py all --scaffold CHEMBL3638592_scaffold.mol --in CHEMBL3638592.smi --prefix chembl --act CHEMBL3638592_act.csv --smarts "3|c" --log --max "d|2,3,10"
Note that the numbers following "|" specify the maximum number to be enumerated. So, in this case, since we only have 2 R1 and 5 R2, the flags --max "a|2,3,10" and --max "a|10,10,10" would have the same effect.
That's it for now, I have a few more changes planned for the Free-Wilson scripts and will report back when those are complete. Hopefully, the current changes will make the Free-Wilson code a bit more flexible and robust.
My thanks to Emanuele Perola for useful suggestions and debugging assistance.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Comments
Post a Comment