Digging Deeper into Thompson Sampling - A Guest Blog Post by Patrick Riley
This week is a special Pat substitution. Pat(rick) Riley is taking over this blog post for a follow up on our Thompson Sampling paper.
In our recent paper Thompson Sampling─An Efficient Method for Searching Ultralarge Synthesis on Demand Databases we showed how you could use the classic Thompson Sampling algorithm to select each reagent in a combinatorial library to conduct an efficient search for a variety of scoring functions. Eagle-eyed readers probably noticed that for ROCS, we presented results searching 0.1% of the library and for docking we searched 1% because "docking with TS requires more sampling than ... ROCS".
But why is docking harder for a Thompson Sampling based search than ROCS?
This post gives an answer to that question.
A Visual Version
Remember that in Thompson Sampling, for each component in a reaction, you track statistics for each possible reagent. Each reagent is associated with a distribution of scores for complete molecules because the reagent can be mixed with many reagents for the other reaction components.
Since we have the complete enumeration of scores for the libraries for both docking and ROCS, we can calculate the true distribution of scores for each reagent, which we summarize as a mean and standard deviation. Since our complete enumeration contains 200M-300M total molecules, with 14k-18k reagents in each component, you don't want to load all the data into memory. Fortunately, Welford's online algorithm is a great way to compute the running mean and standard deviation with constant storage. This is an algorithm anyone processing large amounts of data should have in their back pocket.
To make things more comparable between ROCS and docking scores, I'll present the mean standardized in the normal way (using the mean and standard deviation of all scores). I'll also "standardize" the standard deviation by just dividing the standard deviation of the reagent by the standard deviation of all scores. That means if every reagent has exactly the standard deviation of all the data, then you would get 1.0 for the "standardized" standard deviation of every reagent. If a reagent has a "standardized" standard deviation larger than 1.0, the reagent has more variability than the overall set of scores.
Below is a scatter plot where each reagent is a point located by its mean and standard deviation. Since we care about finding the top molecules, the reagents of the true top 100 molecules are marked in red. There are two sets of plots because these are two component reactions (which we also call "cycles").
Thompson Sampling favors selecting reagents with higher mean (i.e. the right side of this plot). It's quite obvious that the reagents that make the top 100 molecules are more "mixed in" with the crowd of other reagents for docking compared to ROCS. This makes it harder to pick the right reagents, especially the right combination of reagents. In other words, for ROCS, the reagents of the best compounds tend to be the best reagents on average and that's less true of docking.
An interesting aside: you may also note that there is more variation in the standard deviation for ROCS compared to docking. Our paper uses Bayesian updates that assume we know the standard deviation of each reagent and the standard deviation is the same for all reagents. The assumption appears more true for docking. What implication this has on the effectiveness of Thompson Sampling is unknown.
A Quantitative Version
If you are unsatisfied with the squishiness of the claim that the best reagents are "more mixed in" for docking and want to make that quantitative, this section is for you.
Imagine a Thompson Sampling like oracle. It knows, for every reagent, the true mean and standard deviation for that reagent when combined with all others. The oracle samples from a Normal distribution with that mean and standard deviation to pick reagents at each step (similar to how Thompson Sampling does it). What’s the expected number of attempts for this oracle to find each of the top molecules?
The accompanying Jupyter notebook (section “Mathematical definition of the difficulty”) gives the full mathematical derivation, but it turns out to be fairly easy to compute this (with just a little bit of numeric care involving log probabilities, logsumexp, and numerically approximating an integral). The plot below compares the top ROCS and docking molecules with this metric.
The differences are quite stark in this quantitative version. Most of the top molecules for ROCS can be found by the oracle in 1,000 to 10,000 attempts. For docking, many of the top molecules take more like 1,000,000 attempts and some even 10,000,000,000 attempts. I believe any combinatorial search algorithm which is calculating utility for each reagent will struggle with this inherent difference in how consistently each reagent performs to find the top scoring molecules.
Final Notes and Caveats
More plots and technical details can be found in the associated notebook. I believe this quantitative metric could help understand the relative difficulty of libraries with varying number of cycles or reagents per cycle. Figuring out how to approximate this metric without doing a complete enumeration of the scores would be an interesting next step.
We are attributing the differences observed here to ROCS vs. docking, but note that these were done with different chemical libraries. The sizes are: ROCS has 230M molecules with reagent sizes of 14k and 17k and docking has 330M molecules with reagent sizes 18k and 19k. Some of the difference above is probably attributable to the slightly larger size for docking, but it's probably not a large effect.
Thank you Pat for letting me invade your blog for a week!
I wonder if this would change if you included the molecules that didn't dock. It's not clear what you should set the docking scores for those, but maybe it would bring down the mean for some of the reagents that look like the best in your plot but which don't make up the best molecules (although probably does something odd to the standard deviations).
ReplyDelete