Plotting Distributions

In cheminformatics, we often deal with distributions. We may want to look at the distributions of biological activity in one or more datasets. If we're building predictive models, we may want to look at the distributions of prediction errors. A recent Twitter thread on "dynamite plots", and the argument that they should be banned from all scientific discourse, got me thinking about a few different ways to look at distributions. I thought it might be useful to put together some code examples showing how one can plot distributions. As usual, all of the code I used for analysis and plotting can be found on GitHub.

Dynamite Plots (don't use these)
To demonstrate this concept, we'll look at the dataset published by Lee et al., based on an earlier publication by Wang et al., that I used in my last post. This dataset consists of molecules with a range of activities against eight different targets. One way to plot this data is to use the classic dynamite plot. These plots, which were named for their similarity to the dynamite detonators depicted in Road Runner cartoons, show the mean of a distribution as a bar and the standard deviation as a whisker above the bar.




We've all seen these plots, or the even worse case where the authors simply display a bar plot showing the mean with no error bars. As you'll see if you do a Google search for "dynamite plot", many statisticians and data geeks consider these plots to be an abomination. There is a lot not to like about dynamite plots.
  • They only show the mean and standard deviation
  • They don't tell you anything about how the data is distributed
  • They only show the upper whisker, which makes comparisons difficult 
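
To make the second point concrete, here is a minimal sketch using synthetic data (not the dataset from this post). Two samples are rescaled to share the same mean and standard deviation, so their dynamite plots would look identical even though one is unimodal and the other is bimodal. The helper names `rescale` and `fraction_near_mean`, and all of the numbers, are assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def rescale(values, mean, std):
    """Shift and scale values so they have exactly the given mean and std."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std() * std + mean

def fraction_near_mean(values, width=0.5):
    """Fraction of values within `width` of the sample mean."""
    return float(np.mean(np.abs(values - values.mean()) < width))

# Synthetic "pIC50-like" samples, invented for illustration only
unimodal = rescale(rng.normal(0.0, 1.0, 1000), mean=6.0, std=1.0)
bimodal = rescale(
    np.concatenate([rng.normal(-1.0, 0.3, 500), rng.normal(1.0, 0.3, 500)]),
    mean=6.0,
    std=1.0,
)

# A dynamite plot only draws the mean and std, which are identical here...
# ...but the bimodal sample has very few points near its own mean.
near_uni = fraction_near_mean(unimodal)
near_bi = fraction_near_mean(bimodal)
print(f"unimodal: mean={unimodal.mean():.2f} std={unimodal.std():.2f} near mean={near_uni:.2f}")
print(f"bimodal:  mean={bimodal.mean():.2f} std={bimodal.std():.2f} near mean={near_bi:.2f}")
```

Both samples report the same two summary numbers, yet the fraction of points close to the mean is very different, which is exactly the information a bar and a whisker cannot show.
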
Box Plots
One alternative to the dynamite plot is the box plot. Box plots show a distribution as a box, which represents the middle 50% of the distribution, with whiskers that indicate the spread of the rest of the data. The line in the middle of the box typically represents the median of the distribution. The distance from the bottom to the top of the box is referred to as the interquartile range (IQR). In the most common convention, each whisker extends to the most extreme data point that lies within 1.5 * IQR of the bottom or top of the box. Any points beyond the whiskers are considered outliers and are typically drawn as individual points. As you can see below, the box plot provides much more information on the overall activity distributions.
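
The quantities a box plot draws can be computed directly. The sketch below assumes the common convention in which whiskers end at the most extreme data points within 1.5 * IQR of the box; plotting libraries may use a different quantile interpolation or whisker rule. The function name `box_plot_stats` and the example values are hypothetical.

```python
import numpy as np

def box_plot_stats(values, whisker=1.5):
    """Return the numbers a box plot draws, using the 1.5 * IQR convention."""
    values = np.sort(np.asarray(values, dtype=float))
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers stop at the most extreme data points inside the fences
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "outliers": outliers.tolist(),
    }

# Made-up pIC50 values for illustration
stats = box_plot_stats([4.2, 5.1, 5.5, 5.8, 6.0, 6.2, 6.5, 6.9, 9.5])
print(stats)
```

Note that the whisker ends are actual data values rather than the fence positions themselves, so a box plot's whiskers are usually shorter than 1.5 * IQR.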


Beeswarm Plots
An alternative to the box plot is the beeswarm plot. In the beeswarm plot, every data point is plotted at its y-value. If multiple points have the same, or nearly the same, value, they are spread out horizontally so that they don't overlap. One advantage of the beeswarm plot is the ability to use color to show an additional property. For instance, we could use a beeswarm plot to display the pIC50 of the compounds, then use color to display the fold error in the predicted pIC50.
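
The horizontal spreading can be sketched with a simple binning scheme: points whose y-values fall in the same bin are fanned out left and right of the center line. This is a simplified layout for illustration, not the algorithm used by any particular plotting library, and the names `swarm_offsets`, `bin_width`, and `spacing` are assumptions.

```python
import numpy as np

def swarm_offsets(y, bin_width=0.1, spacing=0.05):
    """Return a horizontal offset for each point so near-equal y-values fan out."""
    y = np.asarray(y, dtype=float)
    bins = np.floor(y / bin_width).astype(int)
    offsets = np.zeros_like(y)
    for b in np.unique(bins):
        idx = np.flatnonzero(bins == b)
        k = np.arange(len(idx))
        # Alternate sides while moving outward: 0, +1, -1, +2, -2, ...
        steps = ((k + 1) // 2) * np.where(k % 2 == 1, 1, -1)
        offsets[idx] = steps * spacing
    return offsets

# Made-up pIC50 values: three near 5.0, one alone, two near 7.5
y = [5.02, 5.05, 5.07, 6.31, 7.55, 7.58]
offsets = swarm_offsets(y)
print(offsets)
```

The offsets can then be added to each group's x position in a scatter plot, for example with matplotlib's `scatter(x, y, c=...)`, where the `c` argument carries the additional property such as fold error.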




Violin Plots
Box plots tell us a lot more about how the data is distributed, but we could still be missing key parts of the picture. The box shows us where the middle 50% of the distribution lies and the whiskers show its limits, but neither tells us things like whether the data has a bimodal distribution. One way to look at these distributions is to use a violin plot. A violin plot is similar to a box plot, but instead of a box and whiskers, the data is plotted using a kernel density estimate (KDE). A KDE is a smoothed curve that provides an estimate of the probability density of a distribution. One can think of a violin plot as a KDE that has been turned on its side and mirrored. The violin plot also shows a box plot at its center.
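
To show what a KDE does, here is a minimal sketch that places a Gaussian bump on every data point and averages the bumps. It uses synthetic bimodal data and a hand-picked bandwidth of 0.3; both are assumptions for illustration, and plotting libraries typically choose the bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One Gaussian bump per sample, evaluated at every grid point
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    # Averaging the bumps gives a curve that integrates to about 1
    return kernels.mean(axis=1)

# Synthetic bimodal sample, invented for illustration
rng = np.random.default_rng(1)
samples = np.concatenate([rng.normal(5.0, 0.4, 300), rng.normal(7.5, 0.4, 300)])

grid = np.linspace(2.0, 10.5, 400)
density = gaussian_kde_1d(samples, grid, bandwidth=0.3)

# Mirroring the curve about the center line gives the two sides of a violin
violin_left, violin_right = -density, density

area = density.sum() * (grid[1] - grid[0])  # Riemann-sum approximation
at_modes = np.interp([5.0, 7.5], grid, density)
at_valley = float(np.interp(6.25, grid, density))
print(f"area={area:.3f} modes={at_modes.round(3)} valley={at_valley:.3f}")
```

The dip between the two modes is visible in the density curve, whereas a box plot of the same sample would show only a single wide box.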



Ridge Plots
From the plot above, we can see that the pIC50 distributions for most of the targets appear Gaussian, but the distributions for cdk2, mcl1, and tyk2 seem somewhat bimodal. While violin plots are useful, many people, including me, find them visually overwhelming. Most of us are more used to looking at distributions that are displayed horizontally rather than vertically.

One visualization that has been receiving a lot of attention, the ridge plot, attempts to address these limitations of the violin plot by arraying a set of overlapping KDEs in a vertical grid.  I find the plot below the easiest to process.   For me, this plot really highlights the differences in the distributions.
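
The layout of a ridge plot can be sketched as one KDE per group, each drawn on its own baseline and scaled tall enough to overlap the row above. The code below is an illustrative sketch, not the code used to make the plots in this post: the function names, the `overlap` parameter, the target names, and the data are all assumptions. Matplotlib is imported only inside the plotting function, so the layout step needs nothing beyond NumPy.

```python
import numpy as np

def kde(samples, grid, bandwidth=0.3):
    """Gaussian kernel density estimate evaluated on `grid`."""
    z = (grid[:, None] - np.asarray(samples, dtype=float)[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).mean(axis=1) / (bandwidth * np.sqrt(2.0 * np.pi))

def ridge_layout(groups, grid, overlap=1.5):
    """Return (name, baseline, curve) rows; baselines are spaced 1.0 apart.

    Each curve is scaled so its peak is `overlap` row heights tall, which
    makes neighboring rows overlap when `overlap` is greater than 1.
    """
    rows = []
    for i, (name, samples) in enumerate(groups.items()):
        density = kde(samples, grid)
        rows.append((name, float(i), density / density.max() * overlap))
    return rows

def plot_ridges(rows, grid):
    """Draw the rows with matplotlib (imported lazily; only needed to plot)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    # Draw the top rows first so that lower rows are painted over them
    for name, baseline, curve in reversed(rows):
        ax.fill_between(grid, baseline, baseline + curve, alpha=0.8)
    ax.set_yticks([baseline for _, baseline, _ in rows])
    ax.set_yticklabels([name for name, _, _ in rows])
    ax.set_xlabel("pIC50")
    return fig

# Synthetic data with placeholder target names, for illustration only
rng = np.random.default_rng(2)
groups = {
    "target_a": rng.normal(6.0, 0.6, 200),
    "target_b": np.concatenate([rng.normal(5.0, 0.4, 100), rng.normal(7.5, 0.4, 100)]),
}
grid = np.linspace(2.0, 11.0, 300)
rows = ridge_layout(groups, grid)
```

Calling `plot_ridges(rows, grid)` draws the figure. Scaling every curve to the same peak height makes shapes easy to compare across rows, at the cost of no longer showing relative densities between groups.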




There is no single ideal plot for distributions. A lot depends on the aspects of the data that you want to examine or highlight. Hopefully, this post and the corresponding code will provide some new ways for you to look at your data.
