Non-linear dimension reduction to diffusion maps

Here we reduce the dimensions of a trajectory into a diffusion map.

Last updated: December 2022 with MDAnalysis 2.4.0-dev0

Minimum version of MDAnalysis: 0.17.0

Packages required:

MDAnalysis ([MADWB11], [GLB+16])
MDAnalysisTests

Note

Please cite [CL06] if you use the MDAnalysis.analysis.diffusionmap.DiffusionMap in published work.

[1]:

import MDAnalysis as mda
from MDAnalysis.tests.datafiles import PSF, DCD
from MDAnalysis.analysis import diffusionmap
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Loading files

The test files we will be working with here feature adenylate kinase (AdK), a phosophotransferase enzyme ([BDPW09]). The trajectory DCD samples a transition from a closed to an open conformation.

[2]:

u = mda.Universe(PSF, DCD)

/home/pbarletta/mambaforge/envs/mda-user-guide/lib/python3.9/site-packages/MDAnalysis/coordinates/DCD.py:165: DeprecationWarning: DCDReader currently makes independent timesteps by copying self.ts while other readers update self.ts inplace. This behavior will be changed in 3.0 to be the same as other readers. Read more at https://github.com/MDAnalysis/mdanalysis/issues/3889 to learn if this change in behavior might affect you.
  warnings.warn("DCDReader currently makes independent timesteps"

Diffusion maps

Diffusion maps are a non-linear dimensionality reduction technique that embeds the coordinates of each frame onto a lower-dimensional space, such that the distance between each frame in the lower-dimensional space represents their “diffusion distance”, or similarity. It integrates local information about the similarity of each point to its neighours, into a global geometry of the intrinsic manifold. This means that this technique is not suitable for trajectories where the transitions between conformational states are not well-sampled (e.g. replica exchange simulations), as the regions may become disconnected and a meaningful global geometry cannot be approximated. Unlike principal component analysis, there is no explicit mapping between the components of the lower-dimensional space and the original atomic coordinates; no physical interpretation of the eigenvectors is immediately available. Please see [CL06], [dlPHHvdW08], [RZMC11] and [FPKD11] for more information.

The default distance metric implemented in MDAnalysis’ DiffusionMap class is RMSD.

Note

MDAnalysis implements RMSD calculation using the fast QCP algorithm ([The05]). Please cite [The05] if you use the default distance metric in published work.

[3]:

dmap = diffusionmap.DiffusionMap(u, select='backbone', epsilon=2)
dmap.run()

[3]:

<MDAnalysis.analysis.diffusionmap.DiffusionMap at 0x7f75f1291520>

The first eigenvector in a diffusion map is always essentially all ones (when divided by a constant):

[4]:

dmap._eigenvectors[:, 0]

[4]:

array([0.10101525, 0.10101525, 0.10101525, 0.10101525, 0.10101525,
       0.10101525, 0.10101525, 0.10101525, 0.10101525, 0.10101525,
       0.10101525, 0.10101525, 0.10101525, 0.10101525, 0.10101525,
       0.10101525, 0.10101525, 0.10101525, 0.10101525, 0.10101525,
       0.10101525, 0.10101525, 0.10101525, 0.10101525, 0.10101525,
       0.10101525, 0.10101525, 0.10101525, 0.10101525, 0.10101525,
       0.10101525, 0.10101525, 0.10101525, 0.10101525, 0.10101525,
       0.10101525, 0.10101525, 0.10101525, 0.10101525, 0.10101525,
       0.10101525, 0.10101525, 0.10101525, 0.10101525, 0.10101525,
       0.10101525, 0.10101525, 0.10101525, 0.10101525, 0.10101525,
       0.10101525, 0.10101525, 0.10101525, 0.10101525, 0.10101525,
       0.10101525, 0.10101525, 0.10101525, 0.10101525, 0.10101525,
       0.10101525, 0.10101525, 0.10101525, 0.10101525, 0.10101525,
       0.10101525, 0.10101525, 0.10101525, 0.10101525, 0.10101525,
       0.10101525, 0.10101525, 0.10101525, 0.10101525, 0.10101525,
       0.10101525, 0.10101525, 0.10101525, 0.10101525, 0.10101525,
       0.10101525, 0.10101525, 0.10101525, 0.10101525, 0.10101525,
       0.10101525, 0.10101525, 0.10101525, 0.10101525, 0.10101525,
       0.10101525, 0.10101525, 0.10101525, 0.10101525, 0.10101525,
       0.10101525, 0.10101525, 0.10101525])

Therefore, when we embed the trajectory onto the dominant eigenvectors, we ignore the first eigenvector. In order to determine which vectors are dominant, we can examine the eigenvalues for a spectral gap: where the eigenvalues stop decreasing constantly in value.

[5]:

fig, ax = plt.subplots()
ax.plot(dmap.eigenvalues[1:16])

[5]:

[<matplotlib.lines.Line2D at 0x7f75f1291970>]

../../../_images/examples_analysis_reduced_dimensions_diffusion_map_9_1.png

From this plot, we take the first k dominant eigenvectors to be the first five. Below, we transform the trajectory onto these eigenvectors. The time argument is the exponent that the eigenvalues are raised to for embedding. As values increase for time, more dominant eigenvectors (with lower eigenvalues) dominate the diffusion distance more. The transform method returns an array of shape (# frames, # eigenvectors).

[6]:

transformed = dmap.transform(5,  # number of eigenvectors
                     time=1)
transformed.shape

[6]:

(98, 5)

For easier analysis and plotting we can turn the array into a DataFrame.

[7]:

df = pd.DataFrame(transformed,
                  columns=['Mode{}'.format(i+2) for i in range(5)])
df['Time (ps)'] = df.index * u.trajectory.dt
df.head()

[7]:

	Mode2	Mode3	Mode4	Mode5	Mode6	Time (ps)
0	0.094795	0.075950	0.054708	0.035526	0.022757	0.0
1	0.166068	0.132017	0.094409	0.060914	0.038667	1.0
2	0.199960	0.154475	0.107425	0.067632	0.041445	2.0
3	0.228815	0.168694	0.111460	0.067112	0.038469	3.0
4	0.250384	0.171873	0.103407	0.057143	0.028398	4.0

There are several ways we can visualise the data. Using the Seaborn’s PairGrid tool is the quickest and easiest way, if you have seaborn already installed. Each of the subplots below illustrates axes of the lower-dimensional embedding of the higher-dimensional data, such that dots (frames) that are close are kinetically close (connected by a large number of short pathways), whereas greater distance indicates states that are connected by a smaller number of long pathways. Please see [FPKD11] for more information.

Note

You will need to install the data visualisation library Seaborn for this function.

[8]:

import seaborn as sns

g = sns.PairGrid(df, hue='Time (ps)',
                 palette=sns.color_palette('Oranges_d',
                                           n_colors=len(df)))
g.map(plt.scatter, marker='.')

[8]:

<seaborn.axisgrid.PairGrid at 0x7f75f0149520>

../../../_images/examples_analysis_reduced_dimensions_diffusion_map_15_1.png

References

[1] Oliver Beckstein, Elizabeth J. Denning, Juan R. Perilla, and Thomas B. Woolf. Zipping and Unzipping of Adenylate Kinase: Atomistic Insights into the Ensemble of Open↔Closed Transitions. Journal of Molecular Biology, 394(1):160–176, November 2009. 00107. URL: https://linkinghub.elsevier.com/retrieve/pii/S0022283609011164, doi:10.1016/j.jmb.2009.09.009.

[2] Ronald R. Coifman and Stéphane Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, July 2006. 02271. doi:10.1016/j.acha.2006.04.006.

[3] J. de la Porte, B. M. Herbst, W. Hereman, and S. J. van der Walt. An introduction to diffusion maps. In In The 19th Symposium of the Pattern Recognition Association of South Africa. 2008. 00038.

[4] Andrew Ferguson, Athanassios Z. Panagiotopoulos, Ioannis G. Kevrekidis, and Pablo G. Debenedetti. Nonlinear dimensionality reduction in molecular simulation: The diffusion map approach. Chemical Physics Letters, 509(1-3):1–11, June 2011. 00085. doi:10.1016/j.cplett.2011.04.066.

[5] Richard J. Gowers, Max Linke, Jonathan Barnoud, Tyler J. E. Reddy, Manuel N. Melo, Sean L. Seyler, Jan Domański, David L. Dotson, Sébastien Buchoux, Ian M. Kenney, and Oliver Beckstein. MDAnalysis: A Python Package for the Rapid Analysis of Molecular Dynamics Simulations. Proceedings of the 15th Python in Science Conference, pages 98–105, 2016. 00152. URL: https://conference.scipy.org/proceedings/scipy2016/oliver_beckstein.html, doi:10.25080/Majora-629e541a-00e.

[6] Naveen Michaud-Agrawal, Elizabeth J. Denning, Thomas B. Woolf, and Oliver Beckstein. MDAnalysis: A toolkit for the analysis of molecular dynamics simulations. Journal of Computational Chemistry, 32(10):2319–2327, July 2011. 00778. URL: http://doi.wiley.com/10.1002/jcc.21787, doi:10.1002/jcc.21787.

[7] Mary A. Rohrdanz, Wenwei Zheng, Mauro Maggioni, and Cecilia Clementi. Determination of reaction coordinates via locally scaled diffusion map. The Journal of Chemical Physics, 134(12):124116, March 2011. 00220. doi:10.1063/1.3569857.

[8] Douglas L. Theobald. Rapid calculation of RMSDs using a quaternion-based characteristic polynomial. Acta Crystallographica Section A Foundations of Crystallography, 61(4):478–480, July 2005. 00127. URL: http://scripts.iucr.org/cgi-bin/paper?S0108767305015266, doi:10.1107/S0108767305015266.