Principal component analysis of a trajectory¶

Here we compute the principal component analysis of a trajectory.

Last executed: May 19, 2021 with MDAnalysis 1.1.1

Last updated: Dec 2020

Minimum version of MDAnalysis: 1.0.0

Packages required:

Optional packages for visualisation:

[1]:

import MDAnalysis as mda
from MDAnalysis.tests.datafiles import PSF, DCD
from MDAnalysis.analysis import pca, align
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import nglview as nv
import warnings
# suppress some MDAnalysis warnings about writing PDB files
warnings.filterwarnings('ignore')
%matplotlib inline


The test files we will be working with here feature adenylate kinase (AdK), a phosophotransferase enzyme. ([BDPW09]) The trajectory DCD samples a transition from a closed to an open conformation.

[2]:

u = mda.Universe(PSF, DCD)


Principal component analysis¶

Principal component analysis (PCA) is a statistical technique that decomposes a system of observations into linearly uncorrelated variables called principal components. These components are ordered so that the first principal component accounts for the largest variance in the data, and each following component accounts for lower and lower variance. PCA is often applied to molecular dynamics trajectories to extract the large-scale conformational motions or “essential dynamics” of a protein. The frame-by-frame conformational fluctuation can be considered a linear combination of the essential dynamics yielded by the PCA.

In MDAnalysis, the method implemented in the PCA class (API docs) is as follows:

1. Optionally align each frame in your trajectory to the first frame.

2. Construct a 3N x 3N covariance for the N atoms in your trajectory. Optionally, you can provide a mean; otherwise the covariance is to the averaged structure over the trajectory.

3. Diagonalise the covariance matrix. The eigenvectors are the principal components, and their eigenvalues are the associated variance.

4. Sort the eigenvalues so that the principal components are ordered by variance.

Note

Principal component analysis algorithms are deterministic, but the solutions are not unique. For example, you could easily change the sign of an eigenvector without altering the PCA. Different algorithms are likely to produce different answers, due to variations in implementation. MDAnalysis may not return the same values as another package.

You can choose how many principal components to save from the analysis with n_components. The default value is None, which saves all of them. You can also pass a mean reference structure to be used in calculating the covariance matrix. With the default value of None, the covariance uses the mean coordinates of the trajectory.

[3]:

pc = pca.PCA(u, select='backbone',
align=True, mean=None,
n_components=None).run()


The principal components are saved in pc.p_components. If you kept all the components, you should have an array of shape $$(n_{atoms}\times3, n_{atoms}\times3)$$.

[4]:

backbone = u.select_atoms('backbone')
n_bb = len(backbone)
print('There are {} backbone atoms in the analysis'.format(n_bb))
print(pc.p_components.shape)

There are 855 backbone atoms in the analysis
(2565, 2565)


The variance of each principal component is in pc.variance. For example, to get the variance explained by the first principal component:

[5]:

pc.variance[0]

[5]:

4203.1902583353685


This variance is somewhat meaningless by itself. It is much more intuitive to consider the variance of a principal component as a percentage of the total variance in the data. MDAnalysis also tracks the percentage cumulative variance in pc.cumulated_variance. As shown below, the first principal component contains 90.3% the total trajectory variance. The first three components combined account for 96.4% of the total variance.

[6]:

for i in range(3):
print(f"Cumulated variance: {pc.cumulated_variance[i]:.3f}")

Cumulated variance: 0.903
Cumulated variance: 0.951
Cumulated variance: 0.964

[7]:

plt.plot(pc.cumulated_variance[:10])
plt.xlabel('Principal component')
plt.ylabel('Cumulative variance');


Visualising projections into a reduced dimensional space¶

The pc.transform() method transforms a given atom group into weights $$\mathbf{w}_i$$ over each principal component $$i$$.

$\mathbf{w}_i(t) = (\mathbf{r}(t)-\mathbf{\overline{r}}) \cdot \mathbf{u}_i$

$$\mathbf{r}(t)$$ are the atom group coordinates at time $$t$$, $$\mathbf{\overline{r}}$$ are the mean coordinates used in the PCA, and $$\mathbf{u}_i$$ is the $$i$$th principal component eigenvector $$\mathbf{u}$$.

While the given atom group must have the same number of atoms that the principal components were calculated over, it does not have to be the same group.

Again, passing n_components=None will tranform your atom group over every component. Below, we limit the output to projections over 5 principal components only.

[8]:

transformed = pc.transform(backbone, n_components=5)
transformed.shape

[8]:

(98, 5)


The output has the shape (n_frames, n_components). For easier analysis and plotting we can turn the array into a DataFrame.

[9]:

df = pd.DataFrame(transformed,
columns=['PC{}'.format(i+1) for i in range(5)])
df['Time (ps)'] = df.index * u.trajectory.dt

[9]:

PC1 PC2 PC3 PC4 PC5 Time (ps)
0 118.408403 29.088239 15.746624 -8.344273 -1.778052 0.0
1 115.563166 26.787417 14.653191 -6.619938 -0.629419 1.0
2 112.678994 25.036888 12.919944 -4.319501 -0.158978 2.0
3 110.342249 24.309839 11.431563 -3.884708 -0.168643 3.0
4 107.591883 23.472642 11.625078 -2.273100 -1.479566 4.0

There are several ways we can visualise the data. Using the Seaborn’s PairGrid tool is the quickest and easiest way, if you have seaborn already installed.

Note

You will need to install the data visualisation library Seaborn for this function.

[10]:

import seaborn as sns

g = sns.PairGrid(df, hue='Time (ps)',
palette=sns.color_palette('Oranges_d',
n_colors=len(df)))
g.map(plt.scatter, marker='.')

[10]:

<seaborn.axisgrid.PairGrid at 0x7fe4897b4f10>


Another way to investigate the essential motions of the trajectory is to project the original trajectory onto each of the principal components, to visualise the motion of the principal component. The product of the weights $$\mathbf{w}_i(t)$$ for principal component $$i$$ with the eigenvector $$\mathbf{u}_i$$ describes fluctuations around the mean on that axis, so the projected trajectory $$\mathbf{r}_i(t)$$ is simply the fluctuations added onto the mean positions $$\mathbf{\overline{r}}$$.

$\mathbf{r}_i(t) = \mathbf{w}_i(t) \times \mathbf{u}_i + \mathbf{\overline{r}}$

Below, we generate the projected coordinates of the first principal component. The mean positions are stored at pc.mean.

[11]:

pc1 = pc.p_components[:, 0]
trans1 = transformed[:, 0]
projected = np.outer(trans1, pc1) + pc.mean
coordinates = projected.reshape(len(trans1), -1, 3)


We can create a new universe from this to visualise the movement over the first principal component.

[12]:

proj1 = mda.Merge(backbone)

[12]:

<Universe with 855 atoms>

[13]:

view = nv.show_mdanalysis(proj1.atoms)
view


If you have nglview installed, you can view the trajectory in the notebook. Otherwise, you can write the trajectory out to a file and use another program such as VMD. Below, we create a movie of the component.

[14]:

from nglview.contrib.movie import MovieMaker
movie = MovieMaker(view, output='pc1.gif', in_memory=True)
movie.make()

You have to install moviepy, imageio and ffmeg
pip install moviepy==0.2.2.11
pip install imageio==1.6


Measuring convergence with cosine content¶

The essential modes of a trajectory usually describe global, collective motions. The cosine content of a principal component can be interpreted to determine whether proteins are transitioning between conformational states. However, random diffusion can also appear to produce collective motion. The cosine content can measure the convergence of a trajectory and indicate poor sampling.

The cosine content of a principal component measures how similar it is to a cosine shape. Values range from 0 (no similarity to a cosine) and 1 (a perfect cosine shape). If the values of the first few principal components are close to 1, this can indicate poor sampling, as the motion of the particles may not be distinguished from random diffusion. Values below 0.7 do not indicate poor sampling.

Note

[Hes02] first published the usage of cosine content to evaluate sampling. Please cite this paper when using the MDAnalysis.analysis.pca.cosine_content method in published work.

Below we calculate the cosine content of the first five principal components in the transformed subspace. Note that this is an example only, to dmonstrate how to use the method; the first few principal components of short simulations always represent random diffusion ([Hes02]).

[15]:

for i in range(5):
cc = pca.cosine_content(transformed, i)
print(f"Cosine content for PC {i+1} = {cc:.3f}")

Cosine content for PC 1 = 0.960
Cosine content for PC 2 = 0.902
Cosine content for PC 3 = 0.723
Cosine content for PC 4 = 0.621
Cosine content for PC 5 = 0.699


As can be seen, the cosine content of each component is quite high. If we plot the transformed components over time, we can see that each component does resemble a cosine curve.

[16]:

# melt the dataframe into a tidy format
melted = pd.melt(df, id_vars=["Time (ps)"],
var_name="PC",
value_name="Value")
g = sns.FacetGrid(melted, col="PC")
g.map(sns.lineplot,
"Time (ps)", # x-axis
"Value", # y-axis
ci=None) # no confidence interval

[16]:

<seaborn.axisgrid.FacetGrid at 0x7fa10ebbac10>


References¶

[1] Oliver Beckstein, Elizabeth J. Denning, Juan R. Perilla, and Thomas B. Woolf. Zipping and Unzipping of Adenylate Kinase: Atomistic Insights into the Ensemble of Open↔Closed Transitions. Journal of Molecular Biology, 394(1):160–176, November 2009. 00107. URL: https://linkinghub.elsevier.com/retrieve/pii/S0022283609011164, doi:10.1016/j.jmb.2009.09.009.

[2] Richard J. Gowers, Max Linke, Jonathan Barnoud, Tyler J. E. Reddy, Manuel N. Melo, Sean L. Seyler, Jan Domański, David L. Dotson, Sébastien Buchoux, Ian M. Kenney, and Oliver Beckstein. MDAnalysis: A Python Package for the Rapid Analysis of Molecular Dynamics Simulations. Proceedings of the 15th Python in Science Conference, pages 98–105, 2016. 00152. URL: https://conference.scipy.org/proceedings/scipy2016/oliver_beckstein.html, doi:10.25080/Majora-629e541a-00e.

[3] Berk Hess. Convergence of sampling in protein simulations. Physical Review E, 65(3):031910, March 2002. 00348. URL: https://link.aps.org/doi/10.1103/PhysRevE.65.031910, doi:10.1103/PhysRevE.65.031910.

[4] Gia G. Maisuradze, Adam Liwo, and Harold A. Scheraga. Principal component analysis for protein folding dynamics. Journal of molecular biology, 385(1):312–329, January 2009. URL: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2652707/, doi:10.1016/j.jmb.2008.10.018.

[5] Naveen Michaud-Agrawal, Elizabeth J. Denning, Thomas B. Woolf, and Oliver Beckstein. MDAnalysis: A toolkit for the analysis of molecular dynamics simulations. Journal of Computational Chemistry, 32(10):2319–2327, July 2011. 00778. URL: http://doi.wiley.com/10.1002/jcc.21787, doi:10.1002/jcc.21787.