Calculating the root mean square deviation of atomic structures

We calculate the RMSD of domains in adenylate kinase as it transitions from an open to closed structure, and look at calculating weighted RMSDs.

Last updated: June 26, 2020 with MDAnalysis 1.0.0

Minimum version of MDAnalysis: 1.0.0

Packages required:

See also

Note

MDAnalysis implements RMSD calculation using the fast QCP algorithm ([The05]). Please cite ([The05]) when using the MDAnalysis.analysis.align module in published work.

[1]:
import MDAnalysis as mda
from MDAnalysis.tests.datafiles import PSF, DCD, CRD
from MDAnalysis.analysis import rms

import pandas as pd
# the next line is necessary to display plots in Jupyter
%matplotlib inline

Loading files

The test files we will be working with here feature adenylate kinase (AdK), a phosophotransferase enzyme. ([BDPW09]) The trajectory DCD samples a transition from a closed to an open conformation. AdK has three domains:

  • CORE

  • LID: an ATP-binding domain

  • NMP: an AMP-binding domain

The LID and NMP domains move around the stable CORE as the enzyme transitions between the opened and closed conformations. One way to quantify this movement is by calculating the root mean square deviation (RMSD) of atomic positions.

[2]:
u = mda.Universe(PSF, DCD)  # closed AdK (PDB ID: 1AKE)
ref = mda.Universe(PSF, CRD)  # open AdK (PDB ID: 4AKE)

Background

The root mean square deviation (RMSD) of particle coordinates is one measure of distance, or dissimilarity, between molecular conformations. Each structure should have matching elementwise atoms \(i\) in the same order, as the distance between them is calculated and summed for the final result. It is calculated between coordinate arrays \(\mathbf{x}\) and \(\mathbf{x}^{\text{ref}}\) according to the equation below:

\[\text{RMSD}(\mathbf{x}, \mathbf{x}^{\text{ref}}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n}{|\mathbf{x}_i-\mathbf{x}_i^{\text{ref}}|^2}}\]

As molecules can move around, the structure \(\mathbf{x}\) is usually translated by a vector \(\mathbf{t}\) and rotated by a matrix \(\mathsf{R}\) to align with the reference \(\mathbf{x}^{\text{ref}}\) such that the RMSD is minimised. The RMSD after this optimal superposition can be expressed as follows:

\[\text{RMSD}(\mathbf{x}, \mathbf{x}^{\text{ref}}) = \min_{\mathsf{R}, \mathbf{t}} % \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left[ % (\mathsf{R}\cdot\mathbf{x}_{i}(t) + \mathbf{t}) - \mathbf{x}_{i}^{\text{ref}} \right]^{2}}\]

The RMSD between one reference state and a trajectory of structures is often calculated as a way to measure the dissimilarity of the trajectory conformational ensemble to the reference. This reference is frequently the first frame of the trajectory (the default in MDAnalysis), in which case it can provide insight into the overall movement from the initial starting point. While stable RMSD values from a reference structure are frequently used as a measure of conformational convergence, this metric suffers from the problem of degeneracy: many different structures can have the same RMSD from the same reference. For an alternative measure, you could use pairwise or 2D RMSD.

Typically not all coordinates in a structures are included in an RMSD analysis. With proteins, the fluctuation of the residue side-chains is not representative of overall conformational change. Therefore when RMSD analyses are performed to investigate large-scale movements in proteins, the atoms are usually restricted only to the backbone atoms (forming the amide-bond chain) or the alpha-carbon atoms.

MDAnalysis provides functions and classes to calculate the RMSD between coordinate arrays, and Universes or AtomGroups.

The contribution of each particle \(i\) to the final RMSD value can also be weighted by \(w_i\):

\[\text{RMSD}(\mathbf{x}, \mathbf{x}^\text{ref}) = \sqrt{\frac{\sum_{i=1}^{n}{w_i|\mathbf{x}_i-\mathbf{x}_i^{\text{ref}}|^2}}{\sum_{i-1}^n w_i}}\]

RMSD analyses are frequently weighted by mass. The MDAnalysis RMSD class allows you to both select mass-weighting with weights='mass' or weights_groupselections='mass', or to pass custom arrays into either keyword.

RMSD between two sets of coordinates

The MDAnalysis.analysis.rms.rmsd function returns the root mean square deviation (in Angstrom) between two sets of coordinates. Here, we calculate the RMSD between the backbone atoms of the open and closed conformations of AdK. Only considering the backbone atoms is often more helpful than calculating the RMSD for all the atoms, as movement in amino acid side-chains isn’t indicative of overall conformational change.

[3]:
rms.rmsd(u.select_atoms('backbone').positions,  # coordinates to align
         ref.select_atoms('backbone').positions,  # reference coordinates
         center=True,  # subtract the center of geometry
         superposition=True)  # superimpose coordinates
[3]:
6.823686867261616

RMSD of a Universe with multiple selections

It is more efficient to use the MDAnalysis.analysis.rms.RMSD class to calculate the RMSD of an entire trajectory to a single reference point, than to use the the MDAnalysis.analysis.rms.rmsd function.

The rms.RMSD class first performs a rotational and translational alignment of the target trajectory to the reference universe at ref_frame, using the atoms in select to determine the transformation. The RMSD of the select selection is calculated. Then, without further alignment, the RMSD of each group in groupselections is calculated.

[4]:
CORE = 'backbone and (resid 1-29 or resid 60-121 or resid 160-214)'
LID = 'backbone and resid 122-159'
NMP = 'backbone and resid 30-59'
[5]:
R = rms.RMSD(u,  # universe to align
             u,  # reference universe or atomgroup
             select='backbone',  # group to superimpose and calculate RMSD
             groupselections=[CORE, LID, NMP],  # groups for RMSD
             ref_frame=0)  # frame index of the reference
R.run()
[5]:
<MDAnalysis.analysis.rms.RMSD at 0x11ad568d0>

The data is saved in R.rmsd as an array.

[6]:
R.rmsd.shape
[6]:
(98, 6)

R.rmsd has a row for each timestep. The first two columns of each row are the frame index of the time step, and the time (which is guessed in trajectory formats without timesteps). The third column is RMSD of select. The last few columns are the RMSD of the groups in groupselections.

Plotting the data

We can easily plot this data using the common data analysis package pandas. We turn the R.rmsd array into a DataFrame and label each column below.

[7]:
df = pd.DataFrame(R.rmsd,
                  columns=['Frame', 'Time (ns)',
                           'Backbone', 'CORE',
                           'LID', 'NMP'])

df
[7]:
Frame Time (ns) Backbone CORE LID NMP
0 0.0 1.000000 5.834344e-07 3.921486e-08 1.197000e-07 6.276497e-08
1 1.0 2.000000 4.636592e-01 4.550181e-01 4.871915e-01 4.745572e-01
2 2.0 3.000000 6.419340e-01 5.754418e-01 7.940994e-01 7.270191e-01
3 3.0 4.000000 7.743983e-01 6.739184e-01 1.010261e+00 8.795031e-01
4 4.0 5.000000 8.588600e-01 7.318859e-01 1.168397e+00 9.612989e-01
... ... ... ... ... ... ...
93 93.0 93.999992 6.817898e+00 3.504430e+00 1.143376e+01 1.029266e+01
94 94.0 94.999992 6.804211e+00 3.480681e+00 1.141134e+01 1.029879e+01
95 95.0 95.999992 6.807987e+00 3.508946e+00 1.137593e+01 1.031958e+01
96 96.0 96.999991 6.821205e+00 3.498081e+00 1.139156e+01 1.037768e+01
97 97.0 97.999991 6.820322e+00 3.507119e+00 1.138474e+01 1.036821e+01

98 rows × 6 columns

[8]:
ax = df.plot(x='Frame', y=['Backbone', 'CORE', 'LID', 'NMP'],
             kind='line')
ax.set_ylabel(r'RMSD ($\AA$)')
[8]:
Text(0, 0.5, 'RMSD ($\\AA$)')
../../../_images/examples_analysis_alignment_and_rms_rmsd_18_1.png

RMSD of an AtomGroup with multiple selections

The RMSD class accepts both AtomGroups and Universes as arguments. Restricting the atoms considered to an AtomGroup can be very helpful, as the select and groupselections arguments are applied only to the atoms in the original AtomGroup. In the example below, for example, only the alpha-carbons of the CORE domain are incorporated in the analysis.

Note

This feature does not currently support groupselections.

[9]:
ca = u.select_atoms('name CA')

R = rms.RMSD(ca, ca, select=CORE, ref_frame=0)
R.run()
[9]:
<MDAnalysis.analysis.rms.RMSD at 0x11a837ed0>
[10]:
df = pd.DataFrame(R.rmsd,
                  columns=['Frame', 'Time (ns)',
                           'CORE'])

ax = df.plot(x='Frame', y='CORE', kind='line')
ax.set_ylabel('RMSD ($\AA$)')
[10]:
Text(0, 0.5, 'RMSD ($\\AA$)')
../../../_images/examples_analysis_alignment_and_rms_rmsd_21_1.png

Weighted RMSD of a trajectory

You can also calculate the weighted RMSD of a trajectory using the weights and weights_groupselections keywords. The former only applies weights to the group in select, while the latter must be a list of the same length and order as groupselections. If you would like to only weight certain groups in groupselections, use None for the unweighted groups. Both weights and weights_groupselections accept None (for unweighted), 'mass' (to weight by mass), and custom arrays. A custom array should have the same number of values as there are particles in the corresponding AtomGroup.

Mass

It is common to weight RMSD analyses by particle weight, so the RMSD class accepts 'mass' as an argument for weights. Note that the weights keyword only applies to the group in select, and does not apply to the groupselections; these remain unweighted below.

[11]:
R_mass = rms.RMSD(u, u,
                  select='protein and name CA',
                  weights='mass',
                  groupselections=[CORE, LID, NMP])
R_mass.run()
[11]:
<MDAnalysis.analysis.rms.RMSD at 0x11a87d890>

The plot looks largely the same as above because all the alpha-carbons and individual backbone atoms have the same mass. Below we show how you can pass in custom weights.

[12]:
df_mass = pd.DataFrame(R_mass.rmsd,
                       columns=['Frame',
                                'Time (ns)',
                                'C-alphas', 'CORE',
                                'LID', 'NMP'])
ax_mass = df_mass.plot(x='Frame',
                       y=['C-alphas', 'CORE', 'LID', 'NMP'])
ax_mass.set_ylabel('Mass-weighted RMSD ($\AA$)');
../../../_images/examples_analysis_alignment_and_rms_rmsd_27_0.png

Custom weights

You can also pass in an array of values for the weights. This must have the same length as the number of atoms in your selection groups. In this example, we pass in arrays of residue numbers. This is not a typical weighting you might choose, but we use it here to show how you can get different results to the previous graphs for mass-weighted and non-weighted RMSD. You could choose to pass in an array of charges, or your own custom value.

First we select the atom groups to make this easier.

[13]:
ag = u.select_atoms('protein and name CA')
print('Shape of C-alpha charges:', ag.charges.shape)
core = u.select_atoms(CORE)
lid = u.select_atoms(LID)
nmp = u.select_atoms(NMP)
Shape of C-alpha charges: (214,)

Below, we pass in weights for ag with the weights keyword, and weights for the groupselections with the weights_groupselections keyword.

[14]:
R_charge = rms.RMSD(u, u,
                    select='protein and name CA',
                    groupselections=[CORE, LID, NMP],
                    weights=ag.resids,
                    weights_groupselections=[core.resids,
                                             lid.resids,
                                             nmp.resids])
R_charge.run()
[14]:
<MDAnalysis.analysis.rms.RMSD at 0x11a84e250>

You can see in the below graph that the NMP domain has much higher RMSD than before, as each particle is weighted by its residue number. Conversely, the CORE domain doesn’t really change much. We could potentially infer from this that residues later in the NMP domain (with a higher residue number) are mobile during the length of the trajectory, whereas residues later in the CORE domain do not seem to contribute significantly to the RMSD.

However, again, this is a very non-conventional metric and is shown here simply to demonstrate how to use the code.

[15]:
df_charge = pd.DataFrame(R_charge.rmsd,
                         columns=['Frame',
                                'Time (ns)',
                                'C-alphas', 'CORE',
                                'LID', 'NMP'])
ax_charge = df_charge.plot(x='Frame',
                           y=['C-alphas', 'CORE', 'LID', 'NMP'])
ax_charge.set_ylabel('Charge-weighted RMSD ($\AA$)')
[15]:
Text(0, 0.5, 'Charge-weighted RMSD ($\\AA$)')
../../../_images/examples_analysis_alignment_and_rms_rmsd_34_1.png

References

[1] Oliver Beckstein, Elizabeth J. Denning, Juan R. Perilla, and Thomas B. Woolf. Zipping and Unzipping of Adenylate Kinase: Atomistic Insights into the Ensemble of Open↔Closed Transitions. Journal of Molecular Biology, 394(1):160–176, November 2009. 00107. URL: https://linkinghub.elsevier.com/retrieve/pii/S0022283609011164, doi:10.1016/j.jmb.2009.09.009.

[2] Richard J. Gowers, Max Linke, Jonathan Barnoud, Tyler J. E. Reddy, Manuel N. Melo, Sean L. Seyler, Jan Domański, David L. Dotson, Sébastien Buchoux, Ian M. Kenney, and Oliver Beckstein. MDAnalysis: A Python Package for the Rapid Analysis of Molecular Dynamics Simulations. Proceedings of the 15th Python in Science Conference, pages 98–105, 2016. 00152. URL: https://conference.scipy.org/proceedings/scipy2016/oliver_beckstein.html, doi:10.25080/Majora-629e541a-00e.

[3] Naveen Michaud-Agrawal, Elizabeth J. Denning, Thomas B. Woolf, and Oliver Beckstein. MDAnalysis: A toolkit for the analysis of molecular dynamics simulations. Journal of Computational Chemistry, 32(10):2319–2327, July 2011. 00778. URL: http://doi.wiley.com/10.1002/jcc.21787, doi:10.1002/jcc.21787.

[4] Douglas L. Theobald. Rapid calculation of RMSDs using a quaternion-based characteristic polynomial. Acta Crystallographica Section A Foundations of Crystallography, 61(4):478–480, July 2005. 00127. URL: http://scripts.iucr.org/cgi-bin/paper?S0108767305015266, doi:10.1107/S0108767305015266.