Principal Component Analysis (PCA) forms the basis of a powerful range of simulation analysis methods, and also provides an approach to very efficient trajectory data compression. The pyPcazip package is a re-implementation, by the ExTASY project (www.extasy-project.org), of the PCAZIP toolkit developed and distributed by the Laughton Group at the University of Nottingham, UK, and the Orozco group at the University of Barcelona, Spain.
The pyPcazip package provides four tools, used from the command-line:
- pyPcazip: Performs PCA on one or more trajectory files, outputting a compressed binary "pcz file".
- pyPcaunzip: Can uncompress a pcz file to regenerate a trajectory file.
- pyPczdump: Extracts selected information (e.g. eigenvalues, projections, ...) from a pcz file.
- pyPczcomp: Generates metrics to compare the contents of one pcz file against another (e.g. dot product matrices).
In addition, a trial version of a graphical visualisation tool, 'pyPczplot', is also included; this will be extended and properly supported in future releases.
## Data Compression
pyPcazip can compress MD trajectories to a small fraction (a few percent) of their original size with no significant loss of information. pyPcazip is compatible with most of the common trajectory file formats (AMBER, CHARMM, GROMACS, NAMD, etc.). The trade-off between the degree of compression and precision (measured as the fraction of the total variance that is retained) is under complete user control.
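The trade-off between compression and retained variance can be illustrated with a minimal NumPy sketch. This is not the pyPcazip implementation: the function names `pca_compress` and `pca_decompress` are hypothetical, and the frames are assumed to be already superimposed and flattened to one row of coordinates per frame.

```python
import numpy as np


def pca_compress(coords, variance_fraction=0.9):
    """Compress an (n_frames, n_coords) array by keeping the leading PCs.

    Hypothetical helper for illustration only; assumes the frames have
    already been superimposed on a common reference.
    """
    mean = coords.mean(axis=0)
    centred = coords - mean
    # Rows of vt are the principal components; s**2 is proportional
    # to the variance captured by each component.
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    variances = s ** 2
    cumulative = np.cumsum(variances) / variances.sum()
    # Smallest number of components whose cumulative variance reaches the target.
    k = int(np.searchsorted(cumulative, variance_fraction)) + 1
    k = min(k, len(s))
    evecs = vt[:k]
    projections = centred @ evecs.T
    return mean, evecs, projections


def pca_decompress(mean, evecs, projections):
    """Rebuild approximate coordinates from the stored mean, PCs and projections."""
    return mean + projections @ evecs
```

Storing the mean structure, `k` eigenvectors and `k` projections per frame, instead of every coordinate of every frame, is what gives the size reduction; raising `variance_fraction` keeps more components and so compresses less.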
## Data Analysis
Trajectory compression with pyPcazip opens the way to a range of analysis methods that yield objective, quantitative and comparative metrics related to convergence and sampling, and to the similarity between one trajectory and another.
Checking data quality: Metrics extracted from PCA provide a robust check on issues such as equilibration and sampling. Below (left panel) we see an example where conventional RMSD analysis predicts that a simulation has equilibrated after about 200 ps (blue), while plotting the projection of the first principal component (red) reveals a rather different story – a major conformational change at 900 ps.
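As a rough illustration of the two quantities being compared here, the following sketch computes an RMSD time series and the projection onto the first principal component with NumPy. The function names are hypothetical, the frames are assumed to be pre-aligned and flattened to `(n_frames, n_atoms * 3)`, and this is not how the pyPcazip tools themselves are invoked.

```python
import numpy as np


def rmsd_from_first(coords):
    """RMSD of every frame from the first frame (frames assumed pre-aligned)."""
    n_atoms = coords.shape[1] // 3
    diff = coords - coords[0]
    return np.sqrt((diff ** 2).sum(axis=1) / n_atoms)


def first_pc_projection(coords):
    """Projection of each frame onto the first principal component."""
    centred = coords - coords.mean(axis=0)
    # The first row of vt is the direction of largest variance.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[0]
```

Plotting both series against time makes it easy to spot a late step change in the projection that an RMSD plateau alone might hide. Note that the sign of a principal component is arbitrary, so only changes in the projection are meaningful, not its sign.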
Analysing conformational change: Animations of principal components can give a clear picture of major concerted motions – e.g. interdomain dynamics. Histograms of projection data in low dimensional spaces (typically 2D or 3D, centre panel) provide an effective and visually attractive way to identify sub-states and conformational sampling.
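A 2D histogram of this kind can be sketched with NumPy as follows. The function name `projection_histogram` is hypothetical and the frames are assumed to be pre-aligned; in practice the projections would come from a pcz file rather than being recomputed.

```python
import numpy as np


def projection_histogram(coords, bins=20):
    """2D histogram of the projections onto the first two principal components."""
    centred = coords - coords.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    proj = centred @ vt[:2].T  # shape (n_frames, 2)
    counts, xedges, yedges = np.histogram2d(proj[:, 0], proj[:, 1], bins=bins)
    return counts, xedges, yedges
```

Well-separated dense regions in `counts` are candidates for distinct sub-states.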
Quantitative comparisons of trajectories: Principal component analysis is a powerful tool for the comparative analysis of multiple datasets, evaluating issues such as the key differences and similarities between dynamical behaviour in different environments (right panel).
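One common way to compare two trajectories is through the dot products between their leading eigenvectors. The sketch below builds such a dot product matrix and summarises it as a root-mean-square inner product (RMSIP), a widely used subspace overlap measure. The function names are hypothetical and this is not claimed to be the exact metric that pyPczcomp reports.

```python
import numpy as np


def principal_components(coords, n_vecs):
    """Leading n_vecs principal components (rows) of pre-aligned frames."""
    centred = coords - coords.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return vt[:n_vecs]


def dot_product_matrix(evecs_a, evecs_b):
    """Matrix of dot products between two sets of eigenvectors."""
    return evecs_a @ evecs_b.T


def rmsip(evecs_a, evecs_b):
    """Root-mean-square inner product: 1 for identical subspaces, 0 for orthogonal ones."""
    d = dot_product_matrix(evecs_a, evecs_b)
    return float(np.sqrt((d ** 2).sum() / evecs_a.shape[0]))
```

Both eigenvector sets must be defined over the same atoms in the same order for the dot products to be meaningful.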
Figure panels (left to right): checking data quality; analysing conformational sampling; quantitative comparison of trajectories.
On unix-based systems that already have numpy, scipy, and cython installed, pyPcazip can be installed in a single step:
pip install --user pyPcazip
Installing in a virtual environment using Miniconda:
(instructions courtesy of Marco Pasi):
Download the miniconda installer from here, and run it, e.g.:
Using miniconda becomes easier if we add the miniconda installation to our path:
(replace ~/miniconda2 with your own path if you customised the installation location during the installation procedure). For the use of conda, see also the official documentation: http://conda.pydata.org/docs.
Now create an environment where we will install pyPcazip and all its dependencies (here using python2.7):
conda create -n pypcazip python=2.7
Activate the environment and install all dependencies using the conda package manager:
source activate pypcazip
conda install numpy scipy cython h5py netcdf4
The pyPcazip package is not available from the Conda repositories (yet), so we install it using pip:
pip install pyPcazip
(this will also install the MDTraj package). Done! Now you can test your installation as specified above.
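As a quick sanity check of the environment, a short script along the following lines can confirm that the dependencies are importable. The helper `check_modules` is hypothetical, and the import names are assumed from the packages listed above (for example, the `netcdf4` conda package is imported as `netCDF4`). The script is written to run under both Python 2.7 and Python 3.

```python
import importlib


def check_modules(names):
    """Return a dict mapping each module name to True if it can be imported."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    # Import names are assumptions based on the packages installed above.
    wanted = ["numpy", "scipy", "Cython", "h5py", "netCDF4", "mdtraj"]
    for name, ok in sorted(check_modules(wanted).items()):
        print("{}: {}".format(name, "found" if ok else "MISSING"))
```

This only checks that each module can be imported; it does not verify versions or exercise the pyPcazip command-line tools.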
Installation from source:
For details on how to install pyPcazip from source code and on adding support for extra features such as reading AMBER netcdf format trajectory files, please go to any of the following sections:
An introduction to the use of the pyPcazip package is available here
The MDPlus API
When you install pyPcazip you also get the MDPlus python library that powers the command line tools. This can be used within your own python scripts: see here for a guide to the API.
- pyPcazip doesn't seem to recognise the format of my trajectory/topology file. The pyPcazip tools identify file formats by their extension, using the underlying MDTraj package.
A pdf document outlining the theory behind MD trajectory file compression using PCA and the pyPcazip tools developed to implement this is available here.