pyPcazip / Home


Introduction

Principal Component Analysis (PCA) forms the basis of a powerful range of simulation analysis methods, and also provides a very efficient approach to trajectory data compression. The pyPcazip package is a re-implementation by the ExTASY project (www.extasy-project.org) of the PCAZIP toolkit developed and distributed by the Laughton Group at the University of Nottingham, UK, and the Orozco Group at the University of Barcelona, Spain.

The pyPcazip package provides four tools, used from the command line:

  • pyPcazip: Performs PCA on one or more trajectory files, outputting a compressed binary "pcz file".
  • pyPcaunzip: Decompresses a pcz file to regenerate a trajectory file.
  • pyPczdump: Extracts selected information (e.g. eigenvalues, projections, ...) from a pcz file.
  • pyPczcomp: Generates metrics to compare the contents of one pcz file against another (e.g. dot product matrices).

In addition, a trial version of a graphical visualisation tool, 'pyPczplot', is included; this will be extended and properly supported in future releases.

Applications

Data Compression:

pyPcazip can compress MD trajectories to a small fraction (a few percent) of their original size with no significant loss of information. It is compatible with most of the common trajectory file formats (AMBER, CHARMM, GROMACS, NAMD, etc.). The trade-off between the degree of compression and precision (expressed as the fraction of the total variance that is retained) is under complete user control.

[Figure: pyPcazip_2.0_storage.png]

The compressed files can be loaded directly into VMD if the required molfile plugin is installed.
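
The trade-off between compression and retained variance described in this section can be illustrated with a few lines of NumPy. This is a minimal sketch of the general idea only: it is not the pyPcazip implementation and says nothing about the pcz file format. The function names (pca_compress, pca_decompress), the synthetic "trajectory" and the 0.95 variance target are all assumptions made for the example.

```python
import numpy as np

def pca_compress(coords, variance_fraction=0.9):
    """Keep the leading principal components of a (n_frames, n_coords) array.

    Returns the mean structure, the retained eigenvectors (as rows) and the
    per-frame projections: the three pieces needed to rebuild the data.
    """
    mean = coords.mean(axis=0)
    centred = coords - mean
    # SVD of the centred data; the squared singular values are proportional
    # to the variance captured by each principal component.
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    # Smallest number of components whose cumulative variance reaches the target.
    n_keep = min(int(np.searchsorted(cumulative, variance_fraction)) + 1, len(s))
    evecs = vt[:n_keep]
    projections = np.dot(centred, evecs.T)
    return mean, evecs, projections

def pca_decompress(mean, evecs, projections):
    """Rebuild approximate coordinates from the compressed representation."""
    return mean + np.dot(projections, evecs)

# Synthetic data (hypothetical): 500 frames of 90 coordinates whose motion is
# dominated by three collective modes plus a little random noise.
rng = np.random.RandomState(0)
n_frames, n_coords = 500, 90
modes = rng.normal(size=(3, n_coords))
amplitudes = rng.normal(size=(n_frames, 3)) * [5.0, 3.0, 2.0]
coords = np.dot(amplitudes, modes) + 0.1 * rng.normal(size=(n_frames, n_coords))

mean, evecs, proj = pca_compress(coords, variance_fraction=0.95)
restored = pca_decompress(mean, evecs, proj)

# Numbers stored after compression, relative to the original array size.
ratio = (mean.size + evecs.size + proj.size) / float(coords.size)
rms_error = np.sqrt(np.mean((restored - coords) ** 2))
print("components kept: %d" % evecs.shape[0])
print("stored / original size: %.3f" % ratio)
print("RMS reconstruction error: %.3f" % rms_error)
```

Raising variance_fraction keeps more components, which lowers the reconstruction error and increases the stored size; lowering it does the opposite.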

Data Analysis:

Trajectory compression with pyPcazip is the gateway to a range of analysis methods that yield objective, quantitative and comparative metrics for convergence and sampling, and for the similarity between one trajectory and another.

  • Checking data quality: Metrics extracted from PCA provide a robust check on issues such as equilibration and sampling. Below (left panel) we see an example where conventional RMSD analysis suggests that a simulation has equilibrated after about 200 ps (blue), while plotting the projection of the first principal component (red) reveals a rather different story: a major conformational change at 900 ps.

  • Analysing conformational change: Animations of principal components can give a clear picture of major concerted motions, e.g. interdomain dynamics. Histograms of projection data in low-dimensional spaces (typically 2D or 3D, centre panel) provide an effective and visually attractive way to identify sub-states and conformational sampling.

  • Quantitative comparisons of trajectories: Principal component analysis is a powerful tool for the comparative analysis of multiple datasets, evaluating issues such as the key differences and similarities between dynamical behaviour in different environments (right panel).

[Figure, three panels. Left: checking data quality (equilibration.jpg). Centre: analysing conformational sampling (histogram.jpg). Right: quantitative comparison of trajectories (dotproduct.jpg).]
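
The three kinds of analysis listed above can be sketched with NumPy alone. The example below is an illustration under stated assumptions, not the output of pyPczdump or pyPczcomp: the pca helper, the synthetic trajectories (with an artificial jump at frame 900 standing in for a late conformational change) and the choice of three eigenvectors for the comparison are all hypothetical.

```python
import numpy as np

def pca(coords):
    """Return (mean, eigenvalues, eigenvectors as rows), largest variance first."""
    mean = coords.mean(axis=0)
    _, s, vt = np.linalg.svd(coords - mean, full_matrices=False)
    evals = s ** 2 / (coords.shape[0] - 1)
    return mean, evals, vt

rng = np.random.RandomState(1)
n_frames, n_coords = 1000, 60

# Synthetic trajectory A: small fluctuations plus a jump along one collective
# direction at frame 900.
direction = rng.normal(size=n_coords)
direction /= np.linalg.norm(direction)
state = np.where(np.arange(n_frames) < 900, 0.0, 10.0)
traj_a = 0.3 * rng.normal(size=(n_frames, n_coords)) + np.outer(state, direction)

mean_a, evals_a, evecs_a = pca(traj_a)

# 1. Data quality: projection of every frame onto the first two PCs.
#    Plotting pc1 against frame number would show the jump at frame 900.
proj_a = np.dot(traj_a - mean_a, evecs_a[:2].T)
pc1 = proj_a[:, 0]
print("mean PC1 projection, frames 0-899:  %.2f" % pc1[:900].mean())
print("mean PC1 projection, frames 900-999: %.2f" % pc1[900:].mean())

# 2. Conformational sampling: 2D histogram in the PC1/PC2 plane.
hist, pc1_edges, pc2_edges = np.histogram2d(proj_a[:, 0], proj_a[:, 1], bins=20)
print("occupied bins: %d of %d" % (np.count_nonzero(hist), hist.size))

# 3. Comparison: dot products between the leading eigenvectors of two
#    trajectories. Trajectory B moves along the same direction, but early on.
traj_b = 0.3 * rng.normal(size=(n_frames, n_coords)) + np.outer(state[::-1], direction)
_, _, evecs_b = pca(traj_b)
n_top = 3
dot_matrix = np.dot(evecs_a[:n_top], evecs_b[:n_top].T)
print("dot product matrix (rows: A, columns: B):")
print(np.round(dot_matrix, 2))
```

The sign of an eigenvector is arbitrary, so it is the magnitude of each dot product that indicates how similar two modes are.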

Installation

On Unix-based systems that already have numpy, scipy, and cython installed, pyPcazip can be installed in a single step:

pip install --user pyPcazip
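
Before running the command above, it can help to confirm that the three prerequisites are importable from the Python interpreter you intend to use. The helper below is a small sketch for that purpose; missing_prerequisites is a hypothetical name, and the only assumption about the packages is their import names (numpy, scipy, and Cython with a capital C).

```python
import importlib

def missing_prerequisites(modules=("numpy", "scipy", "Cython")):
    """Return the names in `modules` that cannot be imported."""
    missing = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

missing = missing_prerequisites()
if missing:
    print("Install these first: " + ", ".join(missing))
else:
    print("All prerequisites found.")
```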

Installing in a virtual environment using Miniconda:

(instructions courtesy of Marco Pasi):

Download the miniconda installer from here, and run it, e.g.:

bash Miniconda2-latest-Linux-x86_64.sh

Using miniconda becomes easier if we add the miniconda installation to our path:

export PATH=~/miniconda2/bin:$PATH

(change ~/miniconda2 if you've customised the installation location during the installation procedure). For the use of conda see also the official documentation: http://conda.pydata.org/docs.

Now create an environment where we will install pyPcazip and all its dependencies (here using python2.7):

conda create -n pypcazip python=2.7

Activate the environment and install all dependencies using the conda package manager:

source activate pypcazip
conda install numpy scipy cython h5py netcdf4

The pyPcazip package is not (yet) available from the Conda repositories, so we install it using pip instead:

pip install pyPcazip

(this will also install the MDTraj package). Done! Now you can test your installation as specified above.
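
One quick sanity check, offered here as a sketch rather than as the project's own test procedure, is to confirm that the four command-line tools can be found on your PATH. It assumes the installed executables carry the tool names listed in the Introduction; find_on_path is a hypothetical helper written for this example.

```python
import os

# Tool names as listed in the Introduction; assumed to match the executables.
TOOLS = ("pyPcazip", "pyPcaunzip", "pyPczdump", "pyPczcomp")

def find_on_path(name, path=None):
    """Return the full path of executable `name`, or None if it is not found."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None

for tool in TOOLS:
    location = find_on_path(tool)
    print("%-12s %s" % (tool, location or "NOT FOUND on PATH"))
```

If a tool is reported as not found after a `pip install --user`, the user-level script directory (commonly ~/.local/bin on Linux) may simply be missing from PATH.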

Installation from source:

For details on how to install pyPcazip from source code and on adding support for extra features such as reading AMBER netcdf format trajectory files, please go to any of the following sections:

Introductory tutorial

An introduction to the use of the pyPcazip package is available here

The MDPlus API

When you install pyPcazip you also get the MDPlus python library that powers the command line tools. This can be used within your own python scripts: see here for a guide to the API.

Developer notes

Information related to pyPcazip development

FAQs

  • pyPcazip doesn't seem to recognise the format of my trajectory/topology file. The pyPcazip tools identify file formats by their extension, using the underlying MDTraj package.
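
To make the idea of extension-based detection concrete, here is a small stand-alone sketch. It does not call MDTraj and is not how MDTraj is implemented; the table is an illustrative subset of common extensions, and guess_format is a hypothetical helper. In this sketch the match is on the literal extension, so a file with an unusual or missing extension is not recognised even if its contents are valid.

```python
import os

# Illustrative subset only; consult the MDTraj documentation for the
# formats and extensions it actually supports.
KNOWN_EXTENSIONS = {
    ".pdb": "PDB",
    ".xtc": "GROMACS XTC",
    ".trr": "GROMACS TRR",
    ".dcd": "CHARMM/NAMD DCD",
    ".nc": "AMBER NetCDF",
    ".mdcrd": "AMBER ASCII trajectory",
    ".h5": "MDTraj HDF5",
}

def guess_format(filename):
    """Return a format label based purely on the file extension, or None."""
    extension = os.path.splitext(filename)[1]
    return KNOWN_EXTENSIONS.get(extension)

for name in ("run1.dcd", "md.prod.xtc", "run1.traj"):
    print("%-12s -> %s" % (name, guess_format(name) or "unrecognised extension"))
```

When a file is not recognised, renaming it (or creating a symbolic link) with the conventional extension for its format is usually the simplest fix.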

Further details

A pdf document outlining the theory behind MD trajectory file compression using PCA and the pyPcazip tools developed to implement this is available here.
