Wiki

Clone wiki

pyPcazip / Tutorial

Home

    #1. Introduction.

    In this tutorial we demonstrate how the pyPCAzip toolkit can be used to assess convergence and sampling in MD and compare dynamics between different data sets.

    The data you will be using comes from MD simulations on the mouse Major Urinary Protein (MUP), which is a popular test-bed for the investigation of protein-ligand recognition (see for example Roy and Laughton, Biophys J, 2010, 99(1): 218-226).

    Firstly download and unpack the tutorial data files:

    wget https://bitbucket.org/ramonbsc/pypcazip/downloads/data.tar.gz
    tar -xvf data.tar.gz
    
    In the ./data folder you will find:

    • A topology file for the protein: Protein.gro.
    • Trajectory files for the protein, generated in the absence and presence of the pyrazine ligand: apoProteinTraj1.dcd and holoProteinTraj1.dcd respectively.
    • An “index” file apo+holo.ndx - we will come back to this later.

    If you want to have a quick look at the systems, load the .gro file into your favourite visualisation software, e.g. VMD or pymol. If you want to see more, load the trajectory files too!

    #2. Principal Component Analysis of apo-MUP using pyPcazip.

    ##a) Performing the PCA

    To get a quick guide to pyPcazip usage type:

    pyPcazip --help
    

    The trajectory files have already been stripped of waters, ions, and the ligand; we could do PCA on the whole protein but analysis of just the dynamics of the C-alpha atoms may be a good place to start. So begin by running the PCA job itself.

    pyPcazip -i data/apoProteinTraj1.dcd -t data/Protein.gro -s "name CA" -p CA.pdb -o apo_CA.pcz -v
    

    Two files will be produced: CA.pdb is a PDB format file containing just the atoms selected for analysis (good for checking that your selection string gave you what you intended), and apo_CA.pcz is the PCA output file. This is a compressed binary file, readable using other tools from the pyPcazip package, so let’s do this next, using pyPczdump:

    pyPczdump -i apo_CA.pcz -n
    

    gives basic information about the contents of the file. pyPczdump also allows many other sorts of analysis - for a quick guide type:

    pyPczdump --help
    

    ##b) Looking at eigenvalues.

    Now lets output a list of the eigenvalues:

    pyPczdump -i apo_CA.pcz -l -o apo_CA_evals.dat
    

    You can just scan through this in a terminal window, or input to your favourite graph-drawing package, e.g. gnuplot:

    gnuplot
    gnuplot> plot 'apo_CA_evals.dat' w boxes
    gnuplot> quit
    
    (see Figure 1, below)

    ##c) Analysis of modes.

    Clearly the majority of the dynamics of the free protein can be represented using a small number of collective modes (the eigenvectors). To get a feel for the sorts of motions these are, produce animations of them in the form of multi-model PDB format files:

    pyPczdump -i apo_CA.pcz --anim 1 -o apo_CA_ev1.pdb
    

    This command will produce a pub file to animate the first collective mode; this can be visualised using, e.g., VMD. (HINT: Graphics>Representations…>Drawing Method>Trace)

    ##d) Analysis of convergence and sampling.

    Having got a feel for the types of motion represented by each collective mode, we can see how these are sampled over the trajectory by outputting the time series of their projections, e.g.:

    pyPczdump -i apo_CA.pcz --proj 1 -o apo_CA_proj1.dat
    pyPczdump -i apo_CA.pcz --proj 2 -o apo_CA_proj2.dat
    
    for modes 1 and 2. Again these can be visualised through, e.g. gnuplot (Figure 2)

    Figure 1 Figure 2
    tutorialfigure1.jpg tutorialfigure2.jpg

    ##e) Introducing the pyPczplot tool. The process of running pyPczdump to produce an output file, and then turning this into some form of graph in a second step, is flexible but rather tedious. _pyPczplot__ is a trial version of an analysis tool that does both.

    pyPczplot -i apo_CA.pcz
    

    The start-up screen shows a plot of the eigenvalues, and summary data about the .pcz file (Figure 3)

    Press ‘1’ to show the time-series for the projections along the first PC (Figure 4). Pressing the ‘i’ and ‘k’ keys takes you to higher/lower PCs.

    Press ‘2’ to produce histograms of the data (Figure 5). A PC that is sampled in a perfectly harmonic way will give a Gaussian distribution with a variance equal to the eigenvalue (shown by the dotted line); you can see how well (or not) your data fits this model. Use the ‘j’ and ‘l’ keys to change the selected PC.

    Press ‘3’ to produce a ‘map’ showing how the trajectory moves over a 2D-subspace defined by two principal modes (Figure 6). Which two can be selected using the j/l and i/k keys. Press ‘a’ to animate the map - the "jumping amongst minima" nature of the trajectory should be apparent.

    Press ‘4’ to convert the map into a 2D histogram (Figure 7). This can presented in the form of a heat map or contour plot - press ‘c’ to toggle between them.

    Figure 3 Figure 4 Figure 5 Figure 6 Figure 7
    tutorialfigure3.png tutorialfigure4.png tutorialfigure5.png tutorialfigure6.png tutorialfigure7.png

    ##e) Further work. * Experiment with choosing different selections of atoms for the analysis (‘-s’ argument in step a)). * Repeat the whole process for the dataset from MUP bound to the pyrazine ligand (holo).

    #3. Comparative analysis of the dynamics of MUP in the presence and absence of a ligand.

    ##a) Calculation of the dot product matrix using pyPczcomp.

    If you visualise the first collective mode of MUP in the free state, and compare it with the same mode for MUP in the bound state, you should be able to see, even by eye, that they are not exactly the same. We can put this on a quantitative footing by calculating the dot product matrix between the eigenvectors found from PCA on the apo protein, with those found from PCA on the ligand-bound form.

    pyPczcomp -i apo_CA.pcz holo_CA.pcz --quick
    

    You should see that indeed the eigenvectors are rather different (average maximum dot product only 0.282), and overall the eigenvectors of the two simulations define a subspace that is only moderately conserved (subspace overlap 0.461).

    ##b) Combined PCA on both trajectories.

    To analyse the relative sampling and convergence of both simulations within a common subspace, we need to perform PCA on them both together:

    pyPcazip -i data/apoProteinTraj1.dcd data/holoProteinTraj1.dcd -t data/Protein.gro -s "name CA" -o apo+holo_CA.pcz -vv
    

    (note how we give a list of trajectory files to be analysed in the ‘-i’ argument).

    ##c) Visualisation of the data. Use pyPczplot to visualize the results. In this case, also specify an index file, which provides information about which snapshots in the combined trajectory used for the PCA belong to which simulation:

    pyPczplot -i apo+holo_CA.pcz -n data/apo+holo.ndx
    

    Using the 2D map option (key ‘4’) see how the apo simulation samples an area of PC1/PC2 space that is not sampled in the holo simulation.

    Updated