1. Andrew Ferguson
  2. viralfitness_matlab

Overview

HTTPS SSH

README

Synopsis

Code
  • HReconstr_Potts.m - code to translate multiple sequence alignment (MSA) of viral protein strains into Potts spin glass model of fitness landscape by semi-analytic Newton descent

  • HReconstr_Potts_trjAnalyzer.m - code to plot trajectory files written by HReconstr_Potts.m to assess model convergence

Input Files
  • MSA.mat - toy multiple sequence alignment (MSA) containing 100 sequences of a 4-residue viral protein

References

  1. A.L. Ferguson, J.K. Mann, S. Omarjee, T. Ndung’u, B.D. Walker and A.K. Chakraborty “Translating HIV sequences into quantitative fitness landscapes predicts viral vulnerabilities for rational immunogen design” Immunity 38 606-617 (2013)

  2. K. Shekhar, C.F. Ruberman, A.L. Ferguson, J.P. Barton, M. Kardar, A.K. Chakraborty "Spin models inferred from patient-derived viral sequence data faithfully describe HIV fitness landscapes" Phys. Rev. E 88 062705 (2013)

  3. J.K. Mann, J.P. Barton, A.L. Ferguson, S. Omarjee, B.D. Walker, A.K. Chakraborty and T. Ndung’u "The fitness landscape of HIV-1 gag: Advanced modeling approaches and validation of model predictions by in vitro testing" PLOS Comput. Biol. 10 8 e1003776 (2014)

  4. G.R. Hart and A.L. Ferguson "Error catastrophe and phase transition in the empirical fitness landscape of HIV" Phys. Rev. E 91 032705 (2015)

  5. G.R. Hart and A.L. Ferguson "Empirical fitness models for hepatitis C virus immunogen design" Phys. Biol. 12 066006 (2015)

Sample I/O

>> HReconstr_Potts

Commencing semi-analytical Newton iterations...

Completed iteration 100 of 500

Completed iteration 200 of 500

Completed iteration 300 of 500

Completed iteration 400 of 500

Completed iteration 500 of 500

DONE!

Elapsed time is 332.817473 seconds.

>> HReconstr_Potts_trajAnalyzer

Documentation

HReconstr_Potts.m

INPUT FILES

MSA.mat - multiple sequence alignment nSeq sequences each comprising m amino acids denoted by single letter codes

h_in.dat - (optional) initial values of Potts model h parameters

J_in.dat - (optional) initial values of Potts model J parameters

OUTPUT FILES

MSA_numeric.mat - translation of MSA from single letter amino acid codes to integer codes

resIdx.dat - list of all amino acid residue integer codes observed in MSA at each position in decreasing order of prevalence

P1_target.dat - one-position amino acid frequencies for each observed residue at each position computed from MSA

P2_target.dat - two-position amino acid frequencies for each observed residue at each pair of positions computed from MSA

h_traj.dat - trajectory of Potts model h parameters over the course of model fitting

J_traj.dat - trajectory of Potts model J parameters over the course of model fitting

P1_traj.dat - trajectory of one-position amino acid frequencies predicted by the model over the course of model fitting evaluated by MC sampling

P2_traj.dat - trajectory of two-position amino acid frequencies predicted by the model over the course of model fitting evaluated by MC sampling

P1_targReg_traj.dat - trajectory of regularized one-position amino acid target frequencies over the course of model fitting

P2_targReg_traj.dat - trajectory of regularized two-position amino acid target frequencies over the course of model fitting

h_final.dat - final values of Potts model h parameters

J_final.dat - final values of Potts model J parameters

INPUT PARAMETERS

seed=200184; -- rng seed

beta=1; -- inverse temperature

nIter=500; -- # Newton iterations

nStep=1000; -- # MC steps per Newton iteration

regType=4; -- regularization type (1=L1, 2=L2, 3=pseudo-counts, 4=3+1, 5=3+2)

lambda_h=0.1; -- regularization paramter for L1 & L2 h reg (for 1=L1 adjust P1_target by sign(h)*1/nSeq*lambda_h, for 2=L2 adjust P1_target by 2*h*1/nSeq*lambda_h )

lambda_J=0.1; -- regularization paramter for L1 & L2 J reg (for 1=L1 adjust P2_target by sign(J)*1/nSeq*lambda_J, for 2=L2 adjust P2_target by 2*J*1/nSeq*lambda_J )

lambda_P1=0.0; -- regularization parameter for P1 psuedo-counts (for 3=pseudo-counts adjust P1_target{i} by lambda_P1/nRes(i) )

lambda_P2=0.1; -- regularization parameter for P2 psuedo-counts (for 3=pseudo-counts adjust P2_target{i,j} by lambda_P2/[nRes(i)*nRes(j)] )

gaugeFixh=1; -- gauge fix h_i{i}(1)=0

gaugeFixJ=1; -- gauge fix J_ij{i,j}(:,1)=J_ij{i,j}(1,:)=0

tol_MARG=1E-3; -- used for regType=3,4,5 (those involving pseudo-counts), the target probabilities are regularized with pseudo-counts once at the start of the run, unlike the adjustable regType=1,2 (L1,L2) h & J regularization, we iteratively converge the regularized P1 and P2 targets to assure marginal consistency and present a viable convergence goal

cntr_max_MARG=1000; -- maximum number of iterations in P2 marginal renormalization for consistency with P1 under regularization

gamma_h=0.5; -- softening parameter on h steps

gamma_J=0.5; -- softening parameter on J steps

maxNewtonStep_h=0.1; -- maximum permissible absolute value of h_i Newton steps, higher values are thresholded

maxNewtonStep_J=0.1; -- maximum permissible absolute value of J_ij Newton steps, higher values are thresholded

write_mod=10; -- iteration modulus for file writes

print_mod=10; -- iteration modulus for screen reports

loadhJ=0; -- load initial h and J values from h_in.dat and J_in.dat

NOTES

Variable number of amino acids in each position

Each position in the protein i is associated with a 21-element vector h_i specifying the contributions of each of the 20 amino acids plus the unknown residue (if the residue identity cannot be unambiguously determined by sequencing) to the fictitious Potts model energy, and each pair of positions i and j with a 21-by-21-element matrix J_ij specifying the contributions of pairwise epistatic interactions. To reduce computational and memory overhead, we allow the number of residues accessible at each position nRes(i) to reflect only those actually observed in the nSeq sequences comprising the MSA.

The one-body amino acid frequencies at each position are stored in a nRes(i)-element vector P1_i, and the two-body frequencies at each pair of positions in a nRes(i)-by-nRes(j)-element matrix P2_ij. The fitting procedure seeks to find the corresponding nRes(i)-element h_i vectors at each position, and nRes(i)-by-nRes(j)-element J_ij matrices at each pair of positions.

Gauge fixing

Due to the gauge invariance of the Potts Hamiltonian, we are at liberty to fix one element of each h_i vector and one row and column of each J_ij matrix. We implement this gauge fixing in the code by setting h_i[1] = 0 and J_ij[:,1]=J_ij[1,:]=0.

Newton descent

We perform multidimensional iterative Newton descent in {h_i,J_ij} to find the maximum likelihood (ML) / maximum a posteriori (MAP) estimate for the model parameters given the sequneces constituting the MSA. The ML estimate is obtained in the absence of a Bayesian prior (i.e., a maximum entropy or uniform prior), whereas the MAP estimate is achieved where an informative prior is selected (see below). By taking derivatives of the one- and two-body amino acid probabilities {P1_i,P2_ij} as a function of {h_i,J_ij}, the Potts Hamiltonian admits analytical expressions for the gradients required in the Newton search, and obviating the need for their numerical estimation.

Bayesian regularization

We stabilize the model fitting using two varieties of regularization (i) offline (modifying the target probabilities computed from the MSA) and (ii) online (stabilizing the algorithm adjusting the target probabilities as a function of h_i or J_ij). By introducing some bias into our model, we seek to reduce variance and accelerate convergence. (As such, the converged model may not precisely reproduce the {P1_i,P2_ij} computed from the MSA due to the introduction of this bias.) In practice, the strength of the regularization can be reduced to zero in a series of fitting runs each of which passes the terminal {h_i,J_ij} as the initial guess for a new run with reduced regularization coefficients.

  1. Offline. The {P1_i} values computed from the MSA may be augmented with pseudo-counts lambda_P1/nRes(i) and the {P2_ij} values by lambda_P2/[nRes(i)*nRes(j)]. Conceptually, this process adds some (fractional) extra observations to our MSA, reflecting our belief that particular mutations or pairs of mutations may be more likely than the data would suggest. This is particularly useful in moving values in {P1_i,P2_ij} away from zero, reflecting our understanding that these mutations may not really be infinitely improbable, but just sufficiently rare not to be observed in the finite number of sequences nSeq in the MSA.

  2. Online. We can further stabilize the fitting procedure by applying a Laplacian (L1) or Gaussian (L2) prior to the Bayesian inference procedure to penalize large magnitude values of {h_i,J_ij} and prevent them from exploding during the numerical fitting trajectory. In practice, this has the effect of modifying the target probabilities {P1_i,P2_ij} as a function of {h_i,J_ij} such that P1_i <-- P1_i + sign(h_i)*1/nSeq*lambda_h and/or P2_ij <-- P2_ij + sign(J_ij)*1/nSeq*lambda_J (L1 regularization), or P1_i <-- P1_i + 2*h_i*1/nSeq*lambda_h and/or P2_ij <-- P2_ij + 2*J_ij*1/nSeq*lambda_J (L2 regularization).

HReconstr_Potts_trajAnalyzer.m

INPUT FILES

h_traj.dat - trajectory of Potts model h parameters over the course of model fitting

J_traj.dat - trajectory of Potts model J parameters over the course of model fitting

P1_traj.dat - trajectory of one-position amino acid frequencies predicted by the model over the course of model fitting evaluated by MC sampling

P2_traj.dat - trajectory of two-position amino acid frequencies predicted by the model over the course of model fitting evaluated by MC sampling

P1_targReg_traj.dat - trajectory of regularized one-position amino acid target frequencies over the course of model fitting

P2_targReg_traj.dat - trajectory of regularized two-position amino acid target frequencies over the course of model fitting

P1_target.dat - one-position amino acid frequencies for each observed residue at each position computed from MSA

P2_target.dat - two-position amino acid frequencies for each observed residue at each pair of positions computed from MSA

OUTPUT FILES

h_traj.fig/jpg - trajectory of Potts model h parameters over the course of model fitting

J_traj.fig/jpg - trajectory of Potts model J parameters over the course of model fitting

P1_traj.fig/jpg - trajectory of one-position amino acid frequencies predicted by the model over the course of model fitting evaluated by MC sampling

P2_traj.fig/jpg - trajectory of two-position amino acid frequencies predicted by the model over the course of model fitting evaluated by MC sampling

P1_targReg_traj.fig/jpg - trajectory of regularized one-position amino acid target frequencies over the course of model fitting

P2_targReg_traj.fig/jpg - trajectory of regularized two-position amino acid target frequencies over the course of model fitting

P1_fit.fig/jpg - parity plot of one-position amino acid target frequencies computed from the MSA vs. one-position amino acid frequencies predicted by the Potts model averaged over the fitting trajectory

P2_fit.fig/jpg - parity plot of two-position amino acid target frequencies computed from the MSA vs. two-position amino acid frequencies predicted by the Potts model averaged over the fitting trajectory