Overview
Atlassian SourceTree is a free Git and Mercurial client for Windows.
Atlassian SourceTree is a free Git and Mercurial client for Mac.
README
Synopsis
Code

HReconstr_Potts.m  code to translate multiple sequence alignment (MSA) of viral protein strains into Potts spin glass model of fitness landscape by semianalytic Newton descent

HReconstr_Potts_trjAnalyzer.m  code to plot trajectory files written by HReconstr_Potts.m to assess model convergence
Input Files
 MSA.mat  toy multiple sequence alignment (MSA) containing 100 sequences of a 4residue viral protein
References

A.L. Ferguson, J.K. Mann, S. Omarjee, T. Ndung’u, B.D. Walker and A.K. Chakraborty “Translating HIV sequences into quantitative fitness landscapes predicts viral vulnerabilities for rational immunogen design” Immunity 38 606617 (2013)

K. Shekhar, C.F. Ruberman, A.L. Ferguson, J.P. Barton, M. Kardar, A.K. Chakraborty "Spin models inferred from patientderived viral sequence data faithfully describe HIV fitness landscapes" Phys. Rev. E 88 062705 (2013)

J.K. Mann, J.P. Barton, A.L. Ferguson, S. Omarjee, B.D. Walker, A.K. Chakraborty and T. Ndung’u "The fitness landscape of HIV1 gag: Advanced modeling approaches and validation of model predictions by in vitro testing" PLOS Comput. Biol. 10 8 e1003776 (2014)

G.R. Hart and A.L. Ferguson "Error catastrophe and phase transition in the empirical fitness landscape of HIV" Phys. Rev. E 91 032705 (2015)

G.R. Hart and A.L. Ferguson "Empirical fitness models for hepatitis C virus immunogen design" Phys. Biol. 12 066006 (2015)
Sample I/O
>> HReconstr_Potts
Commencing semianalytical Newton iterations...
Completed iteration 100 of 500
Completed iteration 200 of 500
Completed iteration 300 of 500
Completed iteration 400 of 500
Completed iteration 500 of 500
DONE!
Elapsed time is 332.817473 seconds.
>> HReconstr_Potts_trajAnalyzer
Documentation
HReconstr_Potts.m
INPUT FILES
MSA.mat  multiple sequence alignment nSeq sequences each comprising m amino acids denoted by single letter codes
h_in.dat  (optional) initial values of Potts model h parameters
J_in.dat  (optional) initial values of Potts model J parameters
OUTPUT FILES
MSA_numeric.mat  translation of MSA from single letter amino acid codes to integer codes
resIdx.dat  list of all amino acid residue integer codes observed in MSA at each position in decreasing order of prevalence
P1_target.dat  oneposition amino acid frequencies for each observed residue at each position computed from MSA
P2_target.dat  twoposition amino acid frequencies for each observed residue at each pair of positions computed from MSA
h_traj.dat  trajectory of Potts model h parameters over the course of model fitting
J_traj.dat  trajectory of Potts model J parameters over the course of model fitting
P1_traj.dat  trajectory of oneposition amino acid frequencies predicted by the model over the course of model fitting evaluated by MC sampling
P2_traj.dat  trajectory of twoposition amino acid frequencies predicted by the model over the course of model fitting evaluated by MC sampling
P1_targReg_traj.dat  trajectory of regularized oneposition amino acid target frequencies over the course of model fitting
P2_targReg_traj.dat  trajectory of regularized twoposition amino acid target frequencies over the course of model fitting
h_final.dat  final values of Potts model h parameters
J_final.dat  final values of Potts model J parameters
INPUT PARAMETERS
seed=200184;  rng seed
beta=1;  inverse temperature
nIter=500;  # Newton iterations
nStep=1000;  # MC steps per Newton iteration
regType=4;  regularization type (1=L1, 2=L2, 3=pseudocounts, 4=3+1, 5=3+2)
lambda_h=0.1;  regularization paramter for L1 & L2 h reg (for 1=L1 adjust P1_target by sign(h)*1/nSeq*lambda_h, for 2=L2 adjust P1_target by 2*h*1/nSeq*lambda_h )
lambda_J=0.1;  regularization paramter for L1 & L2 J reg (for 1=L1 adjust P2_target by sign(J)*1/nSeq*lambda_J, for 2=L2 adjust P2_target by 2*J*1/nSeq*lambda_J )
lambda_P1=0.0;  regularization parameter for P1 psuedocounts (for 3=pseudocounts adjust P1_target{i} by lambda_P1/nRes(i) )
lambda_P2=0.1;  regularization parameter for P2 psuedocounts (for 3=pseudocounts adjust P2_target{i,j} by lambda_P2/[nRes(i)*nRes(j)] )
gaugeFixh=1;  gauge fix h_i{i}(1)=0
gaugeFixJ=1;  gauge fix J_ij{i,j}(:,1)=J_ij{i,j}(1,:)=0
tol_MARG=1E3;  used for regType=3,4,5 (those involving pseudocounts), the target probabilities are regularized with pseudocounts once at the start of the run, unlike the adjustable regType=1,2 (L1,L2) h & J regularization, we iteratively converge the regularized P1 and P2 targets to assure marginal consistency and present a viable convergence goal
cntr_max_MARG=1000;  maximum number of iterations in P2 marginal renormalization for consistency with P1 under regularization
gamma_h=0.5;  softening parameter on h steps
gamma_J=0.5;  softening parameter on J steps
maxNewtonStep_h=0.1;  maximum permissible absolute value of h_i Newton steps, higher values are thresholded
maxNewtonStep_J=0.1;  maximum permissible absolute value of J_ij Newton steps, higher values are thresholded
write_mod=10;  iteration modulus for file writes
print_mod=10;  iteration modulus for screen reports
loadhJ=0;  load initial h and J values from h_in.dat and J_in.dat
NOTES
Variable number of amino acids in each position
Each position in the protein i is associated with a 21element vector h_i specifying the contributions of each of the 20 amino acids plus the unknown residue (if the residue identity cannot be unambiguously determined by sequencing) to the fictitious Potts model energy, and each pair of positions i and j with a 21by21element matrix J_ij specifying the contributions of pairwise epistatic interactions. To reduce computational and memory overhead, we allow the number of residues accessible at each position nRes(i) to reflect only those actually observed in the nSeq sequences comprising the MSA.
The onebody amino acid frequencies at each position are stored in a nRes(i)element vector P1_i, and the twobody frequencies at each pair of positions in a nRes(i)bynRes(j)element matrix P2_ij. The fitting procedure seeks to find the corresponding nRes(i)element h_i vectors at each position, and nRes(i)bynRes(j)element J_ij matrices at each pair of positions.
Gauge fixing
Due to the gauge invariance of the Potts Hamiltonian, we are at liberty to fix one element of each h_i vector and one row and column of each J_ij matrix. We implement this gauge fixing in the code by setting h_i[1] = 0 and J_ij[:,1]=J_ij[1,:]=0.
Newton descent
We perform multidimensional iterative Newton descent in {h_i,J_ij} to find the maximum likelihood (ML) / maximum a posteriori (MAP) estimate for the model parameters given the sequneces constituting the MSA. The ML estimate is obtained in the absence of a Bayesian prior (i.e., a maximum entropy or uniform prior), whereas the MAP estimate is achieved where an informative prior is selected (see below). By taking derivatives of the one and twobody amino acid probabilities {P1_i,P2_ij} as a function of {h_i,J_ij}, the Potts Hamiltonian admits analytical expressions for the gradients required in the Newton search, and obviating the need for their numerical estimation.
Bayesian regularization
We stabilize the model fitting using two varieties of regularization (i) offline (modifying the target probabilities computed from the MSA) and (ii) online (stabilizing the algorithm adjusting the target probabilities as a function of h_i or J_ij). By introducing some bias into our model, we seek to reduce variance and accelerate convergence. (As such, the converged model may not precisely reproduce the {P1_i,P2_ij} computed from the MSA due to the introduction of this bias.) In practice, the strength of the regularization can be reduced to zero in a series of fitting runs each of which passes the terminal {h_i,J_ij} as the initial guess for a new run with reduced regularization coefficients.

Offline. The {P1_i} values computed from the MSA may be augmented with pseudocounts lambda_P1/nRes(i) and the {P2_ij} values by lambda_P2/[nRes(i)*nRes(j)]. Conceptually, this process adds some (fractional) extra observations to our MSA, reflecting our belief that particular mutations or pairs of mutations may be more likely than the data would suggest. This is particularly useful in moving values in {P1_i,P2_ij} away from zero, reflecting our understanding that these mutations may not really be infinitely improbable, but just sufficiently rare not to be observed in the finite number of sequences nSeq in the MSA.

Online. We can further stabilize the fitting procedure by applying a Laplacian (L1) or Gaussian (L2) prior to the Bayesian inference procedure to penalize large magnitude values of {h_i,J_ij} and prevent them from exploding during the numerical fitting trajectory. In practice, this has the effect of modifying the target probabilities {P1_i,P2_ij} as a function of {h_i,J_ij} such that P1_i < P1_i + sign(h_i)*1/nSeq*lambda_h and/or P2_ij < P2_ij + sign(J_ij)*1/nSeq*lambda_J (L1 regularization), or P1_i < P1_i + 2*h_i*1/nSeq*lambda_h and/or P2_ij < P2_ij + 2*J_ij*1/nSeq*lambda_J (L2 regularization).
HReconstr_Potts_trajAnalyzer.m
INPUT FILES
h_traj.dat  trajectory of Potts model h parameters over the course of model fitting
J_traj.dat  trajectory of Potts model J parameters over the course of model fitting
P1_traj.dat  trajectory of oneposition amino acid frequencies predicted by the model over the course of model fitting evaluated by MC sampling
P2_traj.dat  trajectory of twoposition amino acid frequencies predicted by the model over the course of model fitting evaluated by MC sampling
P1_targReg_traj.dat  trajectory of regularized oneposition amino acid target frequencies over the course of model fitting
P2_targReg_traj.dat  trajectory of regularized twoposition amino acid target frequencies over the course of model fitting
P1_target.dat  oneposition amino acid frequencies for each observed residue at each position computed from MSA
P2_target.dat  twoposition amino acid frequencies for each observed residue at each pair of positions computed from MSA
OUTPUT FILES
h_traj.fig/jpg  trajectory of Potts model h parameters over the course of model fitting
J_traj.fig/jpg  trajectory of Potts model J parameters over the course of model fitting
P1_traj.fig/jpg  trajectory of oneposition amino acid frequencies predicted by the model over the course of model fitting evaluated by MC sampling
P2_traj.fig/jpg  trajectory of twoposition amino acid frequencies predicted by the model over the course of model fitting evaluated by MC sampling
P1_targReg_traj.fig/jpg  trajectory of regularized oneposition amino acid target frequencies over the course of model fitting
P2_targReg_traj.fig/jpg  trajectory of regularized twoposition amino acid target frequencies over the course of model fitting
P1_fit.fig/jpg  parity plot of oneposition amino acid target frequencies computed from the MSA vs. oneposition amino acid frequencies predicted by the Potts model averaged over the fitting trajectory
P2_fit.fig/jpg  parity plot of twoposition amino acid target frequencies computed from the MSA vs. twoposition amino acid frequencies predicted by the Potts model averaged over the fitting trajectory