agalma / NEWS

* Hotfixes to correct errors in TUTORIAL and append the supermatrix FASTA file
  to the 'multalign' report.
* New 'supermatrix' pipeline can construct supermatrices by occupancy
  proportion. (#75)
* New 'multalign' pipeline uses MAFFT instead of MACSE for multiple alignment
  of translated protein sequences. The simultaneous alignment and translation
  approach originally implemented in Agalma can improve translations by
  accommodating frameshifts; however, mistakenly including fairly distant
  homologs or erroneous transcripts within clusters can result in overall poor
  translations and alignment of clusters.  The old multalign pipeline was
  renamed 'multalignx' where the 'x' stands for translated multiple alignment,
  since MACSE uses nucleotide alignments to infer translations. (#79)
* Improved the linkage between the phylogeny pipelines, so that the most recent
  previous run of the correct type is identified by default. A previous run
  can be explicitly chosen with the --previous argument (now consistent across
  all the pipelines). (#85)
* Rewrote the 'assemble' pipeline to subsume the wrapper script, and
  run the various components of Trinity as separate stages within the pipeline.
  This provides finer grained resource usage and fixes some problems with
  robustness and memory use we were experiencing on our compute cluster. GNU
  parallel replaces ParaFly for both the quantify_graph and butterfly stages.
  Oases is no longer supported in 'assemble', but additional assemblers could
  be added in the future as variants on the 'assemble' pipeline, e.g.
  'assemble_oases'. (#87)
* The report for the 'supermatrix' pipeline now includes a table of the
  percentage of genes present for each taxon. (#82)
* The regression tests are taking longer to run (30-40 minutes) and have been
  divided up into different levels. The default level (1) now runs in about
  (16-cores). Higher levels (2 or 3) provide more complete tests and are
  selected with 'agalma test X'. (#92)
* Added a histogram of mean quality scores to the 'sanitize' report. (#90)
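
As an illustration of the occupancy-based supermatrix construction and the
per-taxon gene table mentioned above, here is a minimal sketch. It is not the
Agalma implementation: the function names, the 'presence' mapping, and the
greedy rule (add genes in order of decreasing taxon count while cumulative
occupancy stays at or above the requested proportion) are all assumptions made
for the example.

```python
def select_by_occupancy(presence, taxa, min_proportion):
    """Pick genes for a supermatrix by cumulative occupancy (hypothetical helper).

    presence: dict mapping gene name -> set of taxa that have a sequence.
    taxa: list of all taxa (the rows of the matrix).
    min_proportion: lowest acceptable fraction of filled cells.
    """
    # Most complete genes first; ties broken by name for reproducibility.
    ranked = sorted(presence, key=lambda g: (-len(presence[g]), g))
    selected = []
    filled = 0
    for gene in ranked:
        new_filled = filled + len(presence[gene])
        cells = (len(selected) + 1) * len(taxa)
        if new_filled / cells < min_proportion:
            break  # adding this gene would drop occupancy below the target
        selected.append(gene)
        filled = new_filled
    return selected


def percent_genes_per_taxon(presence, taxa, genes):
    """Percentage of the selected genes present in each taxon."""
    if not genes:
        return {taxon: 0.0 for taxon in taxa}
    return {
        taxon: 100.0 * sum(taxon in presence[g] for g in genes) / len(genes)
        for taxon in taxa
    }


taxa = ["A", "B", "C", "D"]
presence = {
    "g1": {"A", "B", "C", "D"},
    "g2": {"A", "B", "C"},
    "g3": {"A"},
    "g4": {"A", "B"},
}
genes = select_by_occupancy(presence, taxa, 0.8)
table = percent_genes_per_taxon(presence, taxa, genes)
```

With these toy inputs, a lower proportion admits more (sparser) genes, at the
cost of more missing data for poorly sampled taxa.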


* Improved parallelization of the blastx annotation in 'postassemble'. (#53)
* 'homologize' has a new mode for seeding the homology search with an existing
  set of genes, such as CEGMA or a previously computed supermatrix. Instead of
  performing an all-by-all homology search, transcripts are only aligned
  against the seed genes. (#56, #59)
* New parameter in 'genetree' to disable bootstrapping or change the threshold
  for filtering by mean bootstrap support. (#60)
* Added multi-node parallelism to 'multalign' and 'genetree' using GNU
  parallel. (#58, #61)
* 'postassemble' now performs protein translation (largest open reading frame
  with Transdecoder) and transcript quantification (with RSEM). The schema for
  the 'sequences' table was updated so that exemplars are now selected as the
  transcript with highest abundance in a locus, rather than by the earlier
  ad-hoc selection of the longest transcript in the locus. Exemplars are now
  chosen in 'homologize' (via 'database.load_seqs') and not in 'postassemble'.
  (#57, #63)
* New 'orthologize' pipeline provides an alternative phylogeny pipeline that
  directly infers orthologs using OMA. (#64)
* Sequence reduction plot in the phylogeny report has more detail: added
  sequence counts before and after 'homologize.mcl_cluster' and for each filter
  applied in 'multalign.refine_clusters'. (#70, #71)
* Fixed a mis-calculation in the overlap threshold applied in
  'homologize.parse_edges'. (#72)
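
The change in exemplar selection described above (highest abundance per locus
instead of longest transcript) can be sketched as follows. This is only an
illustration under assumed inputs: the tuple layout and the function name are
hypothetical and do not reflect the actual 'sequences' schema or
'database.load_seqs'.

```python
def select_exemplars(transcripts):
    """Return {locus: transcript_id} choosing the most abundant transcript.

    transcripts: iterable of (transcript_id, locus, abundance) tuples.
    This layout is an assumption for the sketch, not the Agalma schema.
    On ties, the first transcript seen for a locus is kept.
    """
    best = {}
    for transcript_id, locus, abundance in transcripts:
        current = best.get(locus)
        if current is None or abundance > current[1]:
            best[locus] = (transcript_id, abundance)
    return {locus: tid for locus, (tid, _) in best.items()}


exemplars = select_exemplars([
    ("t1", "locus1", 5.0),
    ("t2", "locus1", 9.5),
    ("t3", "locus2", 1.0),
])
```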


* Added bootstrapping to RAxML calls in the 'genetree.genetrees' stage, and a
  subsequent filtering stage that removes trees with low mean bootstrap
  support. (#43)
* Removed the auto-generated report at the end of 'transcriptome' and put the
  appropriate report commands in the TUTORIAL. (#51)
* Added report commands to the phylogeny section of the TUTORIAL. (#50)
* Fixed problems with 'tabular_report' that caused unnecessary rows and empty
  table cells. (#52)
* A new option '--nreads' for reducing the number of reads that 'sanitize'
  outputs. (#49)
* Modified 'load' to correctly validate external assemblies with IUPAC
  ambiguity codes. (#41)
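
To illustrate the filtering of gene trees by mean bootstrap support, here is a
minimal sketch. It assumes support values are stored as internal node labels
in the Newick string (for example ')95:0.05'); the function names, the regular
expression, and the threshold of 50 are assumptions for the example, not the
code or default used in 'genetree'.

```python
import re

# Matches a numeric label written directly after a closing parenthesis,
# which is where node-label style support values appear in Newick.
SUPPORT = re.compile(r"\)(\d+(?:\.\d+)?)")


def mean_bootstrap(newick):
    """Mean of the internal node support values, or None if there are none."""
    values = [float(v) for v in SUPPORT.findall(newick)]
    if not values:
        return None
    return sum(values) / len(values)


def filter_trees(trees, threshold):
    """Keep trees whose mean bootstrap support is at least the threshold.

    trees: dict mapping a tree name to its Newick string.
    """
    kept = {}
    for name, newick in trees.items():
        mean = mean_bootstrap(newick)
        if mean is not None and mean >= threshold:
            kept[name] = newick
    return kept


trees = {
    "well_supported": "((A:0.1,B:0.2)95:0.05,(C:0.1,D:0.1)60:0.1);",
    "poorly_supported": "((A:0.1,B:0.2)30:0.05,(C:0.1,D:0.1)40:0.1);",
}
kept = filter_trees(trees, 50)
```

Trees written with support values in bracketed branch labels would need a
different pattern.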


* Added 'resource_report' and 'phylogeny_report' utilities.
* Additional reporting for phylogeny pipelines:
  o 'genetree' reports maximum likelihood tree when run on a supermatrix.
  o supermatrix image in 'multalign', ordered by most complete taxon and gene.
  o some histograms were changed to tables for small numbers of taxa.
* Updates to README and TUTORIAL:
  o Clarified that the Agalma-bundled SwissProt database only includes Metazoa.
  o Fixed overwrite of 'BIOLITE_RESOURCES' variable in TUTORIAL. (#24)
* 'homologize' now ignores bad BLAST hits, which seem to occur for query
  sequences longer than 10 kb and in which the original query id is lost in
  the output.
* Fixed bug with passing flags through to RAxML in 'genetree'. (#19)
* Removed a hard-coded minimum cluster size of 3 from 'multalign' and replaced
  with the 'min_taxa' value (which should never be less than 4).
* New mechanism to break up the expensive all-by-all tblastx in 'homologize',
  so that many smaller chunks can be run externally/concurrently, and read
  back into the pipeline. This feature is not yet tested and we plan to finish
  it in the 0.3.3 release.
* Fixed default RAxML model in genetree. (#9)
* New regression test feature 'agalma test' downloads and runs a small
  transcriptome and phylogeny example to verify correct installation and
  validate changes to the code base.
* Phylogeny pipelines can now pass a common ID with --id and they will
  intelligently find the appropriate output from earlier pipelines. Previously,
  numeric run IDs had to be passed between pipelines. This is demonstrated in


* Split off part of the 'assemble' pipeline into a new 'postassemble' pipeline
  that performs all post-assembly filtering, coverage analysis, and annotation.
  It can be run on external (non-Agalma) assemblies prior to load, although
  the exemplars stage needs to be skipped if the assembly does not have
  Oases-style headers.
* Removed the annotation stage from 'load' pipeline, since this is now
  provided by 'postassemble' for external assemblies.
* Updated TUTORIAL now has a more complete phylogeny section and includes
  estimates of resource requirements.
* bugfix: typo in 'agalma_database' key in default agalma.cfg
* bugfix: missing 'cd' command in ubuntu install script