Meraculous-2D Genome Assembler Pipeline Software v.2.2.6
Eugene Goltsman, Isaac Ho, Jarrod Chapman, Steven Hofmeyr
Meraculous is a whole genome assembler for Next Generation Sequencing data, geared for large genomes. Its hybrid k-mer/read-based approach capitalizes on the high accuracy of Illumina sequence by eschewing an explicit error correction step, which we argue to be redundant with the assembly process. Meraculous achieves high performance with large datasets by utilizing lightweight data structures and multi-threaded parallelization, allowing human-sized genomes to be assembled on a high-CPU cluster in under a day. The process pipeline implements a highly transparent and portable model of job control and monitoring in which different assembly stages can be executed and re-executed separately or in unison on a wide variety of architectures.
For more information see doc/meraculous/Manual.pdf
v. 2.2.6
* Many improvements to diploid_mode 2, including local depth ratio-based detection of heterozygous variant splits.
v. 184.108.40.206
* Fixed a bug in bubble_finder that caused bubble haplotyping to be disabled. PREVIOUS VERSION SHOULD NO LONGER BE USED!
* SLURM support added. The cluster environment is now detected automatically. Supported cluster software: SGE/UGE, SLURM.
* Compatibility updates for gcc compiler versions up to 7.1 and boost 1.63.0.
* In diploid_mode 2, updated the logic for detection of the homozygous (2x) peak depth in meraculous_ono when it is overshadowed by the heterozygous (1x) peak. It now relies on the bubble_depth_threshold parameter.
* Relaxed the criteria for inserting "suspended" objects during meraculous_ono.
* In diploid_mode 2, now filtering out diplotigs whose overall bubble depth is less than 20% of that in the alternative diplotig "sister" (diplotigX_p1 & diplotigX_p2), which suggests an error-induced bubble.
* In diploid_mode 1, the option 'gap_close_aggressive' will, in addition to its prior functionality, treat alignments to alternative diplotig "sisters" as mutually applicable during gap closure. This increases the likelihood of haplotype "cross-over" in gap-filling sequence but should help close more gaps.
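The 20% sister-depth filter described above can be sketched roughly as follows. This is an illustration only, not the Meraculous implementation: the function name, the input dictionary layout, and the choice to keep diplotigs that have no recorded sister are all assumptions.

```python
def filter_low_depth_diplotigs(depths, min_ratio=0.2):
    """Return the names of diplotigs that survive the sister-depth filter.

    depths: dict mapping names such as 'diplotig7_p1' / 'diplotig7_p2' to
            their overall bubble depth (hypothetical input layout).
    min_ratio: a diplotig is dropped when its depth is below this fraction
               of its sister's depth (20% per the release note).
    """
    kept = []
    for name, depth in depths.items():
        base, _, variant = name.rpartition("_p")
        sister = f"{base}_p{'2' if variant == '1' else '1'}"
        sister_depth = depths.get(sister)
        # Assumption: a diplotig with no recorded sister is kept as-is.
        if sister_depth is not None and depth < min_ratio * sister_depth:
            continue  # likely an error-induced bubble
        kept.append(name)
    return kept
```

With depths of 40 and 5 for a p1/p2 pair, the p2 variant falls below 20% of its sister and is dropped, while a pair at 30 and 28 is kept in full.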
* Added a lib_seq flag to specify a down-sampling rate for a given library. Use it when coverage is excessive and puts an unnecessary burden on resources. This flag replaces the rarely used "3-prime wiggleroom" flag as the last argument on the lib_seq line. The field is required; set it to 0 to disable down-sampling.
* Added new logic and heuristics to the scaffolding algorithms in diploid_mode 2. Haplotype resolution is now more robust in the presence of repeats. Scaffolding rounds are now aware of prior rounds' resolution of diploid-haploid tie collisions.
* Added a new user parameter, mergraph_depth_pct_cutoff, which we recommend for metagenomic assemblies. The counts of k-mer extension candidates are evaluated as a percentage of all candidates' counts combined. This allows low-abundance organisms to be treated with a lower minimum depth requirement than high-abundance members. Note that min_depth_cutoff is still valid and serves as the hard "floor".
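The interaction between a percentage-based cutoff and the hard floor can be illustrated with a minimal sketch. The function below is hypothetical and is not taken from the Meraculous source; it assumes that a candidate must satisfy both cutoffs and that both comparisons are inclusive.

```python
def passing_extensions(counts, min_depth_cutoff, depth_pct_cutoff):
    """Select k-mer extension candidates by absolute and relative depth.

    counts: dict mapping an extension base ('A', 'C', 'G', 'T') to its count.
    min_depth_cutoff: hard floor on the absolute count.
    depth_pct_cutoff: minimum share of the combined count, in percent.
    """
    total = sum(counts.values())
    if total == 0:
        return []
    # Assumption: a candidate must pass both the floor and the percentage test.
    return sorted(
        base
        for base, n in counts.items()
        if n >= min_depth_cutoff and 100.0 * n / total >= depth_pct_cutoff
    )
```

Under these assumptions, a candidate seen 5 times next to one seen 95 times is rejected at a 10% cutoff, whereas a candidate seen 4 times next to one seen 6 times passes, which is the behavior a low-abundance community member would need.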
* IMPORTANT!! Reworked the original diploid_mode 1 to be more logically consistent with mode 2. We no longer create mosaic pseudo-haploid diplotigs. Instead, just as in mode 2 (Dual Haplopaths), we create *pairs* of internally haplotype-consistent diplotigs. Both variants are preserved and are output as diplotigX_p1 and diplotigX_p2 at the end of the _bubble stage. To avoid a complexity explosion in the scaffold graph, we transform links that stem from _p2 diplotigs into links of the corresponding _p1 variant. The initial scaffolds thus have only _p1 diplotigs incorporated, while the _p2 variants exist as unlinked singleton contigs. We then swap in the correct variant based on an earlier determined clone-based linkage across diploid bubbles. A set of scaffolds representing both haplotypes is reported. Unlike diploid_mode 2, however, only one variant ends up in long scaffolds, while the alternative variant ends up as a loose single-contig scaffold. The final.scaffolds.fa file is a single-haplotype version of the final assembly, with the loose alt variants taken out. This mode is still only suitable for low-polymorphism diploids (SNP rate < 1/k).
* Fixed a bug that resulted in a crash in _gap_closure when bubble_depth_threshold was set to 1.
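The link transformation described for diploid_mode 1 can be pictured with a short sketch. It is an assumption-laden illustration rather than the actual scaffolding code: the pair-of-names link layout, the function name, and the "Contig10" style of non-diplotig names are hypothetical.

```python
import re

# Matches names of the form diplotigX_p2, as used in the release note.
_P2_NAME = re.compile(r"^(diplotig\d+)_p2$")


def remap_links_to_p1(links):
    """Rewrite links so that every _p2 diplotig end points at its _p1 sister.

    links: iterable of (contig_a, contig_b) name pairs (hypothetical layout).
    Returns a de-duplicated, sorted list of remapped pairs.
    """
    def to_p1(name):
        m = _P2_NAME.match(name)
        return f"{m.group(1)}_p1" if m else name

    # A set collapses links that become identical after remapping.
    return sorted({(to_p1(a), to_p1(b)) for a, b in links})
```

After remapping, two links that differed only in the variant they touched collapse into one, which is what keeps the scaffold graph from growing with every bubble.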
* Added a major new feature to resolve diploid assembly ambiguities (bubbles in graph & linkage conflicts). The new model, termed Dual Haplopaths, is suitable for highly polymorphic genomes (SNP rates of over 1%) and relies on mapping read pairs to bubble-contigs to maintain a consistent haplotype sequence throughout the resulting diplotig (i.e. to prevent haplotype crossover). At the end, BOTH haplotypes are represented during scaffolding and in the final assembly, which should come out at roughly twice the size of the actual diploid genome. Contigs representing a non-polymorphic region of the genome can be duplicated in the scaffolding stage, so the end result may contain non-unique contigs, but not non-unique scaffolds. The original method, where the haplotypes are "squashed" in a mosaic fashion, is still available and is recommended for diploid genomes with low polymorphism. To select between the two modes, the user parameter 'diploid_mode' should be set to 1 for the old "pseudo-haploid" model or 2 for the Haplopaths model. Setting 'diploid_mode' to 0 or omitting it altogether will default to the haploid assembly. Note that choosing the Haplopaths model significantly increases the run time for the _bubble stage, as it involves read mapping. The old parameter 'is_diploid' has been deprecated.
* Added a feature to the _import stage to fall back on the original fastq file when the sub-sampling produces too little data for prefix balancing and basic input statistics.
* The automated min_depth detection algorithm (findDmin.pl) was modified to pick a cutoff closer to the first minimum trough of the k-mer depth distribution.
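The idea of locating the first minimum trough in a k-mer depth distribution can be sketched as below. This is not the algorithm in findDmin.pl, whose details are not described here; the function name and the histogram layout are assumptions, and real data would likely need smoothing before such a simple scan is reliable.

```python
def first_trough_depth(hist):
    """Return the depth at the first local minimum of a k-mer depth histogram.

    hist: list where hist[d] is the number of distinct k-mers seen d times
          (index 0 is unused; hypothetical layout).
    Returns None if the histogram has no interior trough.
    """
    for d in range(2, len(hist) - 1):
        # A trough: strictly lower than the previous depth and not higher
        # than the next one.
        if hist[d] < hist[d - 1] and hist[d] <= hist[d + 1]:
            return d
    return None
```

For a histogram that falls steeply from the error peak at depth 1, bottoms out, and then rises toward the genomic peak, the returned depth is the bottom of that valley.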
* Fixed a bug in meraculous_contigs where running on a high number of threads could result in incomplete UFX info when building UUtigs.
* Added a user option 'cluster_num_nodes' which determines how various data is partitioned for parallel execution on a cluster. Combined with the parameters 'cluster_ram_request' and 'cluster_slots_per_task', this can be used to control the granularity of job sets. If this parameter is left out, the default behavior is to assume a single-node environment and therefore to partition only to fit under the available memory limit (also set by the user). The stages affected by this are meraculous_merblast, meraculous_ono, and meraculous_gap_closure.
* Optimized gap placement (gapPlacer.pl) and gap closure algorithms (merauder, now a compiled C binary).
* Fixed a bug in gap_closure where unaligned mates of anchored reads were projected into gaps redundantly.
* Fixed a bug in gap_closure where if a mix of Illumina 1.5 and 1.8-style reads was used, the latter would not be processed properly.
* Restructured the source tree so that the directory names are not redundant with the names in the build in case the user wants to build/install in-source.
* In the installer, added -lpthreads to the list of linker flags. This is required if the user has a Boost installation with pthreads enabled.
* k-mer size is now auto-selected based on genome size. Manual override still works via the mer_size parameter.
* Added a user option 'fallback_on_est_ins_size' which tells Meraculous to rely on the original insert size estimates if not enough read pairs map to the UUtigs to produce a more accurate assembly-based calculation. Typically this is a sign that something is wrong with the data, so this workaround should *not* be used routinely.
* If resuming from the middle of the import stage, Meraculous will skip libraries that have already been processed.
* Parameters local_num_procs and cluster_slots_per_task can no longer be changed between stages since they're linked to the way certain outputs are partitioned.
* User parameters meraculous_mer_size and meraculous_min_depth_cutoff renamed to 'mer_size' and 'min_depth_cutoff'.
* Removed the Bio::Seq Perl dependency.
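The decision behind 'fallback_on_est_ins_size' can be pictured with a small sketch. It is a hypothetical illustration, not the Meraculous logic: the function name, the `min_pairs` threshold and its default, and the use of a plain mean and standard deviation are all assumptions.

```python
from statistics import mean, stdev


def choose_insert_size(estimated, observed, min_pairs=100, fallback=False):
    """Pick the insert-size (mean, stdev) to use for one library.

    estimated: (mean, stdev) supplied by the user for the library.
    observed: insert sizes measured from read pairs mapped to UUtigs.
    min_pairs: hypothetical minimum number of mapped pairs to trust
               (must be at least 2 for a standard deviation to exist).
    fallback: mirrors the idea of 'fallback_on_est_ins_size'.
    """
    if len(observed) >= min_pairs:
        return (mean(observed), stdev(observed))
    if fallback:
        # Too few mapped pairs: reuse the user's original estimate.
        return estimated
    raise ValueError("too few mapped read pairs to estimate insert size")
```

Raising an error by default reflects the note above that too few mapped pairs usually points to a problem with the data, so the fallback has to be requested explicitly.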
* Contigs are now named in a reproducible manner, i.e. if the assembly is rerun with the same inputs/parameters, the output should be identical.
* Removed BioPerl dependencies to facilitate installation at external sites.
* Fixed the bug where merblast output was being deleted prematurely with -cleanup_level 2.
* Fixed the bug in merBlast where not all contig-splinting alignments were reported.
* Fixed the bug in the ordering of the threaded merblast outputs.
* Disabled the logic that determined the number of chunks to split the fastqs into based on num_local_procs when not using a cluster. It limited the user to that number of chunks even if they later switched to another machine or cluster.
* Added auto-detection of the bubble_depth_cutoff. Users no longer need to stop in the middle of a diploid run to determine it. Manual setting is still supported and will turn off auto-detection.
* Scaffolding now uses both splinting and spanning reads in the same round. The two types can add complementary information and as a result reduce misjoins.
* Inside scaffolds, contig ends are trimmed back if the *estimated* gap size between them is less than 10bp. This is done to avoid overlapping redundant ends.
* Added a helper script boostrap_run.sh that allows creating a pseudo-run structure based on another run up to a specified stage. The user can then create a new "branch" by running meraculous.pl with the -resume and -dir options pointing to the new run structure. This allows creating multiple assembly versions that have common initial stages (e.g. different scaffolding settings starting with the same UUtigs).