UPDATED FOR v1.0.3
Tutorial: Curate the FALCON-Unzip Clavicorona pyxidata assembly
Follow the instructions for setting up Purge Haplotigs if you haven't already done so: Detailed Install
Download and extract the files for this tutorial

The preparation has already been done (the reads have been mapped to the primary contigs and sorted), so you only need to download the BAM and FASTA files.
```
# create a directory for the tutorial and navigate to it
mkdir purge_haplotigs_tutorial
cd purge_haplotigs_tutorial

# download the BAM alignments and the primary contigs
wget https://zenodo.org/record/841398/files/cns_p_ctg.aligned.sd.bam
wget https://zenodo.org/record/841398/files/cns_p_ctg.aligned.sd.bam.bai
wget https://zenodo.org/record/841398/files/cns_p_ctg.fasta
wget https://zenodo.org/record/841398/files/cns_p_ctg.fasta.fai
```
This first script will run a coverage analysis over your BAM file and use the output to generate a read-depth histogram. You can increase the parallelization of this stage with the -t / -threads flag; however, disk IO is likely to be the bottleneck.
```
# Run the first step
purge_haplotigs readhist -b cns_p_ctg.aligned.sd.bam -g cns_p_ctg.fasta
```
Take a look at the histogram .png file (it should look like the one below). You should see two peaks, corresponding to 'haploid' and 'diploid' levels of coverage. Where there are homologous contigs in the haploid assembly, the reads are split between the two contigs, resulting in approximately half the read-depth for those contigs. In a perfect world you would have no homologous primary contigs and the histogram would show only one 'diploid' peak.
Choose some read-depth cutoffs: a low read-depth cutoff, the midpoint between the 'haploid' and 'diploid' peaks, and a high read-depth cutoff (I've chosen low cutoff = 10, midpoint = 70, high cutoff = 190).
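If you'd rather sanity-check the peaks numerically than rely on the PNG alone, a one-liner over a depth histogram can help. This is only a sketch: it assumes you have a plain two-column `depth count` table (the hist.txt below is made up); the actual readhist output files may be laid out differently, so adapt the column choices to your file.

```shell
# Made-up bimodal histogram: column 1 = read depth, column 2 = count.
cat > hist.txt <<'EOF'
5 40
30 650
35 700
40 620
70 90
115 850
120 900
125 870
190 30
EOF

# List the most populated depth bins; they cluster around the 'haploid'
# and 'diploid' peaks (~35 and ~120 in this toy data), so pick -l below
# the haploid peak, -m between the peaks, and -h above the diploid peak.
sort -k2,2nr hist.txt | head -n 4
```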
This step will use the cutoffs you've chosen, together with the read-depth information in the previous step's coverage output file, to flag 'suspect' and 'junk' contigs. Contigs are 'suspect' if they fall mostly outside the 'diploid' level of coverage, and 'junk' if they are mostly at abnormal coverage (very low or very high). The 'junk' contigs will be removed from the assembly, whereas the 'suspect' contigs will be analysed in the third step. You can change the thresholds at which contigs are flagged as suspect or junk with the -s and -j flags, and you can turn off junking contigs altogether by setting -j to a number greater than 100, e.g. -j 101.
```
# Run the second step
purge_haplotigs contigcov -i cns_p_ctg.aligned.sd.bam.gencov -l 10 -m 70 -h 190
```
This step will perform a number of operations to identify and remove haplotigs and may take a while to complete. If you want to generate dotplots, the first stage will be parallel bedtools genomecov runs over genome windows; this read-depth trace will later be juxtaposed with the dotplots. The second stage will perform minimap2 alignments of the suspect contigs against all of the other contigs. The next stage is the iterative contig purging: each iteration, the pipeline will pull the alignments for each suspect contig's best-hit contigs, analyse the alignments to determine whether they satisfy the conditions for reassignment as a haplotig, and check for conflicts before reassigning. The final stage will double-check all reassigned contigs to ensure they still satisfy the conditions for reassignment as a haplotig.
The two alignment scores that are calculated are:
- alignment_coverage: This is the % of the contig that aligns to its top two hit contigs. This determines whether or not a contig is flagged for reassignment (either as a haplotig or a repetitive contig).
- max_match_coverage: This is the total % of alignments between the suspect contig and its two top-hit contigs (e.g. if it aligns entirely to both hit contigs then the max_match_coverage will be ~200 %). This score is mostly deprecated, especially if you supply repeat annotations; however, it may still be useful for identifying repeat-rich contigs.
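To make the two scores concrete, here is a toy calculation — a sketch only, not the pipeline's actual code. It assumes a 100 kb suspect contig and a hand-written three-column table (start, end, hit contig) of its alignment intervals; the real pipeline derives these intervals from its minimap2 hits.

```shell
# Toy alignments of a 100 kb suspect contig to its two best-hit contigs.
# Columns: start  end  target_contig (positions on the suspect contig).
cat > aln.txt <<'EOF'
0 40000 hitA
30000 70000 hitA
65000 90000 hitB
EOF

# alignment_coverage: % of the contig covered by the union of alignments
# (overlaps merged). max_match_coverage: sum of alignment lengths over
# the contig length -- it can exceed 100 % when a region aligns to both
# hit contigs.
sort -k1,1n aln.txt | awk -v len=100000 '
    NR == 1 { s = $1; e = $2 }
    NR > 1  { if ($1 > e) { union += e - s; s = $1; e = $2 }
              else if ($2 > e) { e = $2 } }
    { total += $2 - $1 }
    END { union += e - s
          printf "alignment_coverage = %.0f %%\n", 100 * union / len
          printf "max_match_coverage = %.0f %%\n", 100 * total / len }'
```

Note how overlapping alignments are merged for alignment_coverage but simply summed for max_match_coverage, which is why the latter can approach 200 % for a contig that aligns fully to both of its hits.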
The -a flag is the cutoff for flagging contigs for reassignment, and the -m flag is the cutoff for labelling reassigned contigs as repetitive. For this example I'm using the default 4 threads and the default -a 70 cutoff.
```
# Run the third step
purge_haplotigs purge -g cns_p_ctg.fasta -c coverage_stats.csv
```
The pipeline is highly reentrant; if you wish to rerun the purge_haplotigs purge step with different parameters, it won't re-run the coverage analysis (if generating dotplots) or the minimap2 alignments, and will generally finish much faster than the original run. Try re-running at a different -a cutoff:

```
purge_haplotigs purge -g cns_p_ctg.fasta -c coverage_stats.csv -a 60
```
Now try rerunning, this time producing dotplots of the final alignments (you will need to use the -d flag and pass the BAM file with the -b flag):

```
purge_haplotigs purge -g cns_p_ctg.fasta -c coverage_stats.csv -a 60 -d -b cns_p_ctg.aligned.sd.bam
```
The pipeline will give you a number of output files to help you understand the reassignments and further curate your assembly.
curated.fasta and curated.haplotigs.fasta: These are your 'curated' assembly files, with curated.fasta being the new haploid assembly and curated.haplotigs.fasta containing the reassigned contigs. As mentioned above, if you run the pipeline on the concatenated FALCON-Unzip primary contigs + haplotigs then the curated diploid assembly will be curated.fasta + curated.haplotigs.fasta. However, if you run the pipeline on just the FALCON-Unzip primary contigs then you will need to add the FALCON-Unzip haplotigs onto curated.haplotigs.fasta for the diploid assembly.
curated.artefacts.fasta: These are the contigs that were flagged as 'junk' based on the read-depth information. The assumption is that if reads cannot map well to a contig then it is likely an artefactual contig, and if reads map at a depth much greater than the rest of the assembly then the contig is likely an organelle contig or a collapsed repetitive element. You may wish to keep organelle or repetitive contigs, or you might choose to turn off this feature altogether.
curated.reassignments.tsv: This is a tab-separated table file that you can open in a spreadsheet program. This file lists all the reassignments that were made as well as the two alignment scores used to determine the reassignments. It also lists the contigs that were kept and the contigs that were flagged earlier as 'junk' based on the read-depth information.
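For quick triage you can also filter this table from the command line. A hedged sketch follows: the miniature table is invented, and it assumes the reassignment label sits in the last tab-separated column — verify that against the header of your own curated.reassignments.tsv before relying on it.

```shell
# Invented miniature of the reassignments table (tab-separated):
# contig, top hit, second hit, alignment_coverage, max_match_coverage, label.
printf '000071F\t000062F\t-\t95.1\t98.0\tHAPLOTIG\n'      > reassignments.tsv
printf '000075F\t000071F\t000072F\t88.3\t160.2\tREPEAT\n' >> reassignments.tsv
printf '000099F\t-\t-\t-\t-\tKEEP\n'                      >> reassignments.tsv

# Pull out the contigs that were labelled repetitive
# (ASSUMES the label is in the last column -- check your header).
awk -F'\t' '$NF == "REPEAT" { print $1 }' reassignments.tsv
```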
curated.contig_associations.log: This file shows the purging order for the contig reassignments. Take the example below:
000062F,PRIMARY -> 000071F,HAPLOTIG -> 000072F,HAPLOTIG -> 000075F,REPEAT -> 000086F,HAPLOTIG
Here 000086F and 000071F were identified as haplotigs of 000062F. Contigs 000072F and 000075F were identified as haplotigs of 000071F. NOTE: the alignments aren't necessarily nested, but this file does serve as an easy reference for further investigations (i.e. you could concatenate these four haplotigs and run them against the primary contig using mummer, delta-filter, and mummerplot to see if/how much the haplotigs are nested).
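That mummer check might look something like the following — a sketch only, assuming samtools and the MUMmer tools are on your PATH, and using the contig IDs from the example log line above:

```shell
# Extract the primary contig and its four reassigned contigs
# (uses the .fai index downloaded at the start of the tutorial).
samtools faidx cns_p_ctg.fasta 000062F > 000062F.fasta
samtools faidx cns_p_ctg.fasta 000071F 000072F 000075F 000086F > associated.fasta

# Align, filter, and plot to see if/how much the haplotigs are nested.
nucmer -p 000062F_vs_associated 000062F.fasta associated.fasta
delta-filter -m 000062F_vs_associated.delta > 000062F_vs_associated.filtered.delta
mummerplot --png -p 000062F_vs_associated 000062F_vs_associated.filtered.delta
```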
dotplots_reassigned_contigs and dotplots_unassigned_contigs: If you used the -d / -dotplots flag and supplied the BAM file, these directories will contain the dotplots for all of the reassigned and unassigned contigs respectively. Each image is actually two dotplots (the top two plots) juxtaposed with a read-depth trace. The unassigned dotplots directory will not include contigs that lack any significant alignments to other primary contigs at the end of purging.
In the example below, the haplotig 000055F mostly aligns to 000026F. Part of the haplotig, however, does not, and the read-depth over that region is higher. This suggests that the first 25 kb or so might be haplotype-fused (i.e. it represents both alleles) and, in a perfect world, should be included in the haploid assembly. There is a trade-off between genome completeness and de-duplication, with diminishing returns. Whether you leave it as a haplotig or put it back in with the primary contigs is up to you.