Single assembly

Issue #125 resolved
Former user created an issue

This is a very basic question, but I have not found a clear explanation. I wonder whether it would benefit from an explanation in the manual.

If I have hundreds of samples and can only do single assemblies, I will have hundreds of sets of contigs. It makes sense that these should be binned separately then dereplicated.

However, I imagine I would still want to leverage coverage data from all samples for each individual binning run. This means I would need to align every sample's FASTQs to every set of contigs. It wouldn't work to concatenate the contigs, as similar contigs from different samples would then compete for alignments.

Am I therefore committed to n^2 alignments to generate the coverage data required?
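
To put a number on the scale, something like this toy calculation is what I'm picturing (sample count and paths are just placeholders, not my real data):

```python
# Purely illustrative: counting the all-vs-all mapping jobs implied by binning
# N single-sample assemblies with coverage from all N read sets.
import itertools

samples = [f"sample_{i:03d}" for i in range(300)]   # "hundreds of samples"

# One alignment job per (assembly, read set) pair.
jobs = [(f"assemblies/{a}.contigs.fa", f"reads/{r}_R1.fq.gz")
        for a, r in itertools.product(samples, samples)]

print(len(jobs))  # 300 * 300 = 90,000 alignments
```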

Thanks!

Andrew

Comments (11)

  1. Rob Egan

    Hello Andrew,

    We believe that co-assembly is the best approach to avoid the duplication of separate assemblies and to recover low-abundance species shared across samples. A tool like MHM2 (https://bitbucket.org/berkeleylab/mhm2/src/master/) might be able to assemble your large data set on a suitable cluster.

    If you still cannot co-assemble, then the next best alternative would be to combine your single assemblies and deduplicate them: https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/dedupe-guide/

    Both of these strategies provide a single combined assembly on which MetaBAT can be run (rough sketch below).

    I believe that mapping N data sets to N separate assemblies is neither efficient nor effective.
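
    A rough sketch of that combined route, driven from Python (sample names and paths are placeholders, and only the basic tool options are shown, so check the dedupe.sh and MetaBAT documentation for anything specific to your data):

    ```python
    # Sketch of the "combine then deduplicate" route. dedupe.sh is from BBTools;
    # jgi_summarize_bam_contig_depths and metabat2 ship with MetaBAT 2.
    # All paths below are placeholders.
    import glob
    import subprocess

    def run(cmd):
        print(cmd)
        subprocess.run(cmd, shell=True, check=True)

    # 1. Concatenate the per-sample assemblies and drop duplicate/contained contigs.
    run("cat assemblies/*.contigs.fa > combined.fa")
    run("dedupe.sh in=combined.fa out=combined.dedup.fa")

    # 2. Map each sample's reads back to the single deduplicated reference
    #    (N mappings rather than N^2).
    run("bwa index combined.dedup.fa")
    for r1 in sorted(glob.glob("reads/*_R1.fq.gz")):
        sample = r1.split("/")[-1].replace("_R1.fq.gz", "")
        run(f"bwa mem combined.dedup.fa {r1} reads/{sample}_R2.fq.gz "
            f"| samtools sort -o bam/{sample}.bam -")

    # 3. Summarize depths across all samples and bin once.
    run("jgi_summarize_bam_contig_depths --outputDepth depth.txt bam/*.bam")
    run("metabat2 -i combined.dedup.fa -a depth.txt -o bins/bin")
    ```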

  2. Andrew McArdle

    Thanks for your advice Rob, much appreciated.

    I would love to use MetaHipMer, but with 1.5 TB of reads it might not be feasible (I also recall doubting that I could get UPC working on our cluster). With relatively low-complexity samples (throat swabs) and the knowledge that species may be dominated by particular strains in each sample, I think co-assembly could create many hybrid genomes.

    I was thinking it might be possible to co-assemble the unassembled reads from all samples and in this way recover lower abundance genomes, accepting the risk of hybridisation.
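
    Something along these lines is what I have in mind for collecting the leftover reads (a sketch only; the file layout is made up, and the samtools filtering flags should be double-checked against your library setup):

    ```python
    # Sketch: pull out read pairs where neither mate mapped to the sample's own
    # assembly, then pool them for one co-assembly of the "leftover" reads.
    # File names are placeholders.
    import glob
    import subprocess

    def run(cmd):
        subprocess.run(cmd, shell=True, check=True)

    for bam in sorted(glob.glob("self_mapped/*.bam")):  # each sample vs its own assembly
        sample = bam.split("/")[-1].removesuffix(".bam")
        # -f 12 keeps pairs where both the read and its mate are unmapped.
        run(f"samtools fastq -f 12 -1 unassembled/{sample}_R1.fq.gz "
            f"-2 unassembled/{sample}_R2.fq.gz {bam}")

    # The pooled unassembled/* files would then go into a single co-assembly
    # (e.g. with MHM2 or another metagenome assembler).
    ```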

    I like the idea of deduplicating the single assemblies; however, it sounds like only contained contigs would be dropped. Overlapping contigs would be preserved, so reads aligning to an overlapping region would be (potentially arbitrarily) assigned to one or the other. Do you think it's fair to expect that this should not bias the coverage statistics markedly?

    Best wishes,

    Andrew

  3. Rob Egan

    Hi Andrew,

    MHM2 requires UPC++, not UPC, so that might be a bit easier to install on your cluster hardware.

    So if strain resolution and preservation are of high importance, especially for the strain of highest abundance in each sample, then single assemblies will likely be the best course. If there is sufficient coverage (>15x), you can be reasonably assured that the highest-abundance strains were assembled well in each sample.

    A co-assembly will give you a mostly non-redundant reference for all the species in the whole project and will retrieve more of the species of lower abundance across all the samples, but will likely result in hybrid genomes for those species. Co-assembling the unassembled reads is an idea worth pursuing, imo, since a hybrid genome is the best you can get for genomes with low abundance across the board.

    Deduplication runs the same risk of hybridization and strain-squashing as co-assembly does.

    The coverage statistics that MetaBAT produces are fairly robust to strain differences, since it requires a high %ID for a read's coverage to be counted, but how well it works is still dominated by the underlying assembly metrics.

    Circling back to the original question of N^2 mappings, where the total data is 1.5 TB and you want to keep each sample assembly separate for the above reasons: it should be sufficient to map each sample's own reads plus those of just a few other random samples (say 4-8), instead of the full N, for MetaBAT to work well. Then, if after the first few rounds you notice that some of the bins show a few distinct patterns of species abundance, use the samples from those plus 1-2 more random ones, until you have a small set of 'reference' samples that provide good-enough differential abundance, and use that set across the board for each of your N samples (rough sketch at the end of this comment).

    … or invest in the co-assembly, bin that, and then try to polish the resulting bins N times into the separate dominant strains from each sample… with the caveat that polishing will likely only fix SNPs and leave intact the co-assembled hybrid genome's rearrangement, insertion, and deletion differences from each sample's dominant strain.

    -Rob
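
    A rough sketch of the "reference samples" idea above (sample names, paths, and the size of the reference set are placeholders; only the basic mapper and MetaBAT options are shown):

    ```python
    # Per-sample binning, but with depths computed from that sample's own reads
    # plus a small fixed subset of "reference" samples instead of all N.
    # All names and paths are placeholders.
    import random
    import subprocess

    def run(cmd):
        subprocess.run(cmd, shell=True, check=True)

    samples = [f"sample_{i:03d}" for i in range(300)]
    reference = random.sample(samples, 6)          # a reusable set of 4-8 samples

    for s in samples:
        asm = f"assemblies/{s}.contigs.fa"
        run(f"bwa index {asm}")
        bams = []
        for r in sorted({s, *reference}):          # own reads + the reference set
            bam = f"bam/{s}__{r}.bam"
            run(f"bwa mem {asm} reads/{r}_R1.fq.gz reads/{r}_R2.fq.gz "
                f"| samtools sort -o {bam} -")
            bams.append(bam)
        run(f"jgi_summarize_bam_contig_depths --outputDepth depth/{s}.txt {' '.join(bams)}")
        run(f"metabat2 -i {asm} -a depth/{s}.txt -o bins/{s}/bin")
    ```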

  4. Andrew McArdle

    Thanks so much, this gives me plenty of food for thought! I like the suggestion of coverage statistics from a subsample and will work on that first.

    Andrew

  5. Zhong Wang

    Hi Andrew,

    I think Rob addressed most of your questions. The overlaps between contigs from different samples may not produce abnormal coverage statistics if you instruct the mapper to randomly assign reads that map to multiple locations. In your shoes, I would merge/dedup the single assemblies into one and then run MetaBAT with all samples in a single binning experiment.

    Hope this helps.

    Cheers,

    Zhong

  6. Zhong Wang

    Andrew, after chatting with Rob yesterday, I realized that randomly assigning reads won't work for the duplicated regions. So the best option is still co-assembly. People have been using a hack: simulating long reads from the merged assembly and re-assembling them. I don't think there has been any systematic effort to evaluate this approach. Anyway, I'm not sure this is helpful, but I wish you all the best.

  7. Andrew McArdle

    It is very helpful - in cutting-edge fields such as this, answers are rarely totally straightforward!

  8. a.mcardle

    Just to add that, after circling through co-assembly (MetaHipMer set up and still queueing for a 265-node job on our HPC!), I have come across VAMB (https://github.com/RasmussenLab/vamb), which offers a native workflow for binning single-sample assemblies across multiple samples – definitely worth a look for me, as I already have the single-assembly contigs.
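
    For anyone reading later, the VAMB multi-split workflow looks roughly like this as I understand it from its README (script names and options may have changed, so verify them against the repository linked above; paths are placeholders):

    ```python
    # Rough sketch of VAMB's multi-split workflow; check the VAMB README for the
    # current CLI. All paths are placeholders.
    import glob
    import subprocess

    def run(cmd):
        subprocess.run(cmd, shell=True, check=True)

    # 1. Concatenate the per-sample assemblies; VAMB's concatenate.py renames
    #    contigs with a sample separator so bins can be split per sample later.
    run("concatenate.py catalogue.fna.gz assemblies/*.contigs.fa")

    # 2. Map every sample's reads to the combined catalogue (N mappings, not N^2).
    run("bwa index catalogue.fna.gz")
    for r1 in sorted(glob.glob("reads/*_R1.fq.gz")):
        sample = r1.split("/")[-1].replace("_R1.fq.gz", "")
        run(f"bwa mem catalogue.fna.gz {r1} reads/{sample}_R2.fq.gz "
            f"| samtools sort -o bam/{sample}.bam -")

    # 3. Run VAMB, telling it the separator so bins are reported per sample.
    run("vamb --outdir vamb_out --fasta catalogue.fna.gz --bamfiles bam/*.bam -o C")
    ```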
