negative coverage?

Issue #166 open
nan wang created an issue

Hi there,

I use the metabat2 with code

runMetaBat.sh -s 500000 ${CONTIG_FASTA} *_bwa.sort.bam

and I got a warning: “Negative coverage depth is not allowed for the contig S1Ccontig_5589, column 1: -2.32938e+08”. I checked the ${CONTIG_FASTA}.depth.txt file and noticed that the coverage of one sample on the contig S1Ccontig_5589 is negative.
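
For reference, a minimal sketch of how the depth file can be scanned for negative per-sample values, assuming the usual depth.txt layout (contigName, contigLen, totalAvgDepth, then a depth/variance column pair per bam file):

# Print any data row with a negative value in a depth column (4, 6, 8, ...),
# assuming the standard jgi_summarize_bam_contig_depths column layout.
awk 'NR > 1 { for (i = 4; i <= NF; i += 2) if ($i < 0) { print; break } }' ${CONTIG_FASTA}.depth.txt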

I am not sure if this is related, but for another binning run (same ${CONTIG_FASTA} but different samples) there is another warning: “[Warning!] Need to check where the average depth is greater than 1e+7 for the contig S1Ccontig_822, column 1”. This is absurd, since the coverage of these samples on other contigs is <1.

Do you have any idea how this could come about and why it happened?

Thank you!

Best regards,

Nan

Comments (10)

  1. Rob Egan

    Hello Nan, I agree this is absurd, and I have no idea what may be going on except, perhaps, a malformed set of bam files. Can you share one of the bam files, or run the following to give me an idea of what is happening? “samtools view xxx.bam | grep S1Ccontig_5589 > S1Ccontig_5589.txt; wc S1Ccontig_5589.txt; head -100 S1Ccontig_5589.txt; tail -1000 S1Ccontig_5589.txt”
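
    If sharing a full bam is impractical, a small per-contig subset is usually enough to reproduce the problem. A minimal sketch, assuming the bam is coordinate-sorted (file names are placeholders):

    samtools index xxx.bam                                          # skip if a .bai index already exists
    samtools view -b xxx.bam S1Ccontig_5589 > S1Ccontig_5589.bam    # only the alignments to that contig
    samtools index S1Ccontig_5589.bam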

  2. nan wang reporter

    Hi Rob, Thank you for your reply! 

    I ran MetaBAT again and got the same warning on a different contig… let me give a more detailed description of what’s going on.

    I have two groups of samples. The first group has four samples: two of them were sequenced on MinION R9 flow cells and the rest on R10. The second group has five samples on R9 and one on R10. R9 is known to be less accurate and to yield fewer reads. Therefore, for each group, the samples were co-assembled by flow cell (i.e. for group one there are two contig files, one for R9 and one for R10; same for group two). Then I merged the two contig files for each group (the contig IDs were renamed so that there are no duplicate contig IDs).

    On Nov 25, I provided all bam files to MetaBAT and noticed that less than 5% of the reads from the R9 flow cells were mapped to the contigs. I also got the “negative coverage” warning for group 2 and the “high average depth” warning for group 1 (specifically, file BRC_barcode05_bwa.sort.bam).

    Now I have run MetaBAT separately for the R9 and R10 flow cells, and lowered the percent identity threshold for R9 to 90%. This time I got the “negative coverage” warning for group 1 (BRC_barcode05_bwa.sort.bam: “[Warning!] Negative coverage depth is not allowed for the contig S1Ccontig_822, column 1: -2.12367e+08”) and no suspicious warning for group 2.

    Here is the output of wc S1Ccontig_822.txt:     601    9878 2339641 S1Ccontig_822.txt

    I have also learnt that GraphMB is a better tool for long-read sequencing and I will try it. But since MetaBAT was used by previous researchers for MinION data, I’d still like to see how it goes with my data. Thanks a lot!

    Best regards,

    Nan

  3. Rob Egan

    Hi Nan,

    Okay, there is a lot going on here. There is definitely some sort of bug, but you are also using MetaBAT in an untested and non-standard workflow.

    Let me start with the fact that MetaBAT has not been validated to work well with reads that have high error rates. The %ID threshold is set to 97% because any read that ambiguously maps to multiple scaffolds is not useful for differentiating contigs based on abundance. So, in general, you will get better results with another tool (like GraphMB) specifically designed to work with long reads. You can also run MetaBAT without an abundance file to rely on the TNF differentiation alone.
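
    As a rough sketch of the TNF-only route (paths are placeholders; not a validated recipe for Nanopore data):

    # Bin on tetranucleotide frequencies alone: no abundance/depth file is passed
    metabat2 -i ${CONTIG_FASTA} -o bins_tnf/bin

    # For comparison, the usual abundance-aware run; --percentIdentity defaults to 97
    jgi_summarize_bam_contig_depths --outputDepth depth.txt *_bwa.sort.bam
    metabat2 -i ${CONTIG_FASTA} -a depth.txt -o bins/bin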

    If I understand correctly, you combined two assemblies and just renamed the contigs. This is not how co-assembly is supposed to work: you will have a ton of fully, partially, and overlappingly duplicated sequences that won't help in binning, because the binner will get confused in all sorts of ugly ways. For a proper co-assembly, *all* the data needs to be presented to the assembler, and you should use that single assembly to cluster into bins. At the very least you should run BBTools' dedup to remove the easier duplication scenarios from your merged assembly, but I strongly encourage you to make a proper co-assembly.
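
    For the merged-assembly stopgap, a minimal sketch using BBTools' dedupe.sh (file names are placeholders and options may differ between BBTools versions):

    # Collapse exact and contained duplicate contigs from the merged assembly
    dedupe.sh in=merged_contigs.fa out=merged_contigs.dedup.fa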

    That all being said, the underlying issue that causes the negative depths to be returned is unknown, and I'd like to fix it. At the moment I have no clue how depths like -2.12367e+08 could possibly be generated by 601 alignments, but I suspect something is funky in the way that BWA generated the sam/bam files which MetaBAT is mis-interpreting, or some assumptions in MetaBAT are grossly broken by this MinION data. FWIW, I recommend using minimap for mapping high-error-rate reads.
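
    For example, a typical (hypothetical) minimap2 mapping step for Nanopore reads, with placeholder file names:

    minimap2 -ax map-ont ${CONTIG_FASTA} reads.fastq.gz | samtools sort -o sample_mm2.sort.bam -
    samtools index sample_mm2.sort.bam
    jgi_summarize_bam_contig_depths --outputDepth depth.txt sample_mm2.sort.bam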

    If you could attach that S1Ccontig_822.txt file which you generated, I might be able to reproduce the bug here and diagnose what the problem is.

    -Rob

  4. nan wang reporter

    Hi Rob,

    Thank you for your comments on this pipeline! The only reason I assembled the R9 and R10 reads separately is that co-assembling them confuses the assembler, as they have significantly different read quality and sequencing depth. I guess BBTools' dedup would be my best option. I will also definitely try GraphMB. Thank you again!

    As for the S1Ccontig_822.txt file, I uploaded it last time, but it didn’t come inline with the last email and instead appeared as a separate attachment. I will attach it again here to make sure you see it. Thank you for looking into it!

    Best regards,

    Nan

  5. Rob Egan

    Hi Nan,

    Sorry for the lack of follow-up here. I cannot reproduce this issue with just the S1Ccontig_822.txt that you provided. No matter how I mock up a SAM file using it as a base, I always get 0.0846287 and 0.0774876 as the depth and variance, assuming that the length of S1Ccontig_822 is >=5940…. I don’t know the actual length; if it is smaller, the program prints a bunch of warnings (because the CIGARs don’t match), and if it is much smaller, the program core dumps. If I make it much larger, then the depth gets much smaller, as expected.
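
    For reference, a rough sketch of the kind of mock-up described above (the @SQ length of 5940 is an assumption; S1Ccontig_822.txt holds the raw alignment lines from samtools view):

    printf '@SQ\tSN:S1Ccontig_822\tLN:5940\n' > mock.sam
    cat S1Ccontig_822.txt >> mock.sam
    samtools sort -o mock.sort.bam mock.sam
    jgi_summarize_bam_contig_depths --outputDepth mock_depth.txt mock.sort.bam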

    At this point I’m at a loss as to how I can replicate this issue, unless you are willing to share one of the bam files.

    -Rob

  6. Rob Egan

    Okay, I may have figured out what is going on, or at least what may be contributing to these erroneous depths. When we calculate the depths, we ignore the edge bases of each contig, because mappers tend to bias against alignments along the edges and so bias the depth there to be low. With long-read technologies the reads can be very long, and the edge calculations were not properly capping the maximum length of the edges. I made a branch (Issue166) that should better handle this logic and added warnings that will print if this ever crops up again.

    I’ll push this updated code to master presently.
