issue with fasta headers printed in purge.pl output

Issue #11 resolved
Afif Elghraoui created an issue

I installed the current snapshot of master on our cluster, and a user asked us to make the change below to scripts/purge.pl, which writes the fasta headers without the |'s. I'm told that windowmasker otherwise fails because of the pipes in the fasta headers.

--- scripts/purge.pl~   2018-07-09 19:33:52.000000000 -0400
+++ scripts/purge.pl    2018-07-19 14:44:05.987487000 -0400
@@ -384,7 +384,7 @@
                     $id = $1;
                     close $OUT if ($OUT);
                     open $OUT, '>', "$MINCE_DIR/$id.fasta" or err("failed to open \"$MINCE_DIR/$id.fasta\" for writing");
-                    print $OUT $l;
+                    print $OUT ">$id\n";
                 } else {
                     err('failed to get id from fasta file');
                 }

Comments (5)

  1. Arang Rhie

    Hi Michael, I was asking Afif for this change. I am running purge_haplotigs on FALCON-Unzip primary contigs, and had the issue with |arrow in the contig names. The error message was:

    Error: (CObjReaderParseException::eInvalidID) '000000F|arrow' is not a valid local ID (m_Pos = 0)
    Error: (106.16) Application's execution failed (CObjReaderParseException::eInvalidID) '000000F|arrow' is not a valid local ID (m_Pos = 0)
    

    With the fix, and after removing the intermediate files, I could proceed to build the proper blastn db. Are there other steps where the original contig names are used? For example, I see that suspects.fasta still contains the |arrow part. Can you make the rest of the pipeline work regardless of whether the contig names contain |arrow?

    I already have bam files and other steps gone through with the primary contig set for several genomes, and renaming the sequence names in the bam/fasta file wouldn't make much sense.

    Thanks, Arang
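
One way to check which files still carry piped contig names is to list every fasta header that contains a `|`. The following is a minimal sketch, not part of purge_haplotigs; the function name `list_piped_headers` is hypothetical, and the file names in the usage comment are only examples.

```shell
# Hypothetical helper: print every fasta header line that contains a pipe,
# prefixed with the name of the file it came from.
list_piped_headers() {
    # Passing /dev/null as an extra operand makes grep print file names
    # even when only a single fasta file is given.
    grep '^>.*|' "$@" /dev/null
}

# Example usage (file names are illustrative):
#   list_piped_headers suspects.fasta curated.fasta
```

Any header it prints is a candidate for the kind of `is not a valid local ID` failure shown above.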

  2. Michael Roach repo owner

    Thanks for spotting this. I managed to break quite a few things with some recent updates. I think the current dev version should strip the |arrow and |quiver suffixes wherever possible, as they seem to just cause issues with other programs. I'll update the master branch with this bugfix. If you need to add the |quiver or |arrow suffixes back on afterwards, you should be able to do that with a one-liner: sed 's/^>\(.*\)$/>\1|arrow/' curated.fasta > outfile.fasta
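
The stripping and re-adding of suffixes described in this comment can be sketched as a pair of shell filters. This is an illustration under stated assumptions, not the pipeline's own code: the function names `strip_suffix` and `add_suffix` are hypothetical, the input file names are examples, and the headers are assumed to hold only the contig ID (no description after it).

```shell
# Hypothetical filter: on header lines only, drop everything from the
# first '|' to the end of the line (e.g. ">000000F|arrow" -> ">000000F").
strip_suffix() {
    sed '/^>/ s/|.*$//'
}

# Hypothetical filter: on header lines only, append "|<suffix>" to the
# end of the line. Assumes the header contains just the contig ID.
add_suffix() {
    sed "/^>/ s/\$/|$1/"
}

# Example usage (file names are illustrative):
#   strip_suffix < primary_contigs.fasta > stripped.fasta
#   add_suffix arrow < curated.fasta > outfile.fasta
```

Both filters leave sequence lines untouched because the substitution is restricted to lines starting with `>`.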

  3. Arang Rhie

    Thanks, Michael. I could proceed after manually fixing the suspect.fasta, which I guess was created before the change. My genome finished with no errors with this fix.
