issue with fasta headers printed in purge.pl output
I installed the current snapshot of master on our cluster, and a user asked us to make the change below, which writes the fasta headers without the |'s in scripts/purge.pl. I'm told that otherwise windowmasker fails because of the pipes in the fasta headers.
--- scripts/purge.pl~ 2018-07-09 19:33:52.000000000 -0400
+++ scripts/purge.pl 2018-07-19 14:44:05.987487000 -0400
@@ -384,7 +384,7 @@
$id = $1;
close $OUT if ($OUT);
open $OUT, '>', "$MINCE_DIR/$id.fasta" or err("failed to open \"$MINCE_DIR/$id.fasta\" for writing");
- print $OUT $l;
+ print $OUT ">$id\n";
} else {
err('failed to get id from fasta file');
}
Comments (5)
-
repo owner Thanks for spotting this. I managed to break quite a few things with some recent updates. I think in the current dev version it should strip the |arrow |quiver suffixes wherever possible as they seem to just cause issues with other programs. I'll update the master branch with this bugfix. If you need to add the |quiver |arrow suffixes back on afterwards you should be able to do that with a one-liner:
cat curated.fasta | sed 's/^>\(.*\)$/>\1|arrow/' > outfile.fasta
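For anyone who needs to go in both directions, here is a minimal sketch of a strip/restore round trip. The function names `strip_suffix` and `add_suffix` are hypothetical helpers and not part of purge_haplotigs. It assumes the suffix contains no `/` or `&` characters, and that anything after the first `|` in a header (including any description text) can be discarded.

```shell
#!/bin/sh
# Hypothetical helpers (not part of purge_haplotigs).

# Remove everything from the first '|' onward, on FASTA header lines only.
# Reads the named files, or stdin when no file is given.
strip_suffix() {
    sed '/^>/ s/|.*$//' "$@"
}

# Append a suffix to every FASTA header line.
# Usage: add_suffix SUFFIX [FILE...]
# Assumption: SUFFIX contains no '/' or '&', which sed would treat specially.
add_suffix() {
    suffix=$1
    shift
    sed "/^>/ s/\$/$suffix/" "$@"
}
```

For example, `strip_suffix contigs.fasta > clean.fasta` followed later by `add_suffix '|arrow' curated.fasta > outfile.fasta` should restore the original naming, provided every original header carried the same suffix. Sequence lines are left untouched because both commands are restricted to lines starting with `>`.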
-
Thanks, Michael. I was able to proceed by manually fixing suspect.fasta, which I guess was created before the change. My genome finished with no errors with this fix.
-
repo owner That's good to hear.
-
repo owner - changed status to resolved
Hi Michael, I was asking Afif for this change. I am running purge_haplotigs on FALCON-Unzip primary contigs, and had the issue with |arrow in the contig names. The error message was:
With the fix, and after removing the intermediate files, I could proceed to build the proper blastn db. Are there other steps where the original contig names are used? For example, I see suspects.fasta contains the |arrow part. Can you make the rest of the pipeline stable regardless of having |arrow in the contig name?
I have already generated bam files and completed other steps with the primary contig set for several genomes, so renaming the sequences in the bam/fasta files wouldn't make much sense.
Thanks, Arang