This PR will add the value in the BED name column to every FASTA header. GFF3 behaviour is not changed.
I would be happy to dig in, test, and merge this if someone who uses this tool ( @dan @jgoecks @jen ?) okays the change at a high level. Seems entirely reasonable to me, but I do not know much about genomic formats, etc.
Seems reasonable to me. One thing to keep in mind, though, is that putting a space b/t the identifier and name means that the name is not a part of the sequence ID because, per the FASTA specification, the ID ends after the first space. IMO, it's more useful to include the name as another field in the ID: e.g., chrom_start_end_strand_name
Any reason to include it in the ID? We would use it as some kind of description of the sequences, and that does not need to be part of the ID. For example, cmsearch will use it only if it's part of the description and not part of the ID. Moreover, chrom_start_end_strand should be unique, shouldn't it?
Thanks for reviewing @jgoecks & @jmchilton
P.S. we missed you @ GCC hope to see you next year :)
Hmm, I was thinking that some tools would require it to be in the ID rather than the description. Hard to say what the right thing to do is; let's start simple and leave as is, putting the name in the description. Unique IDs would be nice, but I don't see how including/excluding name changes that.
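For anyone following along, here is a minimal sketch of the difference being discussed. It is not the tool's code: `fasta_header` and `split_header` are hypothetical helpers, and the only assumption is the convention mentioned above that the ID ends at the first whitespace.

```python
def fasta_header(chrom, start, end, strand, name=None):
    """Build a header with the BED name in the description (hypothetical helper)."""
    seq_id = f"{chrom}_{start}_{end}_{strand}"
    # A space separates the ID from the description, so the name stays out of the ID.
    return f">{seq_id} {name}" if name else f">{seq_id}"


def split_header(header):
    """Split a header line into (ID, description) at the first run of whitespace."""
    parts = header[1:].split(None, 1)
    seq_id = parts[0]
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description


header = fasta_header("chr1", 100, 200, "+", name="geneA")
print(header)                # >chr1_100_200_+ geneA
print(split_header(header))  # ('chr1_100_200_+', 'geneA')
```

With the name in the description, the ID is the same whether or not the BED file has a name column, so tools that key on the ID should see no change.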
4 of the existing test cases fail when this change is applied.
Are these easily addressable? I know these test cases are hard to set up; if you have a fix you believe will work, feel free to update the pull request and I can rerun the tests.
< rossl> jmchilton: on those failing tests after adding the extra bed column - maybe the outputs in test-data just need to be updated to match - probably no need to change the tool itself? (random AU2c worth)
< jmchilton> rossl: I think that is half true. There is also a logic error in there that needs to be corrected though ('name' used before defined).
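The tool's code is not quoted in this thread, so the following is only a hypothetical sketch of the kind of "used before defined" error mentioned above, together with one possible fix. The function names and the four-column `fields` list are assumptions made for illustration.

```python
def header_buggy(fields):
    """Reads 'name' before it is assigned (illustrative bug, not the real code)."""
    header = ">%s_%s_%s" % (fields[0], fields[1], fields[2])
    if len(fields) > 3:
        # 'name' is assigned further down, so Python treats it as a local
        # variable and raises UnboundLocalError when it is read here.
        header += " " + name
    name = fields[3] if len(fields) > 3 else ""
    return header


def header_fixed(fields):
    """Assigns 'name' first, then uses it only when it is non-empty."""
    name = fields[3] if len(fields) > 3 else ""
    header = ">%s_%s_%s" % (fields[0], fields[1], fields[2])
    if name:
        header += " " + name
    return header


bed_fields = ["chr1", "100", "200", "geneA"]
try:
    header_buggy(bed_fields)
except UnboundLocalError as err:
    print("buggy version failed:", err)
print(header_fixed(bed_fields))
```

Note that the buggy version only fails on input that actually has a name column, so a test with three-column input would not catch it.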