Add the value (nameCol) in a given BED file to the FASTA header.

#239 Merged
Repository
galaxy-central
Branch
default
Author
  1. Björn Grüning
Description

This PR will add the value in the BED name column to every FASTA header. GFF3 behaviour is not changed.

Comments (7)

  1. John Chilton

    I would be happy to dig in, test, and merge this if someone who uses this tool ( @dan @jgoecks @jen ?) okays the change at a high level. Seems entirely reasonable to me, but I do not know much about genomic formats, etc...

  2. Jeremy Goecks

    Seems reasonable to me. One thing to keep in mind, though, is that putting a space b/t the identifier and name means that the name is not a part of the sequence ID because, per the FASTA specification, the ID ends after the first space. IMO, it's more useful to include the name as another field in the ID: e.g., chrom_start_end_strand_name
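
The ID/description distinction raised above can be illustrated with a minimal sketch. It assumes the common convention that the sequence ID runs from `>` to the first whitespace and the rest of the line is a free-text description. The function name `split_fasta_header` is hypothetical and is not part of the tool.

```python
def split_fasta_header(line):
    """Split a FASTA header line into (sequence ID, description).

    Hypothetical helper for illustration only; assumes the ID ends at
    the first whitespace character after the leading '>'.
    """
    if not line.startswith(">"):
        raise ValueError("not a FASTA header: %r" % line)
    # Drop the '>' and the trailing newline, then split once on whitespace.
    parts = line[1:].rstrip("\n").split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description


# Name separated by a space: it lands in the description, not the ID.
spaced = split_fasta_header(">droPer1_super_1_139823_139913_- AK028861")

# Name joined with an underscore: it becomes part of the ID.
joined = split_fasta_header(">droPer1_super_1_139823_139913_-_AK028861")
```

With the space-separated form the ID stays `droPer1_super_1_139823_139913_-`, so tools that key on the ID never see the name. With the underscore form the whole string, name included, is the ID.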

  3. Björn Grüning author

    Hi @jgoecks,

    Is there any reason to include it in the ID? We would use it as a kind of description of the sequences, and that does not need to be part of the ID. For example, cmsearch will use it only if it's part of the description and not part of the ID. Moreover, chrom_start_end_strand should be unique, shouldn't it?

    Thanks for reviewing @jgoecks & @jmchilton

    P.S. we missed you @ GCC hope to see you next year :)

  4. Jeremy Goecks

    Hmm, I was thinking that some tools would require it to be in the ID rather than the description. Hard to say what the right thing to do is; let's start simple and leave as is, putting the name in the description. Unique IDs would be nice, but I don't see how including/excluding name changes that.

  5. John Chilton

    4 of the existing test cases fail when this change is applied.

    Example Errors:

    Traceback (most recent call last):
      File "/afs/galaxyproject.org/user/jmchilton/galaxy-central-239/tools/extract/extract_genomic_dna.py", line 300, in <module>
        if __name__ == "__main__": __main__()
      File "/afs/galaxyproject.org/user/jmchilton/galaxy-central-239/tools/extract/extract_genomic_dna.py", line 255, in __main__
        if name.strip():
    UnboundLocalError: local variable 'name' referenced before assignment
    
    
    Traceback (most recent call last):
      File "/afs/galaxyproject.org/user/jmchilton/galaxy-central-239/test/functional/test_toolbox.py", line 171, in test_tool
        self.do_it( td, shed_tool_id=shed_tool_id )
      File "/afs/galaxyproject.org/user/jmchilton/galaxy-central-239/test/functional/test_toolbox.py", line 102, in do_it
        self.verify_dataset_correctness( outfile, hid=elem_hid, maxseconds=testdef.maxseconds, attributes=attributes, shed_tool_id=shed_tool_id )
      File "/afs/galaxyproject.org/user/jmchilton/galaxy-central-239/test/base/twilltestcase.py", line 855, in verify_dataset_correctness
        raise AssertionError( errmsg )
    AssertionError: History item 2 different than expected, difference (using diff):
    ( /afs/galaxyproject.org/user/jmchilton/galaxy-central-239/test-data/extract_genomic_dna_out2.fasta v. /tmp/tmpRBZBfV/tmpVKK_KA/new_files_path_6j6KHi/tmposwym_extract_genomic_dna_out2.fasta )
    --- local_file
    +++ history_data
    @@ -1,6 +1,6 @@
    ->droPer1_super_1_139823_139913_-
    +>droPer1_super_1_139823_139913_- AK028861
     CGTCGGCTTCTGCTTCTGCTGATGATGGTCGTTCTTCTTCCTTTACTTCT
     TCCTATTTTTCTTCCTTCCCTTACACTATATCTTCCTTTA
    ->droPer1_super_1_156750_156844_-
    +>droPer1_super_1_156750_156844_- BC126698
     CCGGGCTGCGGCAAGGGATTCACCTGCTCCAAACAGCTCAAGGTGCACTC
     CCGCACGCACACGGGCGAGAAGCCCTATCACTGCGACATCTGCT
    

    Are these easily addressable? I know these test cases are hard to set up; if you have a fix you believe will work, feel free to update the pull request and I can rerun the tests.

  6. John Chilton

    < rossl> jmchilton: on those failing tests after adding the extra bed column - maybe the outputs in test-data just need to be updated to match - probably no need to change the tool itself? (random AU2c worth)
    < jmchilton> rossl: I think that is half true. There is also a logic error in there that needs to be corrected though ('name' used before defined).
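
The "'name' used before defined" error is the usual Python pattern of a local variable that is assigned only inside one branch and then read unconditionally. The sketch below shows one way to avoid it, assuming the name is taken from a column of the split BED line. `build_header`, `base_id`, `fields`, and `name_col` are hypothetical names chosen for illustration; this is not the code in `extract_genomic_dna.py`.

```python
def build_header(base_id, fields, name_col=None):
    """Build a FASTA header, appending the BED name as a description.

    Hypothetical helper for illustration only. `fields` is the split BED
    line and `name_col` the zero-based index of the name column, or None
    for inputs that carry no name.
    """
    # Bind `name` before any branch so the later check can never raise
    # UnboundLocalError, whichever input path was taken.
    name = ""
    if name_col is not None and name_col < len(fields):
        name = fields[name_col]

    header = ">" + base_id
    if name.strip():
        # A space keeps the name in the description, outside the ID.
        header += " " + name.strip()
    return header


bed_fields = ["super_1", "139823", "139913", "AK028861", "0", "-"]
with_name = build_header("droPer1_super_1_139823_139913_-", bed_fields, name_col=3)
without_name = build_header("droPer1_super_1_139823_139913_-", bed_fields)
```

Once the header carries the name, the expected files in test-data would still need to be regenerated so that the diffs shown above no longer fail.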