maf scripts

jason_sahl avatarjason_sahl created an issue

Hi, I'm trying to filter a maf file with the scripts provided. What I'd like to do is filter a maf such that only blocks that only contain sequence from a listed set of species are returned. The maf_limit_to_species.py script looks promising, but when I run:

maf_limit_to_species.py species1,species2,species3 < maf > newmaf

I get blocks that contain only species1 and species 2, but not species 3, for example. Is there another script or a modification to this script that I can use to only return blocks that only contain sequence from all listed species?

thanks for your help, Jason

Comments (11)

  1. James Taylor

    I believe 'maf_thread_for_species.py' will do what you want. It gives blocks that contain all of the requested species (and attempts to fuse blocks).

  2. jason_sahl

    Thanks for your reply, James. That script looks like it returns blocks if they contain all of the species that I list, but the blocks may contain sequences from other species that I'm not interested in. I'm interested in looking for unique blocks between a subset of species in a large maf alignment. I'm sure that it's possible with the bx-tools, but I didn't want to reinvent the wheel needlessly.

    thanks,

    Jason

  3. Bob Harris

    Try something like this, by replacing the core of maf_limit_to_species.py with the following. I have not tested this, but it seems pretty straightforward.

    requiredSpecies = sys.argv[1].split( ',' )

    maf_reader = bx.align.maf.Reader( sys.stdin ) maf_writer = bx.align.maf.Writer( sys.stdout )

    for m in maf_reader: new_components = [] speciesPresent = {} for comp in m.components: species = comp.src.split( '.' )[0] if species in requiredSpecies: speciesPresent[species] = True new_components.append( comp )

    noneMissing = True for species in requiredSpecies: if species not speciesPresent: noneMissing = False break

    if noneMissing: m.components = new_components maf_writer.write( m )

    maf_reader.close() maf_writer.close()

  4. jason_sahl

    Thanks for your help- I ran the script and the output looks to be the same as what I get from running maf_thread_for_species.py. I think I just need to add something to skip blocks entirely if they contain other species than those passed in as arguments. I'll have to look again at the maf_writer.

    thanks, Jason

  5. Bob Harris

    Sorry, I didn't fully understand what you were after (mainly because I used the ready-fire-aim design paradigm). The attached snippet should give you all the blocks that have every species you specify, and no others. As before, I have not tested it at all.

    Bob H

  6. jason_sahl

    Hi Bob. I downloaded the snippet, but it looks to be the same snippet as you sent previously. I pasted it into the script and it also behaves the same.

    thanks, Jason

  7. Anonymous

    Sorry, apparently I didn't save my changes before I posted the updated version. Now, for some reason bitbucket isn't letting me add an attachment. But I stumbled upon the "code" button. So here is the what the whole script would look like:

    """
    Read a maf file from stdin and write out a new maf with only those blocks
    having all of the required in species and no others.
    
    NOT TESTED AT ALL
    
    usage: %prog species,species2,... < maf
    """
    
    import psyco_full
    
    import bx.align.maf
    import copy
    import sys
    
    from itertools import *
    
    def main():
    
        requiredSpecies = sys.argv[1].split( ',' )
    
        maf_reader = bx.align.maf.Reader( sys.stdin )
        maf_writer = bx.align.maf.Writer( sys.stdout )
    
        for m in maf_reader:        
            speciesPresent = {}
            new_components = []    
            othersPresent = False
            for comp in m.components:
                species = comp.src.split( '.' )[0]
                if species in requiredSpecies:
                    speciesPresent[species] = True
                    new_components.append( comp )
                else:
                     othersPresent = True
            if othersPresent: continue
    
            noneMissing = True
            for species in requiredSpecies:
                if species not in speciesPresent:
                    noneMissing = False
                    break
    
            if noneMissing:
                m.components = new_components
                maf_writer.write( m )
            
        maf_reader.close()
        maf_writer.close()
    
    if __name__ == "__main__": 
        main()
    
  8. Anonymous

    You can probably improve performance by adding a break after the "othersPresent = True" line. Similar to the one after noneMissing = False. No need for the loop to complete in those cases.

    Bob H

  9. Log in to comment
Tip: Filter by directory path e.g. /media app.js to search for public/media/app.js.
Tip: Use camelCasing e.g. ProjME to search for ProjectModifiedEvent.java.
Tip: Filter by extension type e.g. /repo .js to search for all .js files in the /repo directory.
Tip: Separate your search with spaces e.g. /ssh pom.xml to search for src/ssh/pom.xml.
Tip: Use ↑ and ↓ arrow keys to navigate and return to view the file.
Tip: You can also navigate files with Ctrl+j (next) and Ctrl+k (previous) and view the file with Ctrl+o.
Tip: You can also navigate files with Alt+j (next) and Alt+k (previous) and view the file with Alt+o.