maf_extract_ranges_indexed fails on blocks non-unique reference

Issue #18 resolved
Anonymous created an issue

(I am Bob H, I don't seem to be able to login, otherwise I would make this change in the source code).

maf_extract_ranges_indexed fails when a maf block contains more than one line with the desired reference sequence. The failure manifests as an abort, from the range error test in coord_to_col() in align/core.py.

What happens is that the index creation program records such a block once for each time the reference species occurs in the block (recording it at different intervals). When maf_extract_ranges_indexed asks for all the blocks intersecting a given interval of interest, it (correctly) is given all such blocks. But the call to get_component_by_src() only returns the first component, and that component may not be the one that actually intersects the interval of interest.

A stopgap change is to add a test for an empty interval after slice_start and slice_end are calculated (the third line shown below):

{{{

!python

            slice_start = max( start, ref.get_forward_strand_start() )
            slice_end = min( end, ref.get_forward_strand_end() )
            if (slice_end <= slice_start): continue

}}}

This is not necessarily the right solution, but it does prevent the failure.

Comments (6)

  1. Anonymous

    (Bob H again)

    A better fix (probably) would be to add a Alignment.get_components_by_src() method to core.py:

        def get_components_by_src( self, src ): # see change Apr/21/2010
            for c in self.components:
                if c.src == src: yield c
    

    Then change maf_extract_ranges_indexed.py so that it handles all references:

                    #ref = block.get_component_by_src( src )
                    for ref in block.get_components_by_src( src ):
                        slice_start = max( start, ref.get_forward_strand_start() )
                        #...
                        #the rest of the for block in block loop indented to be part of the ref loop
                        #still containing my earlier change to test for empty intervals
                        #...
                        out.write( sliced )
    

    The only downside I can see is if the maf file contains such blocks multiple times. I'm not convinced that this "better" fix might just be reporting the same thing multiple times.

  2. James Taylor repo owner

    I would commit it, it is an improvement. I had never encountered such blocks since multiz is (was?) incapable of producing them. Although I suppose the self alignments would exacerbate this.

  3. Bob Harris

    Where we ran into these is with the EPO alignments (from ensembl). They allow segmental duplications. I'm pretty sure multiz doesn't.

    I'll commit the change in the next few days. (pester me if I don't).

    FYI, there are other issues with EPO alignments, in maf form, but they can be avoided by knowing which species were considered high coverage and which were low coverage (not necessarily the same as whether the assembly really was high or low), and discarding all the low coverage ones (the components for those are not proper maf, in at least three important aspects).

  4. Log in to comment