1. James Taylor
  2. bx-python
  3. Issues
Issue #6 on hold

report percent id for blastz/lav parser

Brent Pedersen
created an issue

the percent id calculated, but it's not accessible: http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/align/lav.py#cl-327

(if it is, an example in the tests would be useful).

Comments (6)

  1. Brent Pedersen reporter

    i have a patch for this attached. however, i'd like feedback on how it's done -- just sticking an arbitrary attribute 'percent_id' on the Alignment object. is there a better way? also this file is full of tabs, so i'd like to commit the version where it's converted to spaces. comments?

  2. Brent Pedersen reporter

    hmm, i dont know how to add a file. here's the patch:

    diff --git a/lib/bx/align/lav.py b/lib/bx/align/lav.py
    --- a/lib/bx/align/lav.py
    +++ b/lib/bx/align/lav.py
    @@ -324,6 +324,8 @@ class Reader(object):
                    self.open_seqs()
                    text1 = text2 = ""
                    end1  = end2  = None
    +               p_len = 0.0
    +               tot_len = 0.0
                    for (start1,start2,length,pctId) in pieces:
                            if (end1 != None):
                                    if (start1 == end1): # insertion in sequence 2
    @@ -337,6 +339,8 @@ class Reader(object):
                            text2 += self.seq2_file.get(start2,length)
                            end1 = start1 + length
                            end2 = start2 + length
    +                       p_len += pctId * length
    +                       tot_len += length
                    # create alignment
                    start1 = pieces[0][0]
                    start2 = pieces[0][1]
    @@ -349,6 +353,9 @@ class Reader(object):
                    a.add_component(Component(self.seq1_src,start1,size1,self.seq1_strand,text=text1))
                    #if (self.seq2_strand == "-"): start2 = self.seq2_file.length - end2
                    a.add_component(Component(self.seq2_src,start2,size2,self.seq2_strand,text=text2))
    +
    +        # add the overall percent_id for the entire block
    +               a.percent_id = (p_len / 100.) / tot_len
                    return a
    
            def path_to_src_name(self,path_name):
    
  3. Bob Harris

    Howdy, Brent,

    I have several related points below.

    (1) Your patch looks reasonable, but I would change one thing. In the statement "p_len += pctId * length", you should probably do some rounding to compensate for the fact that the number in the lav file is rounded to nearest percent. E.g. if the piece has length 14 and the alignment had 9 matches, the value for pctid in the file is 64. In your sum, you are adding 896 for this, effectively 8.96. Possibly such errors will cancel over the whole sum.

    (2) You may be better off just recomputing percent_id from the alignment components' text. Blastz occasionally produces the wrong pct_id for a piece, up or down 1 from the correct rounded value. This is because it had a bug in which it missed a few base positions in the calculation. LAV doesn't contain the alignment text, but it is re-fetched in that method, so you could just scan the text to compute percent_id. And I think the proper place for such a method would be in the alignment class, so that it would not be tied solely to LAV.

    (3) What is your motivation to use LAV? If it is only because you are using BLASTZ to create new alignments, you could instead use LASTZ (at the Miller Lab website, www.bx.psu.edu/miller_lab). It is a drop-in replacement for BLASTZ but has bug corrections and adds more features. Among the new features is support for many output formats, including LAV, MAF and a general tab-delimited format. In the latter format you can choose identity as one of the columns produced. While there is currently no reader object for this format, it would be easy to write one.

    (4) For what it is worth, we have written a LAV spec years after the fact (also at the MIller lab page).

    Bob H

  4. Brent Pedersen reporter

    hi bob, as you know i have since switched to using LASTZ. so i no longer need to parse .lav files.

    i'll leave your (1) suggestion as i'm not sure the best way to correct that and i'm to lazy to dig around as you suggest in (2). thanks.

    -brent

  5. Bob Harris

    Sounds good, so I'll mark this issue as "put on hold".

    Bob H

    P.S. I didn't actually make the connection that you were the same brentp that emailed me a few says back. James can attest to the fact that my short term memory is pretty poor.

  6. Log in to comment