breaks if index exceeds 4GB filesize

Issue #56 new
Marvin Jens
created an issue

Dear authors of bx_python,

thank you for sharing this very useful piece of software! I attempted to use the galaxy scripts from command line to index the MAF blocks of the 100 mammal alignments from UCSC (hg19_100way). I succeed with fly (dm3_15way), but for the big alignment I run into a size-restriction. The MAFs for the largest chromosome are very big (chr1.maf is ~22GB with LZO compression). Trying to build an index with fails with:

line 83, in <module>
    if __name__ == "__main__": main()
line 80, in main
    indexes.write( out )
line 332, in write
    write_packed( f, ">I", base )
line 463, in write_packed
    f.write( pack( pattern, *vals ) )
struct.error: 'I' format requires 0 <= number <= 4294967295

I need this to create stitched FASTA sequences later, using

Again, this works for fly, but fails for the much larger 100way. The indices for smaller chromosomes can be built and are all below (but getting close to) 4GB file size. I believe that the base/offset of a bin-index inside the index file is the culprit. But replacing this in the open() and write() with a 'Q' instead of 'I' breaks the binary format. My insight into this is limited (and of course also time, as usual).

However, since such huge alignments are used in publications (and for instance work in the UCSC genome browser multiz view) I assume that this problem has already been solved one way or the other. I would be very happy if you could give me a hint as to how to solve or circumvent this problem.

Thank you very much and best regards.

Comments (8)

  1. Marvin Jens reporter

    Any thoughts on this? How is this done on the genome browser/Galaxy? Did you just split the larger chromosomes into smaller files? I would really appreciate some advice. Thanks!

  2. Bob Harris

    This is a shortcoming in the design of the index file format. The format/layout is shown at the front of Line 42 shows that only 4 bytes are used to record a position within the file -- this is the failure you've encountered, and lines 58 and 60 show fields that could have the same problem.

    The only fix I can think of involves creating version 2 of the file format (implemented with backward compatible support for versions 0 and 1).

    I think all files written with version 2 in the header are going to be rejected by the current version of the package. The implication of this is that, if new files are by default written as version 2, anyone who doesn't update the package will be out of luck. It would be nice to avoid this is possible, and only write a version 2 file when a version 1 file isn't sufficient.

    One solution, probably the easiest to implement, would be for version 2 to simply write all offsets as 8 bytes. Since (I think) the entire index list exists in memory before the file is written (verify that), the package ought to be able to decide version 1 vs version 2 just before it writes the file.

    I'd like to hear what the other package authors think. I really haven't been involved with this package in many many years. It might be beneficial to seek input from a broader community, in case there's other functionality that could/should be incorporated at the same time. Or ... that might be risking functional creep.

  3. Marvin Jens reporter

    Dear Bob, thank you for clarifying this. Yes, all the Intervals are kept in memory and are only written to disk in the end. I tried to replace the base/offset field 4 byte unsigned int (pack code 'I') with 8 byte ints (code 'Q') but this broke the format (I assume bytes_required() needs fixing as well).

    As a quick workaround, would you recommend splitting the MAF? How can I actually build an index-set that allows queries over multiple MAF files? In the end I need to use and would like to avoid splitting the BED file into chromosomes (and then potentially "sub"-chromosomes for the large ones).

  4. Bob Harris

    I don't have any recommendation for splitting the MAF.

    It would take me some time for me to figure this out. Understand that it's been nearly a decade since I've looked at this package. I contributed some code to it (axt and lav support, maybe 2bit support, not sure). While my name is in the header for interval_index_file, my involvement with it was mainly in writing a C reader or writer for that file format. The description of the file format looks like something I added, and the number-of-bytes field may have been something I added to the format (not sure). The coding style doesn't look like me, but after a decade it's hard to say.

    I notice there are several forks of this repository. Maybe this issue has been solved in one of those?

    If I find some time (no guarantee), I will try to see if I can hack up a version that uses 8 byte file offsets.

    Bob H

  5. Marvin Jens reporter

    Dear Bob, I know the feeling of looking at ones own, "dated" code (albeit I didn't look back an entire decade, yet), so I understand :) This said, the code looked clean enough for me to understand the underlying issue within a few minutes. So, no complaints there! Fixing it, of course, is a different matter... For the time being, would you know by chance where I could contact the people who made the 100way alignment available in the UCSC genome browser? I'm thinking they must have solved the issue because the multiz view works, right? Would the mailing list be a good place? Thanks again!

  6. Bob Harris

    I don't think the UCSC team uses this package, but it wouldn't hurt to ask what their solution is. The best place to ask would be the browser mailing list, info at . I know they've been pretty responsive to user questions in the past, I assume that will still be true.

    You might also ask on the galaxy list, since galaxy does use this package. Even if they haven't encountered this problem, they will have motivation to solve it if it's something their users will be running into in the near future.

    Are you indexing your maf by every species or just by one? Admittedly, I haven't thought this all the way through, but instead of the idea of splitting the my into subchromosomal pieces, could you just build a separate index file for each species?

  7. Marvin Jens reporter

    Thank you very much for pointing out the indices for different species. I now realize that the default is to build it for all 100 and that may well be the problem! In the end, I just need the human reference. Will try again now. If this reduces the size-requirement of the index file in proportion, the 4GB limit could become a non-issue again. Really helpful comment! If this doesn't help I will ask on Galaxy and UCSC in parallel. I did not know the UCSC team had a different toolchain. Thank you for pointing this out, as well.

  8. Log in to comment