Dear authors of bx_python,
thank you for sharing this very useful piece of software! I attempted to use the galaxy scripts from the command line to index the MAF blocks of the 100 mammal alignments from UCSC (hg19_100way). I succeeded with fly (dm3_15way), but for the big alignment I ran into a size restriction. The MAFs for the largest chromosomes are very big (chr1.maf is ~22 GB with LZO compression). Trying to build an index with maf_build_index.py fails with:
  File "/home/marjens/galaxy/cluster_env_for_galaxy/bin/maf_build_index.py", line 83, in <module>
    if __name__ == "__main__": main()
  File "/home/marjens/galaxy/cluster_env_for_galaxy/bin/maf_build_index.py", line 80, in main
    indexes.write( out )
  File "/home/marjens/galaxy/cluster_env_for_galaxy/lib/python2.7/site-packages/bx/interval_index_file.py", line 332, in write
    write_packed( f, ">I", base )
  File "/home/marjens/galaxy/cluster_env_for_galaxy/lib/python2.7/site-packages/bx/interval_index_file.py", line 463, in write_packed
    f.write( pack( pattern, *vals ) )
struct.error: 'I' format requires 0 <= number <= 4294967295
I need this index to create stitched FASTA sequences later, using interval_maf_to_merged_fasta.py.
Again, this works for fly but fails for the much larger 100way. The indices for the smaller chromosomes can be built, and all of them are below (but getting close to) 4 GB in file size. I believe the culprit is the base/offset of a bin index inside the index file, which is written as a 32-bit unsigned integer ('I') and so cannot exceed 4294967295. However, replacing 'I' with 'Q' in open() and write() breaks the binary format. My insight into this is limited (as is, of course, my time).
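
To illustrate the limit, here is a minimal sketch that uses only Python's standard struct module, not the bx_python code itself. The 5 GiB offset and the helper name try_pack are made up for the example. It shows that a value above 4294967295 cannot be packed as 'I', and that 'Q' accepts it but occupies 8 bytes instead of 4:

```python
import struct

MAX_U32 = 2**32 - 1  # 4294967295, the limit named in the error message


def try_pack(fmt, value):
    """Return the packed bytes, or None if the value does not fit the format."""
    try:
        return struct.pack(fmt, value)
    except struct.error:
        return None


# Hypothetical file offset of 5 GiB, i.e. beyond the 32-bit range.
offset = 5 * 1024**3

fits_u32 = try_pack(">I", MAX_U32)   # largest value 'I' can hold
too_big_u32 = try_pack(">I", offset)  # fails, as in the traceback
as_u64 = try_pack(">Q", offset)      # works, but is twice as wide

print(len(fits_u32))   # width in bytes of a packed 'I' field
print(too_big_u32)     # None: the offset does not fit in 'I'
print(len(as_u64))     # width in bytes of a packed 'Q' field
```

Because a 'Q' field is 8 bytes wide where an 'I' field is 4, every field written after it moves. Presumably any part of the reader or writer that assumes 4-byte offsets (for example, when computing where later fields start) would have to change consistently, which may be why swapping the format character alone is not enough.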
However, since such huge alignments are used in publications (and, for instance, work in the UCSC genome browser multiz view), I assume that this problem has already been solved one way or another. I would be very happy if you could give me a hint on how to solve or circumvent it.
Thank you very much and best regards.