consider reworking gene track inclusion

Issue #46 new
Thomas Gilgenast created an issue

currently, we bundle the refeseq gene tracks for a selection of genome builds (mm9, mm10, hg18, hg19, hg38) into the distribution

the pro is that plotting gene tracks for these assemblies "just works" immediately

the con is that these gene track files take up the majority of our wheel/sdist size (~20MB)

one possible alternative is to

  1. create a ~/.lib5c directory on install
  2. when plotting gene tracks, check that folder to see if the gene track is already there
  3. if not, download it from UCSC and put it in that folder

Comments (4)

  1. Thomas Gilgenast reporter

    as an update, I think that http://hgdownload.soe.ucsc.edu/goldenPath/<assembly>/database/refGene.txt.gz should work to get the UCSC RefSeq genes according to this text from the UCSC genome browser:

    The raw data for these tracks can be accessed in multiple ways. It can be explored interactively using the Table Browser or Data Integrator. The tables can also be accessed programmatically through our public MySQL server or downloaded from our downloads server for local processing.

    The data in the RefSeq Other and RefSeq Diffs tracks are organized in bigBed file format; more information about accessing the information in this bigBed file can be found below. The other subtracks are associated with database tables as follows:

    genePred format:

    • RefSeq All - ncbiRefSeq
    • RefSeq Curated - ncbiRefSeqCurated
    • RefSeq Predicted - ncbiRefSeqPredicted
    • UCSC RefSeq - refGene

    also noting that this really is the gene annotation we are typically using:

    UCSC RefSeq – annotations generated from UCSC's realignment of RNAs with NM and NR accessions to the human genome. This track was previously known as the "RefSeq Genes" track.

    this means that we should be able to download gene annotations on-demand using something like

    >>> import urllib
    >>> urllib.urlretrieve('http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz', filename='~/.lib5c/hg38_refseq_genes.gz')
    

    where passing filename='~/.lib5c/hg38_refseq_genes.gz' doesn't literally work but we would create some kind of utility function to figure out this path, like:

    >>> import os
    >>> lib5c_folder = os.path.join(os.path.expanduser('~'), '.lib5c')
    >>> gene_file = os.path.join(lib5c_folder, '%s_refseq_genes.gz' % assembly)
    

    and make sure that the lib5c_folder exists via check_outdir('%s/' % lib5c_folder)

    one issue is that these gene annotations may be updated, i.e., they are not fixed for a certain genome assembly such as 'mm9'

    at least in the old approach, there was consistency about which version of the gene annotations were being plotted, because they were the single time-frozen version that we first bundled with lib5c

    in the new approach, the latest version will be grabbed if the annotation for that assembly had not been downloaded before, and if it had been downloaded before then it won't be updated unless that file is manually deleted from the ~/.lib5c folder

    somehow I doubt that these annotations can change super significantly, and the details of the gene annotation currently in use can always be checked by looking at the timestamp on the file, so I think this is okay; if UCSC starts versioning these files then we can rethink this

  2. Thomas Gilgenast reporter

    we opened this back up after #28 and came up with a new solution: we could use cruzdb to load genes from any assembly on-the-fly, which is actually way faster than our current loading of the entire gene table from the local disk. we would pick up an additional dependency on pymysql, but it weighs in at under 50 KB (cruzdb is 30 KB; the packages don't appear to trigger installation of any other dependencies). there's an example of this at the end of this notebook:

    https://colab.research.google.com/drive/1dHljn0_4pSaqo8nPz-wNNBbKe7wmFIZX

    as an added bonus, this should also allow easy toggling between refGene and knownGene tables for example

  3. Thomas Gilgenast reporter

    we noted that client libraries (e.g., hiczoom) might depend on the bundled gene tracks existing on disk in particular locations

    that wasn’t really an officially exposed API, but this just suggests that going forward there should be an officially exposed API to get gene information, which can be used by the gene track plotter but also by client libraries as desired

    it could end up being placed in lib5c.contrib.cruzdb and might have a signature like (genome, track_name, grange) -> features where genome is like 'hg19', track_name is like 'refGene', grange is like {'chrom': 'chr1', 'start': 0, 'end': 1000000}, and features is a list of feature dicts whose keys match the old lib5c.parsers.genes.load_gene_table() format

    it could take kwargs to keep only the longest isoform, see #49

  4. Log in to comment