consider reworking gene track inclusion

Thomas Gilgenast reporter

as an update, I think that http://hgdownload.soe.ucsc.edu/goldenPath/<assembly>/database/refGene.txt.gz should work to get the UCSC RefSeq genes according to this text from the UCSC genome browser:

The raw data for these tracks can be accessed in multiple ways. It can be explored interactively using the Table Browser or Data Integrator. The tables can also be accessed programmatically through our public MySQL server or downloaded from our downloads server for local processing.

The data in the RefSeq Other and RefSeq Diffs tracks are organized in bigBed file format; more information about accessing the information in this bigBed file can be found below. The other subtracks are associated with database tables as follows:

genePred format:

RefSeq All - ncbiRefSeq

RefSeq Curated - ncbiRefSeqCurated

RefSeq Predicted - ncbiRefSeqPredicted

UCSC RefSeq - refGene

also noting that this really is the gene annotation we are typically using:

UCSC RefSeq – annotations generated from UCSC's realignment of RNAs with NM and NR accessions to the human genome. This track was previously known as the "RefSeq Genes" track.

this means that we should be able to download gene annotations on-demand using something like

>>> import urllib
>>> urllib.urlretrieve('http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz', filename='~/.lib5c/hg38_refseq_genes.gz')

where passing filename='~/.lib5c/hg38_refseq_genes.gz' doesn't literally work but we would create some kind of utility function to figure out this path, like:

>>> import os
>>> lib5c_folder = os.path.join(os.path.expanduser('~'), '.lib5c')
>>> gene_file = os.path.join(lib5c_folder, '%s_refseq_genes.gz' % assembly)

and make sure that the lib5c_folder exists via check_outdir('%s/' % lib5c_folder)

one issue is that these gene annotations may be updated, i.e., they are not fixed for a certain genome assembly such as 'mm9'

at least in the old approach, there was consistency about which version of the gene annotations were being plotted, because they were the single time-frozen version that we first bundled with lib5c

in the new approach, the latest version will be grabbed if the annotation for that assembly had not been downloaded before, and if it had been downloaded before then it won't be updated unless that file is manually deleted from the ~/.lib5c folder

somehow I doubt that these annotations can change super significantly, and the details of the gene annotation currently in use can always be checked by looking at the timestamp on the file, so I think this is okay; if UCSC starts versioning these files then we can rethink this

2018-12-19T23:39:02+00:00

Thomas Gilgenast reporter

we opened this back up after ~~#28~~ and came up with a new solution: we could use cruzdb to load genes from any assembly on-the-fly, which is actually way faster than our current loading of the entire gene table from the local disk. we would pick up an additional dependency on pymysql, but it weighs in at under 50 KB (cruzdb is 30 KB; the packages don't appear to trigger installation of any other dependencies). there's an example of this at the end of this notebook:

https://colab.research.google.com/drive/1dHljn0_4pSaqo8nPz-wNNBbKe7wmFIZX

as an added bonus, this should also allow easy toggling between refGene and knownGene tables for example

2019-04-18T03:08:57+00:00

Thomas Gilgenast reporter

we noted that client libraries (e.g., hiczoom) might depend on the bundled gene tracks existing on disk in particular locations

that wasn’t really an officially exposed API, but this just suggests that going forward there should be an officially exposed API to get gene information, which can be used by the gene track plotter but also by client libraries as desired

it could end up being placed in lib5c.contrib.cruzdb and might have a signature like (genome, track_name, grange) -> features where genome is like 'hg19', track_name is like 'refGene', grange is like {'chrom': 'chr1', 'start': 0, 'end': 1000000}, and features is a list of feature dicts whose keys match the old lib5c.parsers.genes.load_gene_table() format

it could take kwargs to keep only the longest isoform, see #49

2019-06-05T21:12:15+00:00

Thomas Gilgenast reporter

changed milestone to 0.6.0

2019-06-21T19:11:18+00:00

Comments (4)