consider reworking gene track inclusion
currently, we bundle the refeseq gene tracks for a selection of genome builds (mm9, mm10, hg18, hg19, hg38) into the distribution
the pro is that plotting gene tracks for these assemblies "just works" immediately
the con is that these gene track files take up the majority of our wheel/sdist size (~20MB)
one possible alternative is to
- create a
~/.lib5c
directory on install - when plotting gene tracks, check that folder to see if the gene track is already there
- if not, download it from UCSC and put it in that folder
Comments (4)
-
reporter -
reporter we opened this back up after
#28and came up with a new solution: we could use cruzdb to load genes from any assembly on-the-fly, which is actually way faster than our current loading of the entire gene table from the local disk. we would pick up an additional dependency on pymysql, but it weighs in at under 50 KB (cruzdb is 30 KB; the packages don't appear to trigger installation of any other dependencies). there's an example of this at the end of this notebook:https://colab.research.google.com/drive/1dHljn0_4pSaqo8nPz-wNNBbKe7wmFIZX
as an added bonus, this should also allow easy toggling between refGene and knownGene tables for example
-
reporter we noted that client libraries (e.g., hiczoom) might depend on the bundled gene tracks existing on disk in particular locations
that wasn’t really an officially exposed API, but this just suggests that going forward there should be an officially exposed API to get gene information, which can be used by the gene track plotter but also by client libraries as desired
it could end up being placed in
lib5c.contrib.cruzdb
and might have a signature like(genome, track_name, grange) -> features
wheregenome
is like'hg19'
,track_name
is like'refGene'
,grange
is like{'chrom': 'chr1', 'start': 0, 'end': 1000000}
, andfeatures
is a list of feature dicts whose keys match the oldlib5c.parsers.genes.load_gene_table()
formatit could take kwargs to keep only the longest isoform, see #49
-
reporter - changed milestone to 0.6.0
- Log in to comment
as an update, I think that
http://hgdownload.soe.ucsc.edu/goldenPath/<assembly>/database/refGene.txt.gz
should work to get the UCSC RefSeq genes according to this text from the UCSC genome browser:also noting that this really is the gene annotation we are typically using:
this means that we should be able to download gene annotations on-demand using something like
where passing
filename='~/.lib5c/hg38_refseq_genes.gz'
doesn't literally work but we would create some kind of utility function to figure out this path, like:and make sure that the
lib5c_folder
exists viacheck_outdir('%s/' % lib5c_folder)
one issue is that these gene annotations may be updated, i.e., they are not fixed for a certain genome assembly such as
'mm9'
at least in the old approach, there was consistency about which version of the gene annotations were being plotted, because they were the single time-frozen version that we first bundled with lib5c
in the new approach, the latest version will be grabbed if the annotation for that assembly had not been downloaded before, and if it had been downloaded before then it won't be updated unless that file is manually deleted from the
~/.lib5c
foldersomehow I doubt that these annotations can change super significantly, and the details of the gene annotation currently in use can always be checked by looking at the timestamp on the file, so I think this is okay; if UCSC starts versioning these files then we can rethink this