Tutorial for using HG38 (or at least more detailed documentation)?

Issue #21 new
Sviatoslav Kendall created an issue

I want to use OncodriveFML on data that uses HG38 as the reference but the documentation explaining how to do this is a little bit scattered and vague. It would be extremely helpful to have a tutorial walking users through the process of setting everything up to run an analysis using HG38 as the reference genome. Alternatively, simply expanding the documentation sections listed below to make them more detailed would also go a long way towards making OncodriveFML more user-friendly.

Reading the “Configuration” page of the documentation makes clear that the reference genome must be downloaded from UCSC’s website, but it doesn’t specify which files to download or where to store them so that OncodriveFML can access them. Further down on the same page is a warning that the scores file must be compatible with the reference genome being used and to “update all related parameters”, but it’s not clear how to make sure the scores file is compatible with the reference genome or what those “related parameters” would be.

On the “Behind the Scenes” page of the documentation, there are bgdata commands that return error messages when run from the command line:

$ bgdata datasets genomereference hg19
Usage: bgdata [OPTIONS] COMMAND [ARGS]...

Error: No such command "datasets".

Is this an indication that the bgdata package is not working the way it is supposed to? Is the OncodriveFML documentation providing bgdata commands that don’t work? Or are these commands meant to be called inside a Python shell? They produce error messages there too.

Also on the “Behind the Scenes” page of the documentation, there are instructions to modify the code in the "oncodrivefml.signature" module without any clear explanation of what sorts of modifications need to be made. Looking at the code for the signature.py module, one can infer that simply modifying the configuration file to specify build = 'hg38' might be all that’s necessary as long as the hg38 reference genome has been downloaded.
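If that inference holds, the change amounts to a single line in the configuration file. The snippet below is a minimal sketch of making that edit programmatically. The sample configuration text, the `[genome]` and `[settings]` section names, and the helper `set_genome_build` are assumptions made for illustration; only the `build = 'hg38'` setting comes from the discussion above.

```python
import re

# Hypothetical excerpt of an OncodriveFML configuration file. The section
# names and the extra key are assumptions used only to make the example run.
SAMPLE_CONF = """\
[genome]
# Build of the reference genome
build = 'hg19'

[settings]
cores = 4
"""


def set_genome_build(conf_text: str, build: str) -> str:
    """Return conf_text with the first uncommented `build = ...` line set to `build`."""
    # Match a line that starts with optional spaces/tabs followed by "build =".
    pattern = re.compile(r"^([ \t]*build[ \t]*=[ \t]*).*$", flags=re.MULTILINE)
    new_text, replaced = pattern.subn(
        lambda m: f"{m.group(1)}'{build}'", conf_text, count=1
    )
    if replaced == 0:
        raise ValueError("no 'build = ...' line found in the configuration text")
    return new_text


updated = set_genome_build(SAMPLE_CONF, "hg38")
print(updated)
```

Editing the file by hand is equally valid; the helper only shows that nothing other than the `build` value needs to change in this scenario.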

On a related note: the version of bgdata that comes with OncodriveFML does not seem to work as the bgdata documentation says it should. Specifically, the ``search`` command seems to allow searching in directories but not subdirectories; the command ``bgdata search datasets`` returns a list of apparent subdirectories, while the command ``bgdata search datasets/genomereference`` returns nothing.

In case it matters: I am running Ubuntu 18.04 and Python 3.6.8. I installed with pip3, and the versions I have are oncodrivefml 2.2.0 and bgdata 2.0.2.

Comments (3)

  1. Iker Reyes

    We will try to update the documentation because, as you mention, it is a bit outdated.

    Regarding the usage of HG38: you do not need to download HG38 from UCSC. bgdata should take care of it if you change the reference genome in the configuration file. When we mention that the scores file needs to be compatible with the reference genome, we refer to the fact that not all CADD versions are compatible with all reference genomes. For example, CADD 1.5 is meant to work with HG38 but not with HG19.
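One way to guard against a mismatch is a small pre-flight check run before the analysis. The sketch below is not part of OncodriveFML or bgdata; the table `SCORE_BUILDS` and the function `scores_match_build` are hypothetical names, and the only entry in the table is the CADD 1.5 pairing stated above. Any further entries would have to come from the score provider's own documentation.

```python
# User-maintained compatibility table (hypothetical helper, not an
# OncodriveFML feature). The single entry reflects the statement above:
# CADD 1.5 works with hg38 but not with hg19.
SCORE_BUILDS = {
    "CADD 1.5": {"hg38"},
}


def scores_match_build(scores_label: str, build: str) -> bool:
    """Return True if the labelled scores file is listed as compatible with `build`."""
    builds = SCORE_BUILDS.get(scores_label)
    if builds is None:
        # Unknown scores: refuse to guess rather than silently accept them.
        raise KeyError(f"no compatibility information recorded for {scores_label!r}")
    return build.lower() in builds


if not scores_match_build("CADD 1.5", "hg38"):
    raise SystemExit("scores file and reference genome build do not match")
```

Raising on unknown labels is a deliberate choice here: an unlisted scores file is treated as unverified instead of being assumed compatible.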

    The bgdata documentation in oncodrivefml refers to bgdata v1, not to the current version, which is v2. We need to update it.

    The ``bgdata search`` command will still not work, because part of its logic needs to be implemented on the server side and we have not done that yet.
