
Welcome

Welcome to the PREDICTD wiki! PREDICTD is a tensor decomposition model for imputing epigenomics maps. This wiki provides more detailed documentation about the PREDICTD code, as well as tutorials for setting up a cluster on Amazon Web Services, running PREDICTD to impute missing data in the Roadmap Epigenomics Consolidated data, and running PREDICTD to impute data for a novel cell type.
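To give a feel for the kind of model involved, here is a minimal, self-contained sketch of tensor-decomposition imputation. This is NOT the PREDICTD implementation: it uses a plain CP/PARAFAC factorization fit by alternating least squares, with an EM-style refill of the missing entries, and the rank, iteration count, and toy tensor shape are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def _kr(U, V):
    """Column-wise Khatri-Rao product; rows ordered as (U-index, V-index)."""
    return np.einsum('ir,jr->ijr', U, V).reshape(-1, U.shape[1])

def cp_impute(tensor, mask, rank=1, iters=50):
    """Impute entries where mask == 0 by alternating between ALS updates of
    the CP factors and refilling missing cells from the reconstruction."""
    obs = mask.astype(bool)
    filled = np.where(obs, tensor, tensor[obs].mean())  # crude initial fill
    I, J, K = tensor.shape
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    for _ in range(iters):
        # Least-squares update of each factor from the matching unfolding.
        A = filled.reshape(I, J * K) @ np.linalg.pinv(_kr(B, C)).T
        B = filled.transpose(1, 0, 2).reshape(J, I * K) @ np.linalg.pinv(_kr(A, C)).T
        C = filled.transpose(2, 0, 1).reshape(K, I * J) @ np.linalg.pinv(_kr(A, B)).T
        recon = np.einsum('ir,jr,kr->ijk', A, B, C)
        filled = np.where(obs, tensor, recon)  # EM step: refill missing cells
    return filled

# Toy example: a rank-1 "cell types x assays x positions" tensor with one
# hidden entry, standing in for a missing experiment.
a, b, c = rng.random(4) + 0.5, rng.random(5) + 0.5, rng.random(6) + 0.5
truth = np.einsum('i,j,k->ijk', a, b, c)
mask = np.ones_like(truth)
mask[1, 2, 3] = 0.0
imputed = cp_impute(truth * mask, mask, rank=1)
```

The imputed value at the masked position recovers the true value because the observed entries constrain the shared factors; PREDICTD applies the same principle at much larger scale, with regularization and a parallel training scheme described in the paper.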

Publication: Durham T, Libbrecht M, Howbert J, Bilmes J, Noble W. PREDICTD PaRallel Epigenomics Data Imputation with Cloud-based Tensor Decomposition. Nature Communications, 2018 (in press).

To learn more about the resource requirements of PREDICTD, see this page.

Tutorials:

Documentation:

The PREDICTD code base is a collection of Python scripts, run largely in sequence, that assemble a data set, train one or more models, and generate the imputed output. The individual scripts, along with their usage and command line arguments, are described in the links below.

Scripts

Libraries

Dependencies:

PREDICTD relies on several other software packages that come pre-installed on the AWS AMI, but that users must install themselves in other contexts. All of PREDICTD's dependencies are listed below:

Recommendations

The following tips and software are not required, but can make working with AWS and Spark easier and more productive:

  • A client program makes it much easier to browse S3 storage from a local machine, manage smaller uploads and downloads, and handle certain configuration tasks. A particularly nice client for Windows is CloudBerry Explorer: it is well designed, and its free version covers the most essential functionality, with a more fully featured version available for purchase.

  • Computing in a distributed framework like Spark can make debugging difficult. To see what code Spark is executing and what resources the cluster is using, consult the Spark console and the Ganglia cluster monitoring report. Both are served on specific ports of your cluster node (for Ganglia to be available after cluster initiation with spark-ec2, you must pass the --ganglia option on the spark-ec2 command line).

    • Spark UI: http://public.DNS.of.your.instance:8080
    • Ganglia UI: http://public.DNS.of.your.instance:5080/ganglia
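As a small convenience, the two console URLs above can be built from the public DNS name that EC2 reports for your node. The ports (8080 for the Spark UI, 5080/ganglia for Ganglia) are the spark-ec2 defaults noted above; the helper function and the example DNS name are hypothetical.

```python
def console_urls(public_dns):
    """Build the Spark UI and Ganglia URLs for a cluster node's public DNS."""
    return {
        'spark': f'http://{public_dns}:8080',
        'ganglia': f'http://{public_dns}:5080/ganglia',
    }

urls = console_urls('ec2-203-0-113-10.compute-1.amazonaws.com')
# urls['spark'] → 'http://ec2-203-0-113-10.compute-1.amazonaws.com:8080'
```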
