
ADAM Demeter User Guide

ADAM Overview

ADAM is a genomics processing engine and specialized file format built using Apache Avro, Apache Spark, and Parquet. It is Apache 2 licensed.

Build Instructions

You need JDK 1.7 installed; make sure JAVA_HOME points to that JDK and that its bin directory is first on your PATH.

export JAVA_HOME=/usr/java/jdk1.7.0_25-cloudera
export PATH=$JAVA_HOME/bin:$PATH
git clone https://github.com/bigdatagenomics/adam.git
cd adam
export "MAVEN_OPTS=-Xmx512m -XX:MaxPermSize=128m"
mvn package
Setting MAVEN_OPTS is necessary because the default heap and permgen space aren't enough to build the entire project. The resulting JAR is available at adam/adam-cli/target/adam-0.7.1-SNAPSHOT.jar.
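
The settings and commands below refer to the built JAR as $ADAM_JAR, so it helps to set that now (the path assumes you cloned adam into your home directory; adjust as needed):

export ADAM_JAR=$HOME/adam/adam-cli/target/adam-0.7.1-SNAPSHOT.jar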

If you've already run the tests and just need to rebuild, you can save time by adding -DskipTests=true to the Maven command line:
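
mvn package -DskipTests=true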

Demeter Spark Settings

# The Spark master environment variable must be set to connect to the master node; otherwise it defaults to local
# TODO: Set this in CM instead of on each job, currently the default is the shortname which does not work
MASTER=spark://demeter-login2.demeter.hpc.mssm.edu:7077 
# This is necessary for Spark worker jobs to have classes in the JAR available to them.
SPARK_CLASSPATH=$ADAM_JAR

# spark.default.parallelism controls the shuffle parallelism. The Spark docs suggest 2x cores, but we have seen disk timeout errors when increasing this significantly.
SPARK_JAVA_OPTS+="-Dspark.default.parallelism=1024"

# From the Spark docs: "consolidates intermediate files created during a shuffle ... It is recommended to set this to "true" when using ext4 or xfs filesystems"
# TODO: Set this in CM instead of on each job
SPARK_JAVA_OPTS+=" -Dspark.shuffle.consolidateFiles=true" 

# Workers can run out of heap space easily, so bump up the max. I had to set the max to 8g to be able to finish a transform.
SPARK_JAVA_OPTS+=" -Xmx8g"

# Configure memory for spark jobs on workers
SPARK_MEM=36G

Monitoring/Debugging

You can monitor job results at http://demeter-login2.demeter.hpc.mssm.edu:18080. The UI shows all the worker nodes (not jobs), then the running jobs, then the finished jobs. For a running job, the link in the 'Name' column shows its current status; if the job is done, click the link in the 'ID' column instead. If you'd like to look at the actual log files, there are two ways. The first and recommended way is to use the proxy setup that Zach has configured; that way you can just use your browser and all the links will work.

Alternatively, you can view them from a shell: decipher the hostname from the worker name (the middle column), then check the following directory on that host:

/var/run/spark/work/<app id>/<executor id>/{stderr,stdout,adam.log}

For example, this is the third section of the main Spark status page, the one that shows completed jobs:

[Screenshot: the completed-jobs table on the Spark status page]

Clicking on the link in the leftmost column brings us to the status page for that job:

[Screenshot: the detail page for the selected job]

If you'd like to view the log files for the first worker: the AppID is app-20140314111329-0052, the hostname is demeter-csmau08-12.demeter.hpc.mssm.edu (extracted from the worker name in the middle column), and the executor ID is 83. So, from one of the login nodes, ssh into demeter-csmau08-12 and cd into

/var/run/spark/work/app-20140314111329-0052/83
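
Putting it together with the example values above (tail -f follows the executor log as the job runs):

ssh demeter-csmau08-12
cd /var/run/spark/work/app-20140314111329-0052/83
tail -f stderr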

BAM/Read Processing

  • The BAM processing commands are run through the ADAM transform command.
  • The final arguments to the transform commands are the input and output paths.
  • If the paths are on HDFS, they must be full URIs including the namenode address, e.g.
INPUT=hdfs://demeter-nn1.demeter.hpc.mssm.edu/user/ahujaa01/ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00096/alignment/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam

OUTPUT=hdfs://demeter-nn1.demeter.hpc.mssm.edu/user/ahujaa01/HG00096.adam
  • All of the following commands (conversion, mark_duplicates, and BQSR) can be chained together and run in a single job; a combined sketch appears after the BQSR section below.

Conversion

java -Xmx8g $SPARK_JAVA_OPTS -jar $ADAM_JAR transform \
-spark_master $MASTER \
-spark_jar $ADAM_JAR \
$INPUT \
$OUTPUT

Mark Duplicates

This is enabled by adding the -mark_duplicate_reads flag.

java -Xmx8g $SPARK_JAVA_OPTS -jar $ADAM_JAR transform \
-mark_duplicate_reads \
-spark_master $MASTER \
-spark_jar $ADAM_JAR \
$INPUT \
$OUTPUT.md

BQSR

This is enabled by adding the -recalibrate_base_qualities flag and specifying a dbSNP sites file. The sites file is a list of tab-separated chromosome and position pairs.
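
For example, the first few lines of a sites file might look like this (tab-separated; these positions are made up for illustration):

1	10019
1	10056
2	45622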

DBSNP=/hpc/users/ahujaa01/release/adam/dbsnp_137.b37.excluding_sites_after_129.sites

java -Xmx8g $SPARK_JAVA_OPTS -jar $ADAM_JAR transform \
-recalibrate_base_qualities \
-dbsnp_sites $DBSNP \
-spark_master $MASTER \
-spark_jar $ADAM_JAR \
$INPUT \
$OUTPUT.bqsr
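
As noted earlier, the conversion, mark_duplicates, and BQSR steps can be chained and run as a single job. A sketch combining all three (the .md.bqsr output suffix is just a convention):

java -Xmx8g $SPARK_JAVA_OPTS -jar $ADAM_JAR transform \
-mark_duplicate_reads \
-recalibrate_base_qualities \
-dbsnp_sites $DBSNP \
-spark_master $MASTER \
-spark_jar $ADAM_JAR \
$INPUT \
$OUTPUT.md.bqsr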

VCF/Variant Processing

For now, you need to clone https://github.com/nealsid/adam and use the adam-vcf branch.
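
For example (building with Maven as in the instructions above; the checkout directory name is just a suggestion):

git clone https://github.com/nealsid/adam.git adam-nealsid
cd adam-nealsid
git checkout adam-vcf
mvn package -DskipTests=true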

VCF to ADAM

Try this on some local data (be sure to change the output filename):

$ java -jar adam-cli/target/adam-0.6.1-SNAPSHOT.jar vcf2adam file:///hpc/users/sidhwn01/r1-1-1.top10k.vcf file://<full path name including leading />

Note that the URL syntax with the "file" scheme is necessary for local files. Also, it can be helpful to create a shell alias for the java command:

$ alias adam='java -jar adam-cli/target/adam-0.6.1-SNAPSHOT.jar'

Running it on the cluster is the same as for BAM file processing above. A full command line is:

$ adam vcf2adam -spark_master spark://demeter-login1.demeter.hpc.mssm.edu:7077 -spark_home /opt/cloudera/parcels/SPARK -spark_jar /hpc/users/sidhwn01/adam-nealsid/adam-cli/target/adam-0.6.1-SNAPSHOT.jar hdfs://demeter-nn1/user/sidhwn01/CEU.exon.2010_09.genotypes.vcf hdfs://demeter-nn1/user/sidhwn01/CEU.exon.2010_09.genotypes.vcfadam

ADAM to VCF

As of 2/11/14, ask Neal whether the changes have been pushed to the main tree; if not, he can give you a JAR file.

$ SPARK_JAR=<JAR file> SPARK_JAVA_OPTS=-Xmx16g java -Xmx16g -jar adam-0.6.1-SNAPSHOT.jar adam2vcf \
hdfs://demeter-nn1/user/sidhwn01/r1-1-1.combined.vcfadam \
hdfs://demeter-nn1/<some output path that ends in .VCF>

A few things to note: the output path must end in .VCF, otherwise the GATK VCF output libraries will complain (I usually use .adam.vcf to distinguish these from normal VCF files), and I've found that the Java heap space for the workers needs to be increased via SPARK_JAVA_OPTS.

The output of the adam2vcf conversion is a sharded file inside the directory you specified as the output on the command line. The header is duplicated in each part. This will likely be cleaned up in the near future.
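
Until then, if you need a single VCF, one way is to concatenate the parts while keeping the header only from the first. A sketch, assuming the shards follow the usual part-* naming and $OUT is the output path you gave adam2vcf:

# Pull the first shard, header included
hdfs dfs -cat $OUT/part-00000 > merged.vcf
# Append the remaining shards with their duplicated headers stripped
for part in $(hdfs dfs -ls $OUT/part-* | awk '{print $NF}' | grep -v part-00000); do
  hdfs dfs -cat $part | grep -v '^#' >> merged.vcf
done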

Spark Shell

Playing around in the spark-shell

Run the shell with the command below. Note that it will run locally, not on the cluster. You can add the above options like -spark_master and -spark_jar if you want to run interactive computation on the cluster, but be aware that it will use up all available cores by default, even while the shell is sitting idle.

$ SPARK_CLASSPATH=<path to your ADAM jar> spark-shell

Now add some imports to reference the types and methods that are necessary:

scala> import edu.berkeley.cs.amplab.adam.rdd.variation.ADAMVariationContext._
scala> import edu.berkeley.cs.amplab.adam.rdd.AdamContext._
scala> import org.apache.spark.rdd.RDD
scala> import edu.berkeley.cs.amplab.adam.avro.ADAMGenotype

Now you can declare an RDD of some ADAM data:

scala> val variants : RDD[ADAMGenotype] = sc.adamLoad("hdfs://demeter-nn1/<some adam file>")

Some example manipulations:

scala> variants.count()
scala> variants.filter(x => x.varIsFiltered != null && x.varIsFiltered == true).count()
scala> variants.map(v => (v.sampleId, 1)).reduceByKey((a, b) => a + b).collect()
scala> variants.map(v => (v.sampleId, 1)).countByKey()
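
Building on these, a quick derived statistic; this continues the same shell session and uses only the fields shown above:

scala> val total = variants.count()
scala> val filtered = variants.filter(x => x.varIsFiltered != null && x.varIsFiltered == true).count()
scala> filtered.toDouble / total  // fraction of genotypes flagged as filtered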
