ADAM Demeter User Guide
ADAM Overview
ADAM is a genomics processing engine and specialized file format built using Apache Avro, Apache Spark, and Parquet. It is Apache 2 licensed.
Build Instructions
You need the Java JDK 1.7 installed; make sure JAVA_HOME points to that JDK and its bin directory is first on your PATH.
export JAVA_HOME=/usr/java/jdk1.7.0_25-cloudera
export PATH=$JAVA_HOME/bin:$PATH
git clone https://github.com/bigdatagenomics/adam.git
cd adam
export "MAVEN_OPTS=-Xmx512m -XX:MaxPermSize=128m"
mvn package
If you've already run the tests and just need to rebuild, you can save time by adding -DskipTests=true to the Maven command line.
Demeter Spark Settings
# The Spark master environment variable must be set to connect to the master node; it defaults to local otherwise.
# TODO: Set this in CM instead of on each job; currently the default is the shortname, which does not work.
MASTER=spark://demeter-login2.demeter.hpc.mssm.edu:7077
# This is necessary for Spark worker jobs to have classes in the JAR available to them.
SPARK_CLASSPATH=$ADAM_JAR
# spark.default.parallelism controls the shuffle parallelism. The Spark docs suggest 2x cores,
# but we have seen disk timeout errors when increasing this significantly.
SPARK_JAVA_OPTS+="-Dspark.default.parallelism=1024"
# From the Spark docs: "consolidates intermediate files created during a shuffle ... It is
# recommended to set this to 'true' when using ext4 or xfs filesystems."
# TODO: Set this in CM instead of on each job.
SPARK_JAVA_OPTS+=" -Dspark.shuffle.consolidateFiles=true"
# Workers can run out of heap space easily, so bump up the max. I had to set the max to 8g
# to be able to finish a transform.
SPARK_JAVA_OPTS+=" -Xmx8g"
# Configure memory for Spark jobs on workers.
SPARK_MEM=36G
Monitoring/Debugging
You can monitor job results at http://demeter-login2.demeter.hpc.mssm.edu:18080. The UI shows all the worker nodes (not jobs), then the running jobs, then the finished jobs. For a running job, the link in the 'Name' column shows its current status; for a finished job, click the link in the 'ID' column instead. If you'd like to look at the actual log files, there are two ways. The first and recommended way is to use the proxy setup that Zach has configured; that way you can use your browser and all the links will work.

Alternatively, you can view them from a shell: decipher the hostname from the worker name (the middle column), then check the following directory:
/var/run/spark/work/<app id>/<executor id>/{stderr,stdout,adam.log}
For example, this is the third section of the main Spark status page, the one that shows completed jobs:

Clicking on the link in the leftmost column brings us to the status page for that job:

If you'd like to view the log files for the first worker, the AppID is app-20140314111329-0052, the hostname is demeter-csmau08-12.demeter.hpc.mssm.edu (extracted from the worker name in the middle column), and the executor ID is 83. So ssh into demeter-csmau08-12 from one of the login nodes and cd into:

/var/run/spark/work/app-20140314111329-0052/83
BAM/Read Processing
- The BAM processing commands are run through the ADAM transform command.
- The final arguments to the transform command are the input and output paths.
- If the paths are on HDFS, they must be the full URI including the namenode address, e.g.:
INPUT=hdfs://demeter-nn1.demeter.hpc.mssm.edu/user/ahujaa01/ftp-trace.ncbi.nih.gov/1000genomes/ftp/data/HG00096/alignment/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
OUTPUT=hdfs://demeter-nn1.demeter.hpc.mssm.edu/user/ahujaa01/HG00096.adam
- All of the following commands (conversion, mark_duplicates, and BQSR) can be chained together and run in a single job.
Conversion
java -Xmx8g $SPARK_JAVA_OPTS -jar $ADAM_JAR transform \
    -spark_master $MASTER \
    -spark_jar $ADAM_JAR \
    $INPUT \
    $OUTPUT
Mark Duplicates
Enable duplicate marking by adding the -mark_duplicate_reads flag.
java -Xmx8g $SPARK_JAVA_OPTS -jar $ADAM_JAR transform \
    -mark_duplicate_reads \
    -spark_master $MASTER \
    -spark_jar $ADAM_JAR \
    $INPUT \
    $OUTPUT.md
BQSR
Enable BQSR by adding the -recalibrate_base_qualities flag and specifying a dbSNP sites file. The sites file is a list of tab-separated chromosome and position pairs.
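If you need to build such a sites file yourself, one way is to keep the first two tab-separated columns of a dbSNP VCF and drop the header lines. A minimal sketch, where dbsnp.vcf is a tiny made-up stand-in for a real dbSNP release file:

```shell
# Create a toy dbSNP VCF (stand-in for the real release file).
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\n1\t10583\trs58108140\tG\tA\n1\t10611\trs189107123\tC\tG\n' > dbsnp.vcf

# Keep chromosome and position (columns 1-2), skipping "#" header lines.
awk -F'\t' '!/^#/ { print $1 "\t" $2 }' dbsnp.vcf > dbsnp.sites

cat dbsnp.sites
```

The resulting file contains one tab-separated chromosome/position pair per line, matching the format described above.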
DBSNP=/hpc/users/ahujaa01/release/adam/dbsnp_137.b37.excluding_sites_after_129.sites
java -Xmx8g $SPARK_JAVA_OPTS -jar $ADAM_JAR transform \
    -recalibrate_base_qualities \
    -dbsnp_sites $DBSNP \
    -spark_master $MASTER \
    -spark_jar $ADAM_JAR \
    $INPUT \
    $OUTPUT.bqsr
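As noted above, the three steps can be chained into a single job. A sketch of one combined transform run, reusing the variables from the previous sections (the memory setting is illustrative):

```shell
# One job: convert, mark duplicates, and recalibrate base qualities in a single pass.
java -Xmx8g $SPARK_JAVA_OPTS -jar $ADAM_JAR transform \
    -mark_duplicate_reads \
    -recalibrate_base_qualities \
    -dbsnp_sites $DBSNP \
    -spark_master $MASTER \
    -spark_jar $ADAM_JAR \
    $INPUT \
    $OUTPUT
```

This avoids writing the intermediate .md output to HDFS between steps.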
VCF/Variant Processing
For now, you need to clone https://github.com/nealsid/adam and use the adam-vcf branch.
VCF to ADAM
Try this on some local data (be sure to change the output filename):
$ java -jar adam-cli/target/adam-0.6.1-SNAPSHOT.jar vcf2adam file:///hpc/users/sidhwn01/r1-1-1.top10k.vcf file://<full path name including leading />
Note that the URL syntax with the "file" scheme is necessary for local files. It can also be helpful to create a shell alias for the java command:
$ alias adam='java -jar adam-cli/target/adam-0.6.1-SNAPSHOT.jar'
Running it on the cluster is the same as for BAM file processing above. A full command line is:
$ adam vcf2adam \
    -spark_master spark://demeter-login1.demeter.hpc.mssm.edu:7077 \
    -spark_home /opt/cloudera/parcels/SPARK \
    -spark_jar /hpc/users/sidhwn01/adam-nealsid/adam-cli/target/adam-0.6.1-SNAPSHOT.jar \
    hdfs://demeter-nn1/user/sidhwn01/CEU.exon.2010_09.genotypes.vcf \
    hdfs://demeter-nn1/user/sidhwn01/CEU.exon.2010_09.genotypes.vcfadam
ADAM to VCF
Ask Neal whether the changes have been pushed to the main tree yet (as of 2/11/14); otherwise he can give you a JAR file.
$ SPARK_JAR=<JAR file> SPARK_JAVA_OPTS=-Xmx16g java -Xmx16g -jar adam-0.6.1-SNAPSHOT.jar adam2vcf \
    hdfs://demeter-nn1/user/sidhwn01/r1-1-1.combined.vcfadam \
    hdfs://demeter-nn1/<some output path that ends in .VCF>
A few things to note: the output path must end in .VCF, otherwise the GATK VCF output libraries will complain. I usually use .adam.vcf to distinguish from normal VCF files. And I've found that the Java heap space for the workers needs to be increased via SPARK_JAVA_OPTS.
The output of the adam2vcf conversion is a set of sharded part files inside the directory that you specified as the output on the command line. The header is duplicated in each part. This will likely be cleaned up in the near future.
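Until then, the shards can be stitched into a single VCF by keeping the header lines only from the first part. A minimal sketch, assuming the parts are plain files named part-* in a local copy of the output directory (the directory, file names, and contents here are made up for illustration):

```shell
# Simulate a sharded adam2vcf output directory with a duplicated header.
mkdir -p out.adam.vcf
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\n1\t100\n' > out.adam.vcf/part-00000
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\n1\t200\n' > out.adam.vcf/part-00001

# Emit "#" header lines only from the first shard; pass data lines through from all shards.
awk 'FNR==1 { file++ } /^#/ { if (file==1) print; next } { print }' \
    out.adam.vcf/part-* > merged.vcf

cat merged.vcf
```

For output still on HDFS, you would first copy the parts locally (e.g. with hadoop fs -get) and then run the same merge.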
Spark Shell
Playing around in the spark-shell
Run the shell with the command below. Note that it will run locally, not on the cluster. You can add the options above, like -spark_master and -spark_jar, if you want to run interactive computations on the cluster, but be aware that it will use all available cores by default, even while the shell is sitting idle.
$ SPARK_CLASSPATH=<path to your ADAM jar> spark-shell
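If you do point the shell at the cluster, you can cap how many cores it holds with the standalone-mode spark.cores.max property. A sketch (the core count of 8 is arbitrary):

```shell
# Cap the interactive shell at 8 cores so an idle shell doesn't hold the whole cluster.
SPARK_JAVA_OPTS="-Dspark.cores.max=8" \
SPARK_CLASSPATH=<path to your ADAM jar> \
spark-shell
```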
Now, add some imports to reference types & methods that are necessary:
scala> import edu.berkeley.cs.amplab.adam.rdd.variation.ADAMVariationContext._
scala> import edu.berkeley.cs.amplab.adam.rdd.AdamContext._
scala> import org.apache.spark.rdd.RDD
scala> import edu.berkeley.cs.amplab.adam.avro.ADAMGenotype
Now you can declare an RDD of some ADAM data:
scala> val variants : RDD[ADAMGenotype] = sc.adamLoad("hdfs://demeter-nn1/<some adam file>")
Some example queries:

scala> variants.count()
scala> variants.filter(x => x.varIsFiltered != null && x.varIsFiltered == true).count()
scala> variants.map(v => (v.sampleId, 1)).reduceByKey((a, b) => a + b).collect()
scala> variants.map(v => (v.sampleId, 1)).countByKey()