
What is this repository for?

This repository contains scripts and Java code used in the maize HapMap3 project. The pipeline consists of four steps: raw genotype calling, IBD-based filtering of the raw calls, LD filtering, and LDKNN imputation. Each of these steps is accomplished using custom Java code and some bash and perl helper scripts. The Java code uses classes from the TASSEL package developed in the Buckler Lab at Cornell University. Different pipeline programs use different versions of TASSEL, all of which are included in this distribution.

The HapMap3 project involves large amounts of data, processed over several years. Individual parts of the pipeline were executed separately, each generating intermediate output files used by subsequent steps. The computations were parallelized in different ways (some over taxa, others over genomic coordinates) and distributed over multiple machines.

The code runs on Linux only.

How do I get set up?

Download the repository

git clone https://bukowski1@bitbucket.org/bukowski1/maize_hapmap3_code.git

To do this, you will need to have git installed on your machine. The git clone command will create a directory maize_hapmap3_code in the directory where the command was executed. This directory will contain the following subfolders:

  • IBD: helper scripts for the IBD filtering step
  • LD: helper scripts for the LD filtering step and for conversion of genotypes to VCF format with proper INFO field parameters
  • LDKNNi: helper scripts for the LDKNN imputation step
  • raw_genos: helper scripts for the raw genotyping step
  • java_code: contains Netbeans projects with the Java code used in the pipeline

Each of the subfolders contains a README file explaining how to execute the corresponding pipeline step.

Configuration

In each of the subfolders of maize_hapmap3_code except java_code, edit all shell and perl scripts (all *.sh and *.pl files) and change the declaration of the variable BINDIR to the full path of the maize_hapmap3_code directory.
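Editing every script by hand can be tedious, so the change could be scripted. The following is only a minimal sketch, not part of the repository: the function name set_bindir is hypothetical, and it assumes that shell scripts declare the variable on a line starting with BINDIR= and that perl scripts declare it on a line starting with $BINDIR = (optionally preceded by "my"). Check the actual declarations in the scripts before relying on it. It also assumes GNU sed, as found on Linux, and a path that contains no | or & characters.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the repository): rewrite the BINDIR
# declaration in all *.sh and *.pl files of the pipeline subfolders.
# ASSUMED declaration forms (verify against the real scripts):
#   shell:  BINDIR=/some/path
#   perl:   $BINDIR = "/some/path";   or   my $BINDIR = "/some/path";
set_bindir() {
    local repo_dir="$1"   # full path to the maize_hapmap3_code directory
    local sub
    # java_code is deliberately left out, as in the instructions above.
    for sub in IBD LD LDKNNi raw_genos; do
        [ -d "$repo_dir/$sub" ] || continue
        # Shell scripts: replace the whole assignment line; keep a .bak copy.
        find "$repo_dir/$sub" -type f -name '*.sh' \
            -exec sed -i.bak "s|^BINDIR=.*|BINDIR=${repo_dir}|" {} +
        # Perl scripts: same idea, preserving an optional leading "my".
        find "$repo_dir/$sub" -type f -name '*.pl' \
            -exec sed -i.bak "s|^\(my  *\)\{0,1\}\\\$BINDIR *=.*|\1\$BINDIR = \"${repo_dir}\";|" {} +
    done
}
```

After sourcing the function, it would be called with the full path of the checkout, for example set_bindir "$(pwd)/maize_hapmap3_code". Each modified file keeps its original next to it with a .bak suffix, so the changes can be compared with diff and reverted if the assumed declaration format turns out to be wrong.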

To examine or modify the Java code, it is convenient to set it up in a Java IDE, such as Netbeans or Eclipse. In the directory java_code, each of the subfolders ContMatPack, IBDfilter, LDfilter, pileup2hdf5, and tassel5 is a separate Netbeans project and should be straightforward to open in Netbeans, possibly with some minor configuration adjustments.

Dependencies

The pipeline requires Java 8, perl, awk, and samtools to be available. The samtools command should be in the PATH.
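Before running any pipeline step, it may help to confirm that these tools can be found. The sketch below is an illustration only: check_deps is a hypothetical function name, and it merely tests that each command is on the PATH without checking versions.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the repository): report which of the
# given commands cannot be found on the PATH.
check_deps() {
    local missing=0 cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing dependency: $cmd" >&2
            missing=1
        fi
    done
    # Returns 0 if everything was found, 1 otherwise.
    return "$missing"
}
```

A typical call would be check_deps java perl awk samtools. The Java version can then be inspected separately with java -version, where Java 8 reports itself as version 1.8.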

How to run tests

There are currently no test data to run on.

Owner and admin of the repository

Robert Bukowski
Cornell University
Institute of Biotechnology
Biotechnology Resource Center
Bioinformatics Facility (formerly: CBSU)
620 Rhodes Hall, Ithaca, NY 14853
E-mail: bukowski@cornell.edu