
This repository contains the scripts to reproduce the analyses in the Segway 2.0 appnote.

Important

Please ensure you have the latest version of Segway installed.

GMM


General pipeline for GMM

Run Segway, then run parse_bedgraph_missing_data.sh on the data track and the Segway annotation. Next, run the main_extract_*.py scripts to extract the datapoints and model information. Finally, calculate the KS test statistic and generate the plots using the generate_*.sh scripts and templates.

/segway_scripts/

The train/identify scripts here are the commands used to run Segway.

parse_bedgraph_missing_data.sh

Fills in missing data in the bedGraph tracks (both data and segmentation) with a value of -1, so that those positions can be discarded later.
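The gap-filling step can be illustrated with a minimal Python sketch. It is not the shell script itself: the function name `fill_missing` and the in-memory tuple representation are assumptions made for illustration, and the input is assumed to be sorted, non-overlapping intervals for a single chromosome.

```python
def fill_missing(intervals, chrom_start, chrom_end, fill_value=-1.0):
    """Cover [chrom_start, chrom_end) completely, filling gaps with fill_value.

    `intervals` is a list of (start, end, value) tuples for one chromosome,
    sorted by start and non-overlapping. bedGraph coordinates are 0-based
    and half-open, so an interval (2, 4) covers positions 2 and 3.
    """
    filled = []
    cursor = chrom_start  # first position not yet covered
    for start, end, value in intervals:
        if start > cursor:
            # Gap before this interval: mark it as missing.
            filled.append((cursor, start, fill_value))
        filled.append((start, end, value))
        cursor = max(cursor, end)
    if cursor < chrom_end:
        # Trailing gap up to the end of the region.
        filled.append((cursor, chrom_end, fill_value))
    return filled
```

Positions carrying the fill value can then be dropped in one pass when the data track and the segmentation are compared.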

main_extract_*.py

Python scripts that extract the datapoints under each label from the bedGraph tracks, and the corresponding Gaussians and weights from the Segway/GMTK params files.

extract_*.py

Libraries used by the Python extraction scripts.

generate_*.sh

Bash scripts that generate the commands to produce the various plots and statistics from the template scripts (there are 20 component-label cases in total).
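The template-expansion idea can be sketched in Python as follows. This is an illustration only: the placeholder names `$components` and `$label`, the function `render_cases`, and the reading of the 20 cases as 2 component counts times 10 labels are all assumptions, not details taken from the generate_*.sh scripts.

```python
from itertools import product
from string import Template


def render_cases(template_text, component_counts, labels):
    """Fill one template for every (component count, label) pair.

    Returns a dict mapping (components, label) to the rendered text.
    The placeholders $components and $label are hypothetical names.
    """
    template = Template(template_text)
    return {
        (k, label): template.substitute(components=k, label=label)
        for k, label in product(component_counts, labels)
    }


# Assumed layout: 1- and 3-component models, labels 0 through 9.
rendered = render_cases("components=$components label=$label", [1, 3], range(10))
```

Each rendered string would then be written out as its own script or command line, one per case.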

ks_test.R

Template script that produces the KS test statistic for a given component number and label.
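For readers unfamiliar with the statistic, here is a minimal Python sketch of a one-sample KS statistic computed against a Gaussian mixture CDF. It is not a port of ks_test.R: the function names are hypothetical, and comparing the datapoints with the mixture built from the extracted weights, means and standard deviations is an assumption about how the test is set up.

```python
import math


def mixture_cdf(x, weights, means, stds):
    """CDF of a Gaussian mixture evaluated at x (weights should sum to 1)."""
    return sum(
        w * 0.5 * (1.0 + math.erf((x - m) / (s * math.sqrt(2.0))))
        for w, m, s in zip(weights, means, stds)
    )


def ks_statistic(samples, cdf):
    """Largest absolute gap between the empirical CDF and `cdf`."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d
```

A smaller statistic means the theoretical mixture follows the empirical distribution more closely, which is the basis for comparing the 1-component and 3-component fits.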

plot_hist_cdf.R

Template script that produces the combined theoretical/empirical histogram for a given component number and label.

doublecomponentplot.R

Script that produces the combined QQ plot for the best label in each case (in the appnote analysis, label 9 for the 3-component model and label 2 for the 1-component model).

minibatch-fixed


/segway_scripts/

The train scripts here are the commands used to run Segway.

generate_fixed_include_regions.sh

Script that generates the random include regions for the fixed case (requires bedtools).
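As a rough illustration of what generating random include regions involves, the sketch below samples fixed-length regions with a seeded generator in plain Python. It does not reproduce the bedtools-based script: the function `random_regions`, its parameters, and the chromosome sizes in the usage note are hypothetical, and sampled regions may overlap.

```python
import random


def random_regions(chrom_sizes, n_regions, length, seed=0):
    """Sample n_regions regions of `length` bases as (chrom, start, end).

    Chromosomes are chosen in proportion to the number of valid start
    positions they offer, so every start position is equally likely.
    """
    rng = random.Random(seed)  # fixed seed keeps the region set reproducible
    chroms = [c for c, size in chrom_sizes.items() if size >= length]
    weights = [chrom_sizes[c] - length + 1 for c in chroms]
    regions = []
    for _ in range(n_regions):
        chrom = rng.choices(chroms, weights=weights)[0]
        start = rng.randint(0, chrom_sizes[chrom] - length)
        regions.append((chrom, start, start + length))
    return sorted(regions)
```

For example, `random_regions({"chr1": 1000, "chr2": 500}, 5, 100, seed=1)` returns five 100-base regions, and calling it again with the same seed returns the same set.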

plot_validation_likelihood.py

Plots the validation likelihood against the training round (requires the validation.sum.tab file of each type).

benchmarking


Contains the scripts used to run model training for each program, along with their qsub commands.

TSS_prediction


/bin/segway-train-scripts/

Contains the scripts and templates used to generate the Segway training scripts for k=1-5 components, with 10 random starts per component count.

/bin/segway-identify-scripts/

Contains the scripts and templates used to generate the Segway identify scripts corresponding to the training runs in /bin/segway-train-scripts/.

/TSS_prediction_analysis/

Contains the scripts needed to reproduce the TSS prediction analysis in the appnote (i.e. filtering the GENCODE GTF, generating the truth file from ENCODE data, and automatically extracting the top precision and the corresponding recall for each of the 10 random starts at each of the 5 component counts).
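The final metric-extraction step can be sketched in Python as below. The data layout is an assumption: `results` is taken to map a (components, random start) pair to a list of (precision, recall) points, and both function names are hypothetical rather than taken from the analysis scripts.

```python
def top_precision(points):
    """Return the (precision, recall) pair with the highest precision.

    Ties on precision are broken in favour of the higher recall.
    """
    return max(points, key=lambda pr: (pr[0], pr[1]))


def summarize(results):
    """Map each (components, start) key to its top-precision point."""
    return {key: top_precision(points) for key, points in results.items()}
```

With 5 component counts and 10 random starts, `summarize` would return 50 entries, one (precision, recall) pair per training run.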