This repository contains the scripts to reproduce the analyses in the Segway 2.0 appnote.
Please ensure you have the latest version of Segway installed.
General pipeline for GMM
Run Segway > run parse_bedgraph_missing_data.sh on the data track and the Segway annotation > run the main_extract_*.py scripts to extract data points and model information > calculate the KS test statistic and generate plots using the generate_*.sh scripts and templates
The train/identify scripts here are the commands used to run Segway.
Used to fill in missing data in the bedGraph tracks (both data and segmentation) with values of -1, which are discarded later.
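
The gap-filling step described above can be sketched in Python as follows. This is a minimal illustration, not the repository's shell script: the function name `fill_missing`, the tuple layout, and the explicit region bounds are all assumptions made for the example.

```python
def fill_missing(intervals, region_start, region_end, fill_value=-1.0):
    """Return intervals covering [region_start, region_end) with gaps filled.

    `intervals` is a list of (start, end, value) tuples for one chromosome,
    sorted by start and non-overlapping (half-open, as in bedGraph).
    The function name and argument layout are hypothetical.
    """
    filled = []
    cursor = region_start
    for start, end, value in intervals:
        if start > cursor:
            # Gap before this interval: mark it as missing.
            filled.append((cursor, start, fill_value))
        filled.append((start, end, value))
        cursor = max(cursor, end)
    if cursor < region_end:
        # Trailing gap up to the end of the region.
        filled.append((cursor, region_end, fill_value))
    return filled


track = [(10, 20, 0.5), (30, 40, 1.2)]
filled_track = fill_missing(track, 0, 50)
```

Filling both the data track and the segmentation track this way gives them the same coverage, so the -1 placeholders can be dropped consistently in later steps.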
Python scripts used to extract the data points under each label from the bedGraph tracks, and the corresponding Gaussians/weights from the Segway/GMTK params files.
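
As a rough sketch of what "extracting data points under each label" can look like, the snippet below sweeps over a data track and a segmentation track and groups the data values by label. It assumes both tracks are sorted lists of half-open (start, end, value) tuples for one chromosome with missing entries marked as -1, and it records one value per overlapping piece; the names `values_by_label` and `MISSING` are hypothetical and the real extraction scripts may work differently.

```python
from collections import defaultdict

MISSING = -1  # placeholder value, assumed from the fill-in step


def values_by_label(data, segmentation, missing=MISSING):
    """Group data values by the segment label they fall under.

    Both inputs are sorted, non-overlapping lists of (start, end, value)
    tuples; in `segmentation` the value is the label. Entries equal to
    `missing` in either track are skipped.
    """
    grouped = defaultdict(list)
    i = j = 0
    while i < len(data) and j < len(segmentation):
        d_start, d_end, value = data[i]
        s_start, s_end, label = segmentation[j]
        # The two intervals overlap if the intersection is non-empty.
        if min(d_end, s_end) > max(d_start, s_start):
            if value != missing and label != missing:
                grouped[label].append(value)
        # Advance whichever interval ends first.
        if d_end <= s_end:
            i += 1
        else:
            j += 1
    return dict(grouped)


data_track = [(0, 10, -1), (10, 20, 0.5), (20, 30, 0.7), (30, 40, 1.2)]
label_track = [(0, 20, 2), (20, 40, 9)]
by_label = values_by_label(data_track, label_track)
```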
Libraries used by the extraction Python scripts.
Bash scripts used to generate the commands that produce the various plots/statistics from the template scripts (there are 20 component-label cases in total).
Template script to produce the KS test statistic for a given component number and label.
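
For orientation, a one-sample KS statistic against a Gaussian mixture can be computed as in the sketch below. It is an illustration under stated assumptions rather than the template script itself: it uses only the standard library, takes standard deviations (a params file may store variances instead, which would need a square root first), and the names `mixture_cdf` and `ks_statistic` are hypothetical.

```python
import math


def mixture_cdf(x, weights, means, stds):
    """CDF of a 1-D Gaussian mixture evaluated at x."""
    return sum(
        w * 0.5 * (1.0 + math.erf((x - m) / (s * math.sqrt(2.0))))
        for w, m, s in zip(weights, means, stds)
    )


def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF of `samples` and `cdf`."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x,
        # so both sides of the jump are compared with the model.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


samples = [-1.2, -0.3, 0.1, 0.4, 1.5]
d_stat = ks_statistic(
    samples,
    lambda x: mixture_cdf(x, [0.6, 0.4], [0.0, 1.0], [1.0, 0.5]),
)
```

A smaller statistic means the fitted mixture tracks the empirical distribution of the data points under that label more closely.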
Template script to produce the combined theoretical/empirical histogram for a given component number and label.
Script to produce the combined QQ plot for the best label in each case (in the appnote analysis, label 9 for the 3-component case and label 2 for the 1-component case).
The train scripts here are the commands used to run Segway.
Script used to generate the random include regions for the fixed case (requires bedtools)
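
The script itself relies on bedtools. Purely to illustrate the idea of drawing random fixed-length regions, here is a stand-in sketch in plain Python; it is not the repository's method, the chromosome sizes are made up, and the function name `random_regions` is hypothetical. Sampled regions may overlap one another.

```python
import random


def random_regions(chrom_sizes, n_regions, length, seed=0):
    """Sample `n_regions` BED-style (chrom, start, end) regions of `length`.

    `chrom_sizes` maps chromosome name to size; every chromosome must be
    at least `length` long. Chromosomes are chosen in proportion to the
    number of valid start positions they offer.
    """
    rng = random.Random(seed)
    chroms = list(chrom_sizes)
    weights = [chrom_sizes[c] - length + 1 for c in chroms]
    regions = []
    for _ in range(n_regions):
        chrom = rng.choices(chroms, weights=weights)[0]
        start = rng.randrange(chrom_sizes[chrom] - length + 1)
        regions.append((chrom, start, start + length))
    return sorted(regions)


sizes = {"chr1": 1000, "chr2": 500}  # illustrative sizes, not real genome data
regions = random_regions(sizes, n_regions=5, length=100, seed=1)
```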
Plots the validation likelihood vs. round (requires the validation.sum.tab of each type).
Contains the scripts used to run the training model for each program (and their qsub commands).
Contains the scripts and templates used to generate training scripts for Segway for k=1-5 components and 10 random starts per component.
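
Generating one script per (component count, random start) pair from a template can be sketched as below. The template text, the file naming scheme, and the function name `render_scripts` are all hypothetical; the actual Segway command line is deliberately left as a placeholder comment rather than guessed.

```python
from string import Template

# Hypothetical template; the real templates live in this directory.
TEMPLATE = Template(
    "#!/usr/bin/env bash\n"
    "# components=${components} random_start=${start}\n"
    "# (the actual Segway train command would go here)\n"
)


def render_scripts(components=range(1, 6), starts=range(1, 11)):
    """Return {filename: script text} for every (k, start) combination."""
    scripts = {}
    for k in components:
        for s in starts:
            name = f"train_k{k}_start{s}.sh"
            scripts[name] = TEMPLATE.substitute(components=k, start=s)
    return scripts


scripts = render_scripts()
```

With k=1-5 and 10 random starts per component count, this yields 50 scripts in total.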
Contains the scripts and templates used to generate identify scripts for Segway corresponding to the training runs in /segway-train-scripts/.
Contains the scripts necessary to reproduce the TSS prediction analysis in the appnote (i.e., GENCODE GTF filtering, generating the truth file from ENCODE data, and automatically extracting the top precision and corresponding recall for each of the 10 random starts for all 5 component counts).