# Human Genome Variation and the Concept of Genotype Networks

## Description

This repository contains code, datasets and results from the analysis presented in:

Dall'Olio GM, Bertranpetit J, Wagner A, Laayouni H.
Human Genome Variation and the Concept of Genotype Networks.
Available at http://arxiv.org/abs/1309.0657


It also includes Python and R code for representing a set of genotype networks and analysing their properties.

## Project outline, TO-DO list and bugs

The project proposal and the TO-DO list are kept in separate Trello boards.

Bugs are usually posted to the TO-DO Trello board, in one of the TO-DO columns.

## Parameters and Running

The whole pipeline is encoded in the form of a Rakefile, included in the main directory. All the tasks are described in this file; to see a list of all possible tasks, type:

    $: rake -T
    rake boxplots_SNPannotations                     # Plot boxplots of values distribution, distinguish coding/noncoding SNPs
    rake check_hub                                   # check the UCSC hub
    rake convert_to_binary                           # Convert filtered vcf files to binary strings
    rake count_features                              # Count number of CODING-NONCODING-LossOfFunction SNPs per window
    rake filter_vcf                                  # Filter VCF files, removing indels, unphased, and applying maf filter
    rake generate_hub                                # Generate UCSC hub
    rake get_gene_coordinates                        # Get the gene coordinates from UCSC server
    rake get_genotypes                               # Get the genotypes from 1000genomes
    rake get_pvalues                                 # Get p-values from simulations
    rake help                                        # show Help
    rake launch_huge_sims                            # Launch Huge Sample Size simulations and convert them to binary format, on a Grid Engine Cluster
    rake launch_sims                                 # Launch cosi simulations and convert them to binary format, on a Grid Engine Cluster
    rake main                                        # Main task.
    rake merge_annotations                           # Merge Annotations and Results
    rake merge_network_properties                    # Merge network properties and gene coordinates in a single file
    rake network_properties_report                   # Convert all binary files to binary graphs, save them to graphml, and generate a report
    rake plot_multiple_samplesize                    # Plots of network properties distribution for different sample sizes
    rake plot_multiple_windows                       # Plots of network properties distribution for different definitions of window size
    rake random_sims                                 # Launch random simulations
    rake random_simulations_background               # Get background from random simulations
    rake random_simulations_background_bysamplesize  # Get background from random simulations by sample size
    rake report_distance_definition                  # Report on how many components are formed for every possible value of distance_definition
    rake report_multiple_distances                   # Launch generate_report script for many different values of distance
    rake report_multiple_samplesize                  # Report of network properties by varying sample size
    rake simulations                                 # Run Simulations
    rake simulations_background                      # Get background from all simulations
    rake simulations_to_binary                       # Convert all simulations to binary format
    rake simulations_to_binary_cluster               # Convert all simulations to binary format, on the cluster
    rake stats                                       # generate activity and punchcard plots for this repository
    rake test                                        # Execute doctests and unittests
    rake test_pvalue                                 # Test the algorithm to calculate p-values
    rake touchy                                      # Utility task to touch all files and avoid executing the pipeline for small changes
    # Note: The actual output may change in future versions

To run the whole pipeline, you will first need to define a main parameters file and a list of genes file (described below). Then, execute everything by invoking the main rule in the Rakefile:

    $: rake main
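Several of the tasks listed above work on genotypes encoded as binary strings and report how many components a network has for a given `distance_definition`. The following is a minimal Python sketch of that idea, not the code used in this repository. It assumes that two genotypes are linked when their Hamming distance is at most a threshold; the names `hamming`, `count_components` and `max_distance` are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def count_components(genotypes, max_distance: int = 1) -> int:
    """Count connected components when genotypes within max_distance are linked.

    Assumption: an edge joins two distinct genotypes whose Hamming distance
    is <= max_distance. The pipeline's own definition may differ.
    """
    nodes = sorted(set(genotypes))
    parent = {g: g for g in nodes}

    def find(g):
        # Union-find lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in combinations(nodes, 2):
        if hamming(a, b) <= max_distance:
            parent[find(a)] = find(b)

    return len({find(g) for g in nodes})


# Toy data, invented for this example.
observed = ["0000", "0001", "0011", "1100", "1110"]
for d in (1, 2, 3):
    print(d, count_components(observed, max_distance=d))
```

Raising the threshold can only merge components, so the count never increases as `max_distance` grows.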


### Main Parameters File

The main parameters file contains all the options and definitions needed for the analysis.

It is a YAML file. Example of a parameters file:

    pathway:
        pathway_name: 'all_pathways'

    data:
        populations:
            - 'EUR'
            - 'ASN'
            - 'AMR'
            - 'AFR'

    genotype_filtering:
        maf: 0.01
        minGQ: 0.00
        minDP: 1


By default, when you invoke the main Rakefile rule, it reads the parameters from the file params/default.yaml. To use a different file, change the value of the variable params_file in the Rakefile.
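The `maf` option in the example parameters is a minor allele frequency threshold. As a rough illustration of what such a threshold does, here is a minimal Python sketch that applies it to haplotypes encoded as binary strings. This is not the pipeline's filter, which works on VCF files. The function names are hypothetical, and treating the threshold as inclusive is an assumption.

```python
def minor_allele_frequency(column: str) -> float:
    """MAF of one biallelic site given as a string of '0'/'1' alleles."""
    if not column:
        raise ValueError("empty site")
    alt = column.count("1") / len(column)
    return min(alt, 1.0 - alt)


def filter_sites_by_maf(haplotypes, maf: float):
    """Keep only the sites whose MAF is at least `maf`.

    Assumption: the threshold is inclusive (MAF >= maf is kept).
    """
    n_sites = len(haplotypes[0])
    # Transpose: one string per site, one character per haplotype.
    columns = ["".join(h[i] for h in haplotypes) for i in range(n_sites)]
    keep = [i for i, col in enumerate(columns)
            if minor_allele_frequency(col) >= maf]
    return ["".join(h[i] for i in keep) for h in haplotypes]


# Toy data, invented for this example: 4 haplotypes, 4 sites.
haplotypes = ["0010", "0110", "0011", "0110"]
print(filter_sites_by_maf(haplotypes, maf=0.25))
```

In this toy input the first and third sites are monomorphic, so only the second and fourth sites survive the filter.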

### List of genes file

You also need to provide a file specifying which genes you want to analyse.

The path to this file can be defined in the params_file described above, or inferred automatically from the value of the pathway_name variable.

The format of this file is simple: just three columns, one for the gene names (they must match the refGene table in UCSC), and two for pathway classification.

Example of genes list file:

    #gene   pathway            subpathway
    ALG3    N-glycosylation    precursor_biosynthesis
    ALG10B  N-glycosylation    precursor_biosynthesis
    DPAGT1  N-glycosylation    precursor_biosynthesis
    ALG14   N-glycosylation    precursor_biosynthesis
    ALG6    N-glycosylation    precursor_biosynthesis


Note that all lines beginning with a '#' are ignored.
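To make the format concrete, here is a minimal Python sketch of a reader for such a file. It is not the parser used by the pipeline. It assumes the columns are separated by whitespace, and the names `GeneEntry` and `parse_genes_list` are hypothetical.

```python
from typing import List, NamedTuple


class GeneEntry(NamedTuple):
    gene: str
    pathway: str
    subpathway: str


def parse_genes_list(lines) -> List[GeneEntry]:
    """Parse a genes list; assumes three whitespace-separated columns."""
    entries = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines (those beginning with '#').
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"expected 3 columns, got {len(fields)}: {line!r}")
        entries.append(GeneEntry(*fields))
    return entries


example = """\
#gene   pathway            subpathway
ALG3    N-glycosylation    precursor_biosynthesis
ALG10B  N-glycosylation    precursor_biosynthesis
"""
genes = parse_genes_list(example.splitlines())
print([g.gene for g in genes])
```

To read a real file, pass an open file object instead of `example.splitlines()`.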

### Getting Gene coordinates

The next step is to get the coordinates of the genes in the previous file. You can do it quickly by invoking the get_gene_coordinates rule in the Rakefile:

    $: rake get_gene_coordinates


The coordinates are taken from the hg19 "knownGene" table in UCSC. Please note that multiple versions of this table may exist for a single hg release, so record the date on which you downloaded the coordinates.

For a real analysis, it is also recommended to use the ucsc-fetch script (https://bitbucket.org/dalloliogm/ucsc-fetch/overview) to get the coordinates of all the regions and check that they are correct.

Note: A Rakefile rule to automate this task will be implemented soon.

### Executing all the rest

As explained above, once you have defined a params_file and a genes_file, the whole pipeline can be executed by typing rake on the command line.

## Availability

This repository is available at http://bitbucket.org/dalloliogm/genotype_space/overview

You may also want to check the repository https://bitbucket.org/dalloliogm/vcf2networks for a tool to easily calculate genotype networks from VCF files.

### Dependencies

This is an incomplete list of what you need to have installed for this pipeline to work:

Ruby:

• Rake
• YAML parser for Ruby (should be included in the default distribution)

Python:

R:

• R 2.15.1 or higher
• distr library (to calculate p-values from simulations)
• ggplot2 library > 0.9.2.1 (to generate plots)
• data.table
• plyr
• reshape
• yaml