Human Genome Variation and the Concept of Genotype Networks
This repository contains code, datasets and results from the analysis presented in:
Dall'Olio GM, Bertranpetit J, Wagner A, Laayouni H. Human Genome Variation and the Concept of Genotype Networks. Available at http://arxiv.org/abs/1309.0657
It also includes Python and R code that can be used to represent a set of genotype networks and analyse their properties.
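As a rough illustration of the idea (not the code in this repository, which lists igraph among its requirements), a genotype network can be sketched in plain Python. The sketch assumes that each genotype is a string of 0s and 1s and that two genotypes are neighbours when they differ at no more than max_distance positions. The function names are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def build_network(genotypes, max_distance=1):
    """Map each distinct genotype to the set of its neighbours.

    Assumption: genotypes are equal-length strings of '0' and '1'.
    """
    nodes = sorted(set(genotypes))
    edges = {node: set() for node in nodes}
    for a, b in combinations(nodes, 2):
        if hamming(a, b) <= max_distance:
            edges[a].add(b)
            edges[b].add(a)
    return edges


def count_components(edges):
    """Count connected components with a depth-first search."""
    seen = set()
    components = 0
    for start in edges:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(edges[node] - seen)
    return components
```

For example, count_components(build_network(["000", "001", "011", "110"])) counts how many separate components these four genotypes form. Raising max_distance loosens the definition of a neighbour.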
Project outline, TO-DO list and bugs
The Project Proposal and the TO-DO list are kept on separate Trello boards.
- Project Proposal board: https://trello.com/b/HxRUQmaM
- TO-DO list and tasks: https://trello.com/b/jdA7Ub7h
Bugs are usually posted to the TO-DO Trello board, in one of the TO-DO columns.
Parameters and Running
The whole pipeline is encoded in the form of a Rakefile, included in the main directory. All the tasks are described in this file; to see a list of all possible tasks, type:
$: rake -T
rake boxplots_SNPannotations # Plot boxplots of values distribution, distinguish coding/noncoding SNPs
rake check_hub # check the UCSC hub
rake convert_to_binary # Convert filtered vcf files to binary strings
rake count_features # Count number of CODING-NONCODING-LossOfFunction SNPs per window
rake filter_vcf # Filter VCF files, removing indels, unphased, and applying maf filter
rake generate_hub # Generate UCSC hub
rake get_gene_coordinates # Get the gene coordinates from UCSC server
rake get_genotypes # Get the genotypes from 1000genomes
rake get_pvalues # Get p-values from simulations
rake help # show Help
rake launch_huge_sims # Launch Huge Sample Size simulations and convert them to binary format, on a Grid Engine Cluster
rake launch_sims # Launch cosi simulations and convert them to binary format, on a Grid Engine Cluster
rake main # Main task.
rake merge_annotations # Merge Annotations and Results
rake merge_network_properties # Merge network properties and gene coordinates in a single file
rake network_properties_report # Convert all binary files to binary graphs, save them to graphml, and generate a report
rake plot_multiple_samplesize # Plots of network properties distribution for different sample sizes
rake plot_multiple_windows # Plots of network properties distribution for different definitions of window size
rake random_sims # Launch random simulations
rake random_simulations_background # Get background from random simulations
rake random_simulations_background_bysamplesize # Get background from random simulations by sample size
rake report_distance_definition # Report on how many components are formed for every possible value of distance_definition
rake report_multiple_distances # Launch generate_report script for many different values of distance
rake report_multiple_samplesize # Report of network properties by varying sample size
rake simulations # Run Simulations
rake simulations_background # Get background from all simulations
rake simulations_to_binary # Convert all simulations to binary format
rake simulations_to_binary_cluster # Convert all simulations to binary format, on the cluster
rake stats # generate activity and punchcard plots for this repository
rake test # Execute doctests and unittests
rake test_pvalue # Test the algorithm to calculate p-values
rake touchy # Utility task to touch all files and avoid executing the pipeline for small changes
# Note: The actual output may change in future versions
To run the whole pipeline, you first need to define a main parameters file and a list of genes file (both described below). Then, execute everything by invoking the main rule in the Rakefile:
$: rake main
Main Parameters File
The main parameters file contains all the options and definitions needed for the analysis.
It is defined as a YAML file. Example of parameters file:
pathway:
    pathway_name: 'all_pathways'
data:
    populations:
        - 'EUR'
        - 'ASN'
        - 'AMR'
        - 'AFR'
genotype_filtering:
    maf: 0.01
    minGQ: 0.00
    minDP: 1
By default, when you invoke the main Rakefile rule, it reads its parameters from the file params/default.yaml. To use a different file, change the value of the variable params_file in the Rakefile.
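Before launching the pipeline it can help to load the parameters file and check a few values by hand. The following is a minimal sketch using PyYAML's yaml.safe_load; it is not part of the pipeline. The helper names load_params and check_params are hypothetical, and the sketch assumes that populations sits under a data key and maf under a top-level genotype_filtering key.

```python
def load_params(path="params/default.yaml"):
    """Read a parameters file into a plain dict (requires PyYAML)."""
    import yaml  # imported here so check_params works without PyYAML
    with open(path) as handle:
        return yaml.safe_load(handle)


def check_params(params):
    """Return a list of problems found in a parsed parameters dict.

    Assumption: the key layout follows the example parameters file.
    """
    problems = []
    populations = params.get("data", {}).get("populations", [])
    if not populations:
        problems.append("data.populations is empty or missing")
    maf = params.get("genotype_filtering", {}).get("maf")
    # A minor allele frequency cannot exceed 0.5 by definition.
    if maf is None or not 0.0 <= maf <= 0.5:
        problems.append("genotype_filtering.maf should be between 0 and 0.5")
    return problems
```

Calling check_params(load_params()) returns an empty list when both checks pass.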
List of genes file
You also need to provide a file specifying which genes you want to analyse.
The path to this file can be defined in the params_file described above, or inferred automatically from the value of the pathway_name variable.
The format of this file is simple: just three columns, one for the gene names (they must match the refGene table in UCSC), and two for pathway classification.
Example of genes list file:
#gene pathway subpathway
ALG3 N-glycosylation precursor_biosynthesis
ALG10B N-glycosylation precursor_biosynthesis
DPAGT1 N-glycosylation precursor_biosynthesis
ALG14 N-glycosylation precursor_biosynthesis
ALG6 N-glycosylation precursor_biosynthesis
Note that all lines beginning with a '#' are ignored.
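The format described above can be read with a few lines of Python. This is only a sketch: the function name read_genes_list is hypothetical, and it assumes that columns are separated by whitespace (tabs or spaces) and that no field contains a space.

```python
def read_genes_list(lines):
    """Parse genes list lines into (gene, pathway, subpathway) tuples.

    Blank lines and lines beginning with '#' are skipped.
    """
    genes = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(
                "expected 3 columns, got %d: %r" % (len(fields), line)
            )
        genes.append(tuple(fields))
    return genes
```

Because the function accepts any iterable of lines, it can be given an open file handle directly, for example read_genes_list(open(path)) with path pointing to your genes list file.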
Getting Gene coordinates
The next step is to get the coordinates of the genes in the previous file. You can do it quickly by invoking the get_gene_coordinates rule in the Rakefile:
$: rake get_gene_coordinates
The coordinates are taken from the hg19 "knownGene" table in UCSC. Please note that multiple versions of this table may exist for a single hg release, so take note of the date when you downloaded the coordinates.
For a real analysis, it is also recommended to use the ucsc-fetch script ( https://bitbucket.org/dalloliogm/ucsc-fetch/overview ) to get the coordinates of all the regions and check that they are correct.
Note: A Rakefile rule to automate this task will be implemented soon.
Executing all the rest
As explained above, once you have defined a params_file and a genes_file, the whole pipeline can be executed by typing rake on the command line.
This repository is available at http://bitbucket.org/dalloliogm/genotype_space/overview
You may also want to check the repository https://bitbucket.org/dalloliogm/vcf2networks for a tool to easily calculate genotype networks from VCF files.
This is an incomplete list of what you need to have installed in order for this pipeline to work:
- yaml parser for rake (should be included in the default distro)
- python 2.6 or higher
- igraph library, version 0.6 ( http://igraph.sf.net ). NOTE: older 0.5.4 version is not compatible
- Python bindings for igraph, version 0.6 ( http://pypi.python.org/pypi/python-igraph )
- numpy library ( http://numpy.scipy.org/ )
- nosetests (only for testing) ( http://readthedocs.org/docs/nose/en/latest/ )
- PyYAML (should be included in the default distro for python 2.7, otherwise http://pypi.python.org/pypi/PyYAML/ )
- h5py (optional)
- R 2.15.1 or higher
- distr library (to calculate p-values from simulations)
- ggplot2 library > 0.9.2.1 (to generate plots)
- biomaRt (only for downloading certain SNP annotations)
- zoo (only for merging SNP annotations and results)
- sqldf (only for merging SNP annotations and results)
The pipeline also uses other tools (tabix, vcftools, cosi), but they are all included in the repo, in the ./bin folder. Future versions of the pipeline will allow you to specify the path to these tools and to use customized binaries.
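The Python requirements listed above can be checked quickly before running anything. The sketch below is not part of the pipeline; the names PYTHON_REQUIREMENTS and missing_modules are hypothetical. It only tests whether each module can be imported and does not verify version numbers.

```python
import importlib

# Import names for the Python requirements listed above.
PYTHON_REQUIREMENTS = {
    "igraph": "Python bindings for igraph",
    "numpy": "numpy library",
    "yaml": "PyYAML",
    "nose": "nosetests (only for testing)",
    "h5py": "h5py (optional)",
}


def missing_modules(names=None):
    """Return the module names that cannot be imported."""
    if names is None:
        names = sorted(PYTHON_REQUIREMENTS)
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing
```

Calling missing_modules() with no arguments checks every entry in PYTHON_REQUIREMENTS; an empty list means all of them were found.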