

Human Genome Variation and the Concept of Genotype Networks


This repository contains code, datasets and results from the analysis presented in:

Dall'Olio GM, Bertranpetit J, Wagner A, Laayouni H.
Human Genome Variation and the Concept of Genotype Networks.
Available at http://arxiv.org/abs/1309.0657

It also includes Python and R code that can be used to represent a set of genotype networks and analyse their properties.
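To give an idea of what "representing a genotype network" means, here is a minimal sketch in plain Python. It is not the code shipped in this repository: the function names (`hamming`, `build_network`, `count_components`) and the toy haplotypes are made up for illustration. It assumes that each genotype is a binary string of equal length and that two genotypes are connected when they differ in at most `max_distance` positions; the `max_distance` argument is only a guess at the role played by the distance definition mentioned in the Rakefile tasks.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length binary strings differ."""
    if len(a) != len(b):
        raise ValueError("genotypes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def build_network(genotypes, max_distance=1):
    """Return an adjacency dict linking distinct genotypes within max_distance."""
    nodes = sorted(set(genotypes))  # duplicates collapse into a single node
    adjacency = {g: set() for g in nodes}
    for a, b in combinations(nodes, 2):
        if hamming(a, b) <= max_distance:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def count_components(adjacency):
    """Count connected components with an iterative depth-first search."""
    seen = set()
    components = 0
    for start in adjacency:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(adjacency[node] - seen)
    return components


# Toy data: five observed haplotypes over four SNPs (made up for illustration).
observed = ["0000", "0001", "0011", "1100", "0001"]
network = build_network(observed)
print(len(network), count_components(network))
```

Raising `max_distance` in this sketch merges components, which is the kind of effect the distance-related report tasks appear to explore.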

Project outline, TO-DO list and bugs

The project proposal and TO-DO lists are kept on separate Trello boards.

Bugs are usually posted to the TO-DO Trello board, in one of the TO-DO columns.

Parameters and Running

The whole pipeline is encoded in the form of a Rakefile, included in the main directory. All the tasks are described in this file; to see a list of all possible tasks, type:

$: rake -T
rake boxplots_SNPannotations                     # Plot boxplots of values distribution, distinguish coding/noncoding SNPs
rake check_hub                                   # check the UCSC hub
rake convert_to_binary                           # Convert filtered vcf files to binary strings
rake count_features                              # Count number of CODING-NONCODING-LossOfFunction SNPs per window
rake filter_vcf                                  # Filter VCF files, removing indels, unphased, and applying maf filter
rake generate_hub                                # Generate UCSC hub
rake get_gene_coordinates                        # Get the gene coordinates from UCSC server
rake get_genotypes                               # Get the genotypes from 1000genomes
rake get_pvalues                                 # Get p-values from simulations
rake help                                        # show Help
rake launch_huge_sims                            # Launch Huge Sample Size simulations and convert them to binary format, on a Grid Engine Cluster
rake launch_sims                                 # Launch cosi simulations and convert them to binary format, on a Grid Engine Cluster
rake main                                        # Main task.
rake merge_annotations                           # Merge Annotations and Results
rake merge_network_properties                    # Merge network properties and gene coordinates in a single file
rake network_properties_report                   # Convert all binary files to binary graphs, save them to graphml, and generate a report
rake plot_multiple_samplesize                    # Plots of network properties distribution for different sample sizes
rake plot_multiple_windows                       # Plots of network properties distribution for different definitions of window size
rake random_sims                                 # Launch random simulations
rake random_simulations_background               # Get background from random simulations
rake random_simulations_background_bysamplesize  # Get background from random simulations by sample size
rake report_distance_definition                  # Report on how many components are formed for every possible value of distance_definition
rake report_multiple_distances                   # Launch generate_report script for many different values of distance
rake report_multiple_samplesize                  # Report of network properties by varying sample size
rake simulations                                 # Run Simulations
rake simulations_background                      # Get background from all simulations
rake simulations_to_binary                       # Convert all simulations to binary format
rake simulations_to_binary_cluster               # Convert all simulations to binary format, on the cluster
rake stats                                       # generate activity and punchcard plots for this repository
rake test                                        # Execute doctests and unittests
rake test_pvalue                                 # Test the algorithm to calculate p-values
rake touchy                                      # Utility task to touch all files and avoid executing the pipeline for small changes

# Note: The actual output may change in future versions

To run the whole pipeline, you first need to define a main parameters file and a genes list file (both described below). Then, execute everything by invoking the main rule in the Rakefile:

$: rake main

Main Parameters File

The main parameters file contains all the options and definitions needed for the analysis.

It is defined as a YAML file. Example of a parameters file:

    pathway_name: 'all_pathways'

        - 'EUR'
        - 'ASN'
        - 'AMR'
        - 'AFR'

    maf: 0.01
    minGQ: 0.00
    minDP: 1

By default, when you invoke the main Rakefile rule, it reads its parameters from the file params/default.yaml. To use a different file, change the value of the variable params_file in the Rakefile.

List of genes file

You also need to provide a file specifying which genes you want to analyse.

The path to this file can be defined in the params_file described above, or inferred automatically from the value of the pathway_name variable.

The format of this file is simple: just three columns, one for the gene names (they must match the refGene table in UCSC), and two for pathway classification.

Example of genes list file:

#gene   pathway            subpathway
ALG3    N-glycosylation    precursor_biosynthesis
ALG10B  N-glycosylation    precursor_biosynthesis
DPAGT1  N-glycosylation    precursor_biosynthesis
ALG14   N-glycosylation    precursor_biosynthesis
ALG6    N-glycosylation    precursor_biosynthesis

Note that all lines beginning with a '#' are ignored.
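As an illustration of this format, the following Python sketch reads a genes list and skips the comment lines. It is not part of the pipeline: the function name `read_gene_list` is hypothetical, and the sketch assumes that columns are separated by whitespace and that no value contains a space (the README does not state which separator is used).

```python
import io


def read_gene_list(handle):
    """Parse a genes list: gene, pathway, subpathway; '#' lines are skipped."""
    genes = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no gene
        fields = line.split()  # assumption: whitespace-separated columns
        if len(fields) != 3:
            raise ValueError(f"expected 3 columns, got {len(fields)}: {line!r}")
        gene, pathway, subpathway = fields
        genes.append({"gene": gene, "pathway": pathway, "subpathway": subpathway})
    return genes


# Two rows taken from the example above, preceded by the header comment.
example = io.StringIO(
    "#gene   pathway            subpathway\n"
    "ALG3    N-glycosylation    precursor_biosynthesis\n"
    "ALG10B  N-glycosylation    precursor_biosynthesis\n"
)
genes = read_gene_list(example)
print([g["gene"] for g in genes])
```

With a real file, pass an open file handle instead of the `io.StringIO` object.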

Getting Gene coordinates

The next step is to get the coordinates of the genes in the previous file. You can do it quickly by invoking the get_gene_coordinates rule in the Rakefile:

$: rake get_gene_coordinates

The coordinates are taken from the hg19 "knownGene" table in UCSC. Please note that multiple versions of this table may exist for a single hg release, so take note of the date when you downloaded the coordinates.

For a real analysis, it is also recommended to use the https://bitbucket.org/dalloliogm/ucsc-fetch/overview script to get the coordinates of all the regions and check that they are correct.

Note: A Rakefile rule to automate this task will be implemented soon.

Executing all the rest

As explained above, once you have defined a params_file and a genes_file, the whole pipeline can be executed by typing rake on the command line.


This repository is available at http://bitbucket.org/dalloliogm/genotype_space/overview

You may also want to check the repository https://bitbucket.org/dalloliogm/vcf2networks for a tool to easily calculate genotype networks from VCF files.


This is an incomplete list of what you need to have installed for this pipeline to work:


  • Rake
  • yaml parser for rake (should be included in the default distro)



  • R 2.15.1 or higher
  • distr library (to calculate p-values from simulations)
  • ggplot2 library > (to generate plots)
  • data.table
  • plyr
  • reshape
  • yaml
  • biomaRt (only for downloading certain SNP annotations)
  • zoo (only for merging SNP annotations and results)
  • sqldf (only for merging SNP annotations and results)

Other tools:

The pipeline uses other tools (tabix, vcftools, cosi), but everything is included in the repo, in the ./bin folder. Future versions of the pipeline will allow you to specify the path to the tools and use customized binaries.
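Before launching the pipeline, it can be handy to check that the bundled tools are actually present. The Python sketch below is one possible way to do it and is not part of the repository. The function name `missing_tools` is hypothetical, and the sketch assumes that each executable in the bin folder is named exactly like its tool (tabix, vcftools, cosi), which may not hold for every tool.

```python
import os
import tempfile

# Assumption: each executable is named exactly like the tool it belongs to.
REQUIRED_TOOLS = ("tabix", "vcftools", "cosi")


def missing_tools(bin_dir, tools=REQUIRED_TOOLS):
    """Return the tools with no executable file of that name in bin_dir."""
    missing = []
    for tool in tools:
        path = os.path.join(bin_dir, tool)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(tool)
    return missing


# Demonstration on a throwaway directory holding one fake executable.
with tempfile.TemporaryDirectory() as tmp:
    fake = os.path.join(tmp, "tabix")
    with open(fake, "w") as handle:
        handle.write("#!/bin/sh\n")
    os.chmod(fake, 0o755)
    demo_missing = missing_tools(tmp)
    print(demo_missing)
```

To check a real checkout, call `missing_tools("./bin")` from the main directory of the repository.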