HTTPS SSH

Haplostrips

Version 1.0

Davide Marnetto & Emilia Huerta-Sanchez

Contact: davide.marnetto@unito.it

Usage

haplostrips  [OPTIONS]

Produces plots that depict variants in a genomic window among different samples. Visualizes similarities between haplotypes with respect to a reference haplotype through haplotype clustering and sorting, useful for revealing hidden population structure.


Variant Input

One option among -v, -V, -H or -s is mandatory. -v or -V mean input from a VCF file, e.g. 1000 genomes VCF (gzipped files are okay, for information about the format see https://samtools.github.io/hts-specs/VCFv4.2.pdf). More VCF files can be supplied to the tool, which is capable of merging them. If a file is supplied with the -V option, and its variants are not present in other VCFs, its reference alleles are used to infer missing data in other files; if we use -v, on the contrary, variations that are absent in other files are skipped. For example you can supply 2 files A B with:

  • -v A -v B : inner join (only sites present in all will be used)

  • -V A -v B : left join (all sites in A plus those in B that exist in A)

  • -V A -V B : outer join (all sites will be kept)

you can supply also an higher number of files with more combinations (-V A -v B -v c, -v A -v B -v C -v D, ...).

VCF filenames can have '#CHR' instead of the chromosome name : e.g. with /path/chr#CHR.variations.vcf.gz the program will choose the correct chromosome (if it exists in the path) given the interval.

The option -H means that input is from a haps file (a particular format generated by haplostrips). This format has 4 fixed fields: #CHROM, POS, REF, ALT, plus one field for each haplotype. The file has a header, containing the names of the haplotypes, encoded as [name of sample]_[1 or 2] (e.g. NA12890_1), where the _1 and _2 denotes the two chromosomes of diploid samples.

Lastly, -s means input from ms output (http://home.uchicago.edu/rhudson1/source/mksamples/msdir/msdoc.pdf).

An interval must be supplied with -i or -m if a VCF is used. If the input is a haps file, or ms output, haplostrips plots by default the entire region, although a sub-interval can still be defined. If a tabix (.tbi) index file is available for a VCF file, haplostrips will use the package pysam (http://pysam.readthedocs.io/en/latest/api.html) to retrieve your window faster.

Haplotypes must be phased so if you supply a VCF make sure that it is phased.

IMPORTANT: This tool is useful for inspecting local patterns. Selecting a region that is too wide can make the interpretation and haplotype clustering difficult, to the point where the plot looses its meaning. Consequently Haplostrips is not optimized for large regions and should NOT be used for windows containing more than 10K sites (e.g. less than 300kb if you are using 1000 genomes phase 3 VCFs). Around 1K sites tends to provide good resolution, though this depends on the nature of the region that will be plotted and on the SNP density of your dataset. Also be aware that RAM usage is dependent on the dimension of the chosen window.


Populations

Populations are defined by a metafile (poptable) supplied with -P. By default metadata for the human populations from the 1000 genomes phase 3 project are already set up (for information see http://www.1000genomes.org/faq/which-populations-are-part-your-study). The "AltaiNea" and "DenisovaPinky" individuals are included with population "Nea" and "Den", respectively, and superpopulation "Archaic".

The supplied file should have 3 columns:

  1. ID in VCF

  2. population

  3. superpopulation

This metafile can also reflect different types of informations, e.g. cases vs controls, domestic vs wild, phenotypes... in the population field, as the field is being used to merely define a color on the heatmap sidebar. The standard population colors are inspired by ColorBrewer (http://colorbrewer2.org/) and enhances the visibility of the first and the second plotted populations, which are ideally a reference population and an outgroup. Note however that the default palette has size 12. If you choose to plot more than 12 populations, then you need to define a bigger palette with the -C option, otherwise an error will be thrown.


Multi mode

With -m you can supply a MULTIFILE that contains the information for each plot, with each line representing an independent Haplostrips run. The interval coordinates, populations and outfile prefix must be defined by the MULTIFILE but the remaining options can still be defined by flag.

from the help option (-h):

if -m / --multi: optional MULTIFILE contains the info for each plot you want, and should contain 5 fields:
        1       chr
        2       start
        3       stop
        4       populations 
        5       outfile prefix (. for not defined)
        ...     other fields will not be used

ms input is not supported by -m.


Dependencies

Haplostrips uses Python and R, so you need to have them installed, and has been tested with the following versions: Python 2.7.6, 2.7.11, 2.7.13 R 3.1.0, 3.2.5, 3.3.1, 3.3.2 Older versions and Python 3 may cause problems. Notice that you can choose the R instance to be used from haplostrips changing the value to the R_PATH variable in the haplostrips.config file. To test if you have R and python installed it should be sufficient to run R and python in your shell: a new session should start indicating the software versions, otherwise install them following their documentation. Note that, especially on Windows, R and python might not be in your PATH environment variable: in this case you need to find where they are in the computer and run them using the full path.

On Linux and iOS, dependencies are installed by:make install. To check dependencies without installing haplostrips you can use make check_dep

To correctly run Haplostrips you will need: - Python pandas package version 0.17.0 or newer

If you want to use tabix indexing to retrieve windows faster: - Python pysam package Automatically used if it finds .tbi files, to avoid this please rename the .tbi files involved.

Dependencies need pip Python package manager to be installed. If you do not have it on your system or you are not sure please visit:

https://pip.pypa.io/en/stable/installing/

With apt-get (e.g. Ubuntu), you can just run:

sudo apt-get install python-pip

or, for Red Hat:

sudo yum install python-pip

Installation

Haplostrips runs with a text interface, on Linux, iOS and Windows. To see details about haplostrips on Windows see the paragraph "Haplostrips on Windows", otherwise follow these instructions.

  1. Download the repository from the download area or clone it

  2. Open a text interface, position yourself in the haplostrips directory that you downloaded (e.g cd downloads/haplostrips/)

  3. Run:

    sudo make install

it installs by default into /usr/local/bin. If you want to provide a path different than /usr/local/bin, maybe because you do not have root or sudo permissions, you should modify the variable BIN_DIR in the haplostrips.config file changing the path. E.g:

BIN_DIR=/my/favourite/path/bin

The custom path should be in your $PATH environment since it creates executables. Note that the source files are copied in /usr/local/src or /my/favourite/path/src. If a src directory is not available, it will be created. The needed dependencies are automatically installed following the user confirmation, and are described above. Notice that you might need root or sudoer permissions to install the dependencies too, depending on your local system.

You can use sudo make clean to remove all installed files, if needed.


Haplostrips on Windows

The following instructions are tested on Windows 10. Older versions may cause problems. Windows has compatibility issues with pysam and the way we use to decompress .gz files, so please decompress files by hand and do not rely on Tabix indexes. These issues might be resolved in future versions of haplostrips. If you have Windows:

  1. Please download the haplostrips_executable directory and uncompress it. It includes haplostrips.exe created with py2exe (http://py2exe.org/), plus several source files needed. This .exe does not need an installation of haplostrips or Python.

  2. If you do not have R, install it. You can obtain it from here: https://cran.r-project.org/bin/windows/base/ . Notice that you can choose the R instance to be used from haplostrips changing the value to the R_PATH variable in the haplostrips.config file.

  3. Still, is a text interface tool, so you will need to open Windows Command Prompt (searching for cmd or other ways).

  4. From the command prompt, position yourself in the directory containing haplostrips.exe. (e.g cd downloads\haplostrips)

  5. Launch it from the command prompt. You can try with: haplostrips.exe -h to see a complete help message and verify that haplostrips is working.


Options

For additional information: haplostrips -h

Options:

  • -h, --help show this help message and exit
  • D, --debug debug option, set traceback limit to 5

Input Options:

  • -v FILE, --vcf_inner=FILE Input from a VCF file. If more VCFs are supplied and variations are not present in other VCFs, they are skipped.

  • -V FILE, --vcf_full=FILE Input from a VCF file. If more VCFs are supplied and ariations are not present in other VCFs, assume that all the samples from other files have the reference allele found in this one.

  • -H FILE|"multi", --haps=FILE|"multi" input from haps file FILE, if -m / --multi is specified, "-H multi" causes the program to look for *.haps files in the directory, where * is the prefix given in the MULTIFILE or the default prefix.

  • -s FILE, --ms_in=FILE input from ms simulation output FILE.

  • -i CHR:START-STOP, --interval=CHR:START-STOP interval to be plotted, provided as CHR:START-STOP. CHR must match exactly a chromosome name in your input, be careful with notations (e.g. chr1:START-STOP vs 1:START-STOP)

  • -m FILE, --multi=FILE you can supply a MULTIFILE to create multiple plots based on the information on each row (1 plot per row). Incompatible with -i,-p,-F,-o

  • -P FILE, --poptable=FILE file that contains samples and their populations. Each individual is its own row with 3 columns: ID in the VCF, population,superpopulation. Populations metadata for DenisovaPinky, AltaiNean and the 1000 genomes phase 3 samples are already available by default.

Output Options:

  • -o OUTPREFIX, --out=OUTPREFIX prefix of the output file, by default parse the arguments and give a meaningful name.

  • -d, --dists_tab create a tab delimited file that contain the distribution of distances with respect to the reference haplotype (two columns: distance, count).

Variant Options:

  • -a, --anc_allele polarize sites when the ancestral allele is available and discards non-polarized sites (by default REF allele is used). Note that it takes the ancestral allele from the AA= flag of the VCF INFO field.

  • -G CUTOFF, --GQ=CUTOFF minimum genotype quality, defaults to 40. Is taken from the genotype fields of the VCF file if the FORMAT field includes GQ, otherwise this filter is skipped.

  • -M CUTOFF, --MQ=CUTOFF minimum mapping quality, defaults to 30. Filtering performed in the same way as GQ (-G).

Plot Options:

  • -c CUTOFF, --min_priv_maf=CUTOFF remove sites with a private minor allele frequency under CUTOFF in all populations independently, defaults to 0.05.

  • -p REFPOP,OTHERPOPS, --pops=REFPOP,OTHERPOPS Specify the populations to plot separated by commas: the first is the reference population (for sorting). A group of populations can be specified using the superpopulation name.

  • -r HAPLOTYPE, --ref=HAPLOTYPE define a haplotype to consider as the reference: This should be a string that matches a haplotype name in the haps file. This option overrides the order given with -p/--pops.

  • -S 0|1|2|3, --hapsort=0|1|2|3 choose the sorting method for the haplotypes: 0 = no sorting; 1 = clustering and sorting based on distance from the reference (default); 2 = sorting based on distance from the reference; 3 = sorting for population and clustering.

  • -T, --tree draw a haplotype clustering tree (only compatible with sorting option -S 1).

  • -C COL1,COL2,...,COLN, --colors=COL1,COL2,...,COLN define an alternative color palette in a color format accepted by R, eg. "red,blue,orange" or "#060633,#E6F0FF,#E41A1C". The order is the same as the populations supplied with -p.


Testing Examples

These examples are made to test the correct installation of haplostrips, and are not explained in detail. For Usage Examples see below. You can run these examples from the 'examples' directory downloadable at https://bitbucket.org/davide_meldon/haplostrips/downloads

haplostrips -v chr22.1000genomes.example.vcf.gz -V chr22.archaic_hominids.example.vcf.gz -i 22:19500000-19540000 -a -o example_output -p Nea,YRI,EUR

-v chr22.1000genomes.example.vcf.gz : use this VCF as input. If sites present in this file are missing in other files, skip them.

-V chr22.archaic_hominids.example.vcf.gz : use this VCF as input. Any missing allelic states in the resulting merged file will be imputed from this file's reference.

-i 22:19500000-19540000 : region to be plotted

-a : polarize alleles using the ancestral in the INFO fields

-o example_output: the output file will have this prefix

-p Nea,YRI,EUR: the populations to be included

haplostrips -s <(ms 50 1 -t 37 -r 37 40000 -I 2 25 25 0.0000001 -ej 0.023 2 1) -C "darkred,darkblue" -c 0

-s ... : use ms (http://home.uchicago.edu/rhudson1/source/mksamples/msdir/msdoc.pdf) output as input

-C "darkred,darkblue" : define a new colour palette

-c 0 : set minimum private maf to 0, meaning that all sites in the input will be plotted even if fixed or rare in all populations of interest

Example files

  • chr22.1000genomes.example.vcf.gz
  • chr22.archaic_hominids.example.vcf.gz
  • chr22.archaic_hominids.example.vcf.gz.tbi
  • multi_list.example
  • additional_population.example

Usage Examples

These examples are explained step by step to illustrate use cases of haplostrips, but need several input files which are not directly downloadable from this source. The scientific data behind them come from works cited by the haplostrips paper.

1. Selection in LCT region

Introduction

We applied haplostrips to a region under positive selection in Europeans and associated with lactase persistence (LP). The region is the most striking example of recent positive selection and is known to have low recombination rates. We want to observe the haplotype carrying the mutations that in Europeans predicts the LP.

Building the Command

  1. We take as input data a phased genotype VCF from the 1000 genomes project, that we previously downloaded and put in our working directory, so our first option will be -v ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz, note that since we have only this input -v or -V are exchangeable.

  2. Our region of interest is the LCT gene and its closest gene, MCM6, which harbours the mutations responsible for LP, so we add the corresponding interval in the human genome with -i 2:136545844-136633962.

  3. We want to observe the haplotype structure in Europe, but, for comparison purposes, we want to add an Asian (CHB) and an African (YRI) populations. See Populations above for a reference to the 1000 genomes populations. The corresponding option will be -p CEU,YRI,FIN,GBR,IBS,TSI,CHB.

  4. We decide to use a color scheme that suggests the commonality between Europeans, making all the populations from Europe green-blue-ish. The option to specify a color for each population in -p is -C '#038c8c','#e6f0ff','#3dbcd9','#012e40','#013f22','#00b560','#89040d'.

  5. We want the snps to be polarized for ancestral/derived, so we add the flag -a.

  6. To give the prefix LCT to all output files we add -o LCT.

Finally we launch our command, which should look like this: haplostrips -i 2:136545844-136633962 -v ALL.chr2.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -i 2:136545844-136633962 -p CEU,YRI,FIN,GBR,IBS,TSI,CHB -C '#038c8c','#e6f0ff','#3dbcd9','#012e40','#013f22','#00b560','#89040d' -a -o LCT

Discussion of Output

The plot in the output file "LCT.pdf" is LCT.jpg The top part of the plot shows a pattern of alleles that is nearly completely constant. This is a mark of reduced diversity, consistent with positive selection or a bottleneck. A thick group of rows like this represents a haplotype at high frequency, and looking at the populations color labels we appreciate that is common in all European populations, but at very low frequency in Africans. East Asians carry an almost exact copy of the Northern European haplotype at moderate frequencies. However 2 out of 3 sites where they differ are rs4988235 and rs182549, which have previously been associated with lactose intolerance. This suggests that the haplotype background of this variant site originated before the European-Asian split, whereas the associated allele arose more recently. Interestingly, the Italian Toscani population possess a different and more variable set of haplotypes, consistent with a higher incidence of lactose intolerance in this population.

2. Candidates of Adaptive Introgression

Introduction

In a work cited by the haplostrips paper, the 1000 Genomes dataset has been examined to identify regions that were likely subject to adaptive introgression. This scan of adaptive introgression did not include haplotype based information directly,therefore haplostrips allows us to inspect the haplotype patterns of the candidate loci to qualitatively confirm the archaic origin of the modern human haplotype. Below we give an example of one candidate locus (IFIH1) which has a striking archaic-derived haplotype pattern in American populations.

Building the Command

  1. We need as input data a phased genotype VCF from the 1000 genomes project, as above. Since we want to analyze potential cases of introgression from archaic hominins, this time we want to include their variation data in the plot. We have the needed data stored in a genotype VCF file called "AltaiNea_Den_combined.*.vcf.gz". So we will specify the two inputs and, since we want to be sure that archaic-specific sites divergent from the human reference are included in the plot, we will use a -V option for this input. This will assume that all modern humans harbour the reference allele in all SNPs present only in the archaic hominin dataset. Since we have several regions of interest in different chromosomes, and often these VCF files are chromosome specific, another important detail is to use the '#CHR' word in place of the chromosome symbol for the input files. If the files differ only for this part of the name (including, of course, their path), haplostrips will be able to choose for each window the right chromosome. So we have -v ALL.chr#CHR.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -V AltaiNea_Den_combined.#CHR.vcf.gz.

  2. 1000 genomes and archaic hominin populations are already understood by haplostrips by default. Let's assume that the second dataset we supply includes samples without any population metadata. We need to add them through the option -P which allows us to indicate a file, called AltaiNea_pops, that contains the following tab-delimited text:

AltaiNea        Nea     Archaic
DenisovaPinky   Den     Archaic

The first field corresponds to the sample names present in the header of the VCF that we used as input, followed by population and super-population that we will use to call them. The option will be -P AltaiNea_pops.

  1. We have several regions to investigate, each one of them is interesting in different populations, so we decide to use the "multi" option of haplostrips. With multiple windows this option is less time-consuming for the user. The description of windows, populations and output names (normally submitted with options -i, -p and -o respectively), will be provided in a file called AI_regions, that contains the following tab-delimited text:
2       163040001       163120000       Nea,YRI,PEL,MXL,PUR     FAP_IFIH1
9       16720001        16760000        Nea,YRI,EUR     BNC2
5       145480001       145520000       Nea,YRI,AMR     LARS
12      113360001       113400000       Nea,YRI,EUR     OAS1
11      120120001       120200000       Nea,YRI,EAS     POU2F3
10      90920001        90980000        Den,YRI,EAS     LIPA_CH25H_EAS
10      90920001        90980000        Den,YRI,AMR     LIPA_CH25H_AMR

and will be submitted with the option -m AI_regions. Note that we included continental populations that show a high score in our AI statistic and included YRI as a representative African population. Using a super-population name, as in these cases, will include all populations which are part of the super-population. Lastly, since we want to see the differences from the archaic hominins, we cited them as first population, or reference population.

  1. We want the SNPs to be polarized for ancestral/derived, so we add the flag -a.

Finally we launch our command, which should look like this:

haplostrips -v ALL.chr#CHR.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -V AltaiNea_Den_combined.#CHR.vcf.gz -m AI_regions -a

Discussion of Output

Let's discuss the plot of one of the windows of interest, contained in the file "FAP_IFIH1.pdf": FAP_IFIH1.jpg in this case it is interesting to see the plot with the number of differences between the reference and the sampled haplotypes: FAP_IFIH1.dist.jpg At the very top we have the reference population (Nea), which is composed of just 2 haplotypes coming from the only Neanderthal genome in our dataset.
Then, clustered and ordered for increasing distance from the reference, we appreciate the modern human haplotypes, subdivided in about three families. The first group of haplotypes, involving ~165 of them, is very similar to the Neanderthal ones and is abundant in Peruvians(PEL), present in Mexicans(MXL) and Puertoricans(PUR), and completely absent in Africans(YRI). The small distance to Neanderthal is visible also form the second plot, where we see the first ~165 haplotypes really close to the reference (< 10 SNP differences), followed by the rest at a much greater genetic distance (apart from 2 samples all are >60 SNP differences). We know that the North and South American populations sampled in the 1000G dataset are admixed and they have Native American, European and African ancestry. In our analysis, Peruvians are the population with the highest amount of Native American ancestry. What we observe seems then consistent with a story of introgression from Neanderthal to the ancestors of Native Americans, happened before their crossing of the Beringia strait, and that rose in frequency because of an adaptive advantage or a population bottleneck. Africans never carry a haplotype of this family, consistent with its Neanderthal origin, and indeed its similarity with the Neanderthal sequence is really striking.

3. Simulated Data

Introduction

We can also use haplostrips on simulated data. The benefit of this, is that we can actually see how different models of positive selection (e.g. selection on a de novo mutation) or neutrality affect the length, frequency of haplotypes and the haplotype similarity. Applying haplostrips to simulation data can help us qualify what models do and do not fit the patterns observed in real data because we can build intuition on how specific evolutionary models affect the sequences of variation. Here we have chosen to illustrate simulated data under a model of positive selection generated by the msms coalescent simulator with selection (Ewing G. and Hermisson J. (2010) Bioinformatics 26 : 2064-2065). on a new mutation. Specifically, we simulate a two population model using demographic parameters estimated in Gravel et al. [CITATION]. The new beneficial mutation arises in population 2 after the split with population 1. This is the msms command that has been used:

msms -N 10000 -ms 200 1 -I 2 100 100 0 -t 68 -r 68 100000 -g 2 191.541 -n 2 4.536972 -n 1 1.4474 -m 1 2 0.312 -m 2 1 0.312 -ema 0.023 2 x 6 6 x -en 0.023 2 0.1861 -ej 0.051 2 1 -en 0.148 1 0.731 -SFC -SI 0.015 2 0 0.000194976331884185 -Sc 0 1 0 0 0 -Sc 0 2 128.922648937441 128.922648937441 0 -Sp 0.8 -Smark > msms_out.txt

We repeated the simulation using an higher selection coefficient to the option -Sc: -Sc 0 2 469.607749511488 469.607749511488 0

Building the command

  1. the input is a simulation result with a ms format, named msms_out, so we use: -s msms_out

  2. we set the output name: -o msms_out

  3. By default all the populations are plotted: we have only 2 populations in our simulation and we might just omit the -p option. But we want to be sure that the typical haplotype of population 2 (where the selected allele is present) appears on top. So we set it as reference population using: -p 2,1. Notice that we do not need to give population metadata and that ms populations are named with progressive numbers.

Finally we have: haplostrips -s msms_out -o msms_out -p 2,1

We will launch it 2 times, one for each simulation, after modifying the selection coefficient.

Discussion of Output

Let's observe with haplostrips the result of the two simulations, with increasing selection coefficient:

msms_New_out1.jpg

msms_New_out3.jpg

We can see that the selected haplotype increases in frequency as a function of increasing selection strength in population 2.