HAllA tutorial

HAllA (Hierarchical All-against-All association) is a tool to find multi-resolution associations in high-dimensional, heterogeneous datasets. For a pair of datasets containing measurements that describe the same set of samples, Hierarchical All-against-All Association (HAllA) testing proceeds by 1) discretizing features to a uniform representation, 2) hierarchically clustering each dataset separately to generate two data hierarchies, 3) coupling clusters of equivalent resolution between the two data hierarchies, and 4) iteratively testing coupled clusters of increasing resolution for statistically significant association.

HAllA inputs. Data in scientific studies often come paired in the form of two high-dimensional datasets. The dataset X (with p features/rows and n samples/columns) is assumed to contain p predictor variables (or features) measured on n samples, which give rise to the d response variables contained in the dataset Y (with d features/rows and n samples/columns). Note that column i of X is sampled jointly with column i of Y, so that X and Y are aligned.
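To make the input layout concrete, here is a minimal Python sketch (not part of HAllA) that parses two small tab-delimited tables and checks that their sample columns line up. The toy data, the `#` corner label, and the function name `read_feature_table` are assumptions made for illustration.

```python
import csv
import io

def read_feature_table(text):
    """Parse tab-delimited text: a header row of sample IDs, then one feature per row."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    samples = rows[0][1:]  # the first header cell is a corner label, not a sample
    features = {row[0]: [float(v) for v in row[1:]] for row in rows[1:]}
    return samples, features

# Hypothetical toy data: X has p = 2 features, Y has d = 1 feature, n = 3 samples.
X_TEXT = "#\tS1\tS2\tS3\nX0\t0.1\t0.5\t0.9\nX1\t1.0\t2.0\t3.0\n"
Y_TEXT = "#\tS1\tS2\tS3\nY0\t5.0\t4.0\t3.0\n"

x_samples, x_features = read_feature_table(X_TEXT)
y_samples, y_features = read_feature_table(Y_TEXT)

# Column i of X must describe the same sample as column i of Y.
aligned = x_samples == y_samples
print(len(x_features), len(y_features), len(x_samples), aligned)
```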

HAllA output. HAllA reports significant associations between clusters of related features. Each association is characterized by a cluster from the first dataset, a cluster from the second dataset, and measures of the statistical significance and effect size of the association between the clusters (p-value, q-value, and similarity score).



Requirements


Installation

HomeBrew

You can install HAllA and other bioBakery tools automatically with HomeBrew for MacOS or LinuxBrew for Linux platforms.

$ brew tap biobakery/biobakery
$ brew install halla

This will also install all HAllA dependencies.

pip install

You can install HAllA automatically with pip.

$ pip install halla

This will install the latest version of HAllA and all its dependencies.

From Source

Alternatively, you can manually install HAllA from source and then manually install the dependencies.

Step 1: Download HAllA and unpack the software:

$ tar xzvf biobakery-halla-<versionid>.tar.gz
$ cd biobakery-halla-<versionid>/

Step 2: Install HAllA:

$ python setup.py install

Add the --user option if you do not have root install permissions.

Step 3: Install the HAllA dependencies.


How to run

Synthetic data

HAllA requires as input two tab-delimited text files representing two paired datasets that describe the same set of samples. Download the set of two files to get started on the tutorial (click on the link, then right-click the "Save as..." option on the preview page to download each file). This tutorial assumes you save the files in your Downloads directory (usually the default location).

These two files contain 16 normally-distributed features for 100 samples (all synthetic data). Cluster structure was spiked into each dataset, and some clusters were forced to be associated (for demonstration purposes).

Next, run HAllA on the two demo input files, placing the output files in your current working directory under synthetic_output:

$ cd ~/Downloads
$ halla -X X_16_100.txt -Y Y_16_100.txt --output synthetic_output

HAllA uses Spearman's rank correlation as the default similarity metric for continuous data; if at least one feature is categorical, it uses Normalized Mutual Information (NMI) as the similarity metric to compare features. If you would like to run with multiple cores, add the option --nproc. The --fdr option can be used to define the false discovery rate (FDR) procedure; "bh" refers to Benjamini-Hochberg FDR correction.
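For readers unfamiliar with Spearman's rank correlation, the following is a minimal pure-Python sketch of the textbook definition: rank both vectors (averaging tied ranks) and take the Pearson correlation of the ranks. It illustrates the metric only and is not HAllA's implementation; the function names are assumptions.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def spearman(a, b):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(a), average_ranks(b))

# A monotonic but non-linear relation still has a rank correlation of 1.
print(spearman([1, 2, 3, 4], [1, 4, 9, 16]))
```

Because only ranks matter, the metric captures any monotonic relation, which is why it suits continuous data.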

The halla command above creates three primary output files:

  • associations.txt
  • similarity_table.txt
  • hypotheses_tree.txt

Let's examine these files individually, starting with associations.txt:

$ column -t -s $'\t' synthetic_output/associations.txt | less -S

This yields:

association_rank  cluster1    cluster1_similarity_score  cluster2    cluster2_similarity_score  pvalue                  qvalue
1                 X12;X13     0.73301320528211289        Y12;Y13     0.74434573829531803        2.1564742078882269e-14  7.7633071483976174e-13
2                 X14;X15     0.76941176470588246        Y14;Y15     0.65138055222088831        1.5622528336072545e-13  2.8120551004930578e-12
3                 X9;X10;X11  0.54002400960384145        Y9;Y10;Y11  0.67788715486194473        7.8628592554387596e-12  9.4354311065265122e-11
4                 X6;X7;X8    0.64297719087635052        Y6;Y7;Y8    0.5583193277310925         9.5425798025534227e-07  8.5883218222980804e-06
5                 X0;X1;X2    0.63030012004801916        Y0;Y1;Y2    0.63087635054021607        1.7661674082593499e-06  1.2716405339467318e-05
6                 X3;X4;X5    0.56230492196878745        Y3;Y4;Y5    0.68139255702280921        4.0472115008008864e-06  2.4283269004805316e-05
7                 X15         1.0                        Y6          1.0                        0.00027968510022185388  0.0087101931211948768

This file reports associations between clusters, as described above.
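If you prefer to inspect the associations programmatically, a sketch along the following lines may help. It assumes the tab-delimited column names shown in the listing above; the two sample rows are abbreviated copies of ranks 1 and 7, and the helper names are hypothetical.

```python
import csv
import io

def load_associations(text):
    """Parse associations.txt content into a list of dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    records = []
    for row in reader:
        records.append({
            "rank": int(row["association_rank"]),
            "cluster1": row["cluster1"].split(";"),  # features are ';'-separated
            "cluster2": row["cluster2"].split(";"),
            "pvalue": float(row["pvalue"]),
            "qvalue": float(row["qvalue"]),
        })
    return records

def significant(records, max_q):
    """Keep associations whose q-value does not exceed max_q."""
    return [r for r in records if r["qvalue"] <= max_q]

# Abbreviated rows 1 and 7 from the listing above (digits truncated).
SAMPLE = (
    "association_rank\tcluster1\tcluster1_similarity_score\t"
    "cluster2\tcluster2_similarity_score\tpvalue\tqvalue\n"
    "1\tX12;X13\t0.7330\tY12;Y13\t0.7443\t2.156e-14\t7.763e-13\n"
    "7\tX15\t1.0\tY6\t1.0\t0.0002797\t0.008710\n"
)

records = load_associations(SAMPLE)
for r in significant(records, 1e-3):
    print(r["rank"], r["cluster1"], r["cluster2"], r["qvalue"])
```

To read the real file, pass `open("synthetic_output/associations.txt").read()` instead of `SAMPLE`.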


Now let's examine similarity_table.txt:

$ column -t -s $'\t' synthetic_output/similarity_table.txt | less -S

This yields:

#    Y12                Y13                Y14               Y15               Y6                Y7                Y8                 Y3
X12  0.851140456182     0.613829531813     0.0271308523409   -0.154477791116   0.00840336134454  -0.0623769507803  -0.226410564226    0.194525
X13  0.659927971188     0.840864345738     0.266458583433    0.0606482593037   -0.0209843937575  -0.0480672268908  -0.195774309724    0.033277
X0   -0.315342136855    -0.124321728691    -0.205378151261   -0.192412965186   -0.0327971188475  -0.253301320528   0.00523409363745   0.129219
X1   0.00792316926771   -0.0368307322929   -0.258967587035   -0.221032412965   -0.0811044417767  -0.0965666266507  -0.0492196878752   0.284417
X2   0.0295318127251    0.0791836734694    -0.0875390156062  -0.0306842737095  -0.0199279711885  -0.070156062425   0.0160864345738    0.110876
X3   -0.0569027611044   -0.187034813926    -0.0319327731092  -0.189339735894   -0.127779111645   -0.0739015606242  0.0584393757503    0.636590
X4   0.0462424969988    0.0257863145258    0.151692677071    0.0448019207683   0.0584393757503   -0.0641056422569  0.117599039616     0.477454
X5   0.116254501801     0.0631452581032    0.14612244898     0.154285714286    0.208547418968    0.121152460984    0.224297719088     0.309387
X9   0.121824729892     -0.0852340936375   0.0276110444178   -0.133733493397   -0.161296518607   -0.0427851140456  -0.0642016806723   0.311596
X10  0.0976230492197    -0.086962785114    0.0378871548619   -0.0967587034814  -0.0316446578631  -0.0776470588235  -0.00600240096038  -0.00792
X11  0.0653541416567    -0.00888355342137  0.0084993997599   -0.0721728691477  -0.0365426170468  0.0275150060024   0.0840816326531    0.096086
X14  0.0838895558223    0.361152460984     0.814549819928    0.610852340936    0.389771908764    0.143049219688    0.0766866746699    -0.33262
X15  0.0668907563025    0.258199279712     0.580984393758    0.825882352941    0.492533013205    0.316014405762    0.304009603842     -0.28720
X6   -0.00782713085234  0.0111884753902    0.295270108043    0.278175270108    0.658967587035    0.511356542617    0.291428571429     -0.38977
X7   -0.0905162064826   -0.0163745498199   0.0460504201681   0.148523409364    0.384585834334    0.629867947179    0.327442977191     -0.31303
X8   -0.138151260504    -0.0952220888355   0.0147418967587   0.205378151261    0.269339735894    0.547755102041    0.544489795918     -0.34386

This file contains pairwise similarity scores for all pairs of features from the first dataset and the second dataset.
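As a rough illustration of how this table can be queried, the sketch below loads it into a nested dictionary and collects the pairwise scores inside one block of clusters. The sample text copies four values from the listing above; the parsing assumes a tab-delimited file whose header row starts with `#`, and the function names are hypothetical.

```python
def load_similarity_table(text):
    """Return {x_feature: {y_feature: score}} from similarity_table.txt content."""
    lines = [line for line in text.splitlines() if line.strip()]
    y_names = lines[0].split("\t")[1:]  # skip the '#' corner cell
    table = {}
    for line in lines[1:]:
        cells = line.split("\t")
        table[cells[0]] = dict(zip(y_names, map(float, cells[1:])))
    return table

def block_scores(table, x_cluster, y_cluster):
    """List every pairwise score between two clusters of features."""
    return [table[x][y] for x in x_cluster for y in y_cluster]

# Four values copied from the X12/X13 rows and Y12/Y13 columns above.
SAMPLE = (
    "#\tY12\tY13\n"
    "X12\t0.851140456182\t0.613829531813\n"
    "X13\t0.659927971188\t0.840864345738\n"
)

table = load_similarity_table(SAMPLE)
scores = block_scores(table, ["X12", "X13"], ["Y12", "Y13"])
print(table["X13"]["Y12"], min(scores), max(scores))
```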


The --write-hypothesis-tree option of the halla command writes the hypothesis tree to the hypotheses_tree.txt file:

$ column -t -s $'\t' synthetic_output/hypotheses_tree.txt | less -S

This yields:

Level  Dataset 1                                              Dataset 2
0      X12;X13;X0;X1;X2;X3;X4;X5;X9;X10;X11;X14;X15;X6;X7;X8  Y12;Y13;Y14;Y15;Y6;Y7;Y8;Y3;Y4;Y5;Y9;Y10;Y11;Y0;Y1;Y2
1      X9;X10;X11                                             Y3;Y4;Y5
1      X9;X10;X11                                             Y12;Y13
1      X9;X10;X11                                             Y0;Y1;Y2
1      X9;X10;X11                                             Y9;Y10;Y11
1      X9;X10;X11                                             Y6;Y7;Y8
1      X9;X10;X11                                             Y14;Y15
1      X12;X13                                                Y3;Y4;Y5
1      X12;X13                                                Y12;Y13
1      X12;X13                                                Y0;Y1;Y2
1      X12;X13                                                Y9;Y10;Y11
1      X12;X13                                                Y6;Y7;Y8
1      X12;X13                                                Y14;Y15
1      X3;X4;X5                                               Y3;Y4;Y5
1      X3;X4;X5                                               Y12;Y13
1      X3;X4;X5                                               Y0;Y1;Y2
1      X3;X4;X5                                               Y9;Y10;Y11
1      X3;X4;X5                                               Y6;Y7;Y8
1      X3;X4;X5                                               Y14;Y15
1      X0;X1;X2                                               Y3;Y4;Y5
1      X0;X1;X2                                               Y12;Y13
1      X0;X1;X2                                               Y0;Y1;Y2
1      X0;X1;X2                                               Y9;Y10;Y11
1      X0;X1;X2                                               Y6;Y7;Y8
1      X0;X1;X2                                               Y14;Y15
1      X6;X7;X8                                               Y3;Y4;Y5
1      X6;X7;X8                                               Y12;Y13
1      X6;X7;X8                                               Y0;Y1;Y2
1      X6;X7;X8                                               Y9;Y10;Y11
1      X6;X7;X8                                               Y6;Y7;Y8
1      X6;X7;X8                                               Y14;Y15
1      X14;X15                                                Y3;Y4;Y5
1      X14;X15                                                Y12;Y13
1      X14;X15                                                Y0;Y1;Y2
1      X14;X15                                                Y9;Y10;Y11
1      X14;X15                                                Y6;Y7;Y8
1      X14;X15                                                Y14;Y15
2      X9                                                     Y3
2      X9                                                     Y5
... (output continues)

This file contains a comprehensive report of all testing performed during the HAllA run (not limited to the significant associations reported in associations.txt). Level 0 holds all the features; HAllA starts performing tests from level 1.


  • What is the pairwise similarity score between feature X9 and feature Y11 in association number 3? How does this score compare to the similarity score given for association number 3?
  • What is the pairwise similarity score between X15 and Y6? How does this compare to the strength of association considered above? How is this difference reflected in the p-value and q-value of the test for these two features?

Human gut microbiome versus host transcriptome in ulcerative colitis

Here we will consider subsets of a published dataset (Morgan et al., Genome Biology 2015) that combined 1) 16S rRNA amplicon sequencing of the human gut microbiome (64 taxa) and 2) Affymetrix microarray screens of colonic RNA expression (100 genes) across 204 patients with ulcerative colitis. We will refer to this as the "pouchitis dataset." The purpose of this study was to associate human genes and microbial taxa with the recurrence of inflammation following ileal resection surgery (a surgical procedure in ulcerative colitis that involves removing the large intestine and rectum and attaching the lowest part of the small intestine to a hole made in the abdominal wall to allow waste to leave the body).

Download the paired, subsampled OTU-gene datasets:

Run HAllA on these datasets:

$ halla -X otu_299.txt -Y gene_200.txt -o pouchitis_output -m spearman --header -q 0.05

Note the addition of the "-q" flag: this flag defines the target FDR, here 0.05, i.e. the expected fraction of false positive reports among returned significant associations. --header uses the header of the two datasets to find common columns (samples) and reorder them.
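The idea behind header-based alignment can be sketched in a few lines of Python: keep only the samples named in both headers and put both tables in the same column order. This is an illustration of the concept under assumed data structures (sample-name lists plus feature dictionaries), not HAllA's actual code; all names and values are hypothetical.

```python
def align_by_header(x_samples, x_rows, y_samples, y_rows):
    """Keep samples present in both headers, in X's order, for both tables."""
    y_index = {name: i for i, name in enumerate(y_samples)}
    shared = [name for name in x_samples if name in y_index]
    x_cols = [x_samples.index(name) for name in shared]
    y_cols = [y_index[name] for name in shared]
    x_aligned = {f: [vals[i] for i in x_cols] for f, vals in x_rows.items()}
    y_aligned = {f: [vals[i] for i in y_cols] for f, vals in y_rows.items()}
    return shared, x_aligned, y_aligned

# Hypothetical headers: S2 is missing from Y, S4 is missing from X,
# and the shared samples appear in a different order.
x_samples = ["S1", "S2", "S3"]
x_rows = {"OTU_a": [1.0, 2.0, 3.0]}
y_samples = ["S3", "S1", "S4"]
y_rows = {"GENE_a": [30.0, 10.0, 40.0]}

shared, x_aligned, y_aligned = align_by_header(x_samples, x_rows, y_samples, y_rows)
print(shared, x_aligned, y_aligned)
```

After alignment, position i in every feature vector refers to the same sample in both datasets, which is the pairing the tests rely on.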


Visualizing HAllA results

hallagram is a tool included with HAllA for visualizing the three output files we looked at in text-form above. Run hallagram as follows (use hallagram -h for help with plot options):

$ cd synthetic_output
$ hallagram similarity_table.txt hypotheses_tree.txt associations.txt --outfile hallagram.png

Please open the file hallagram_strongest_7.png, which should look like this:

[Figure: hallagram_strongest_7.png]

  • How many features are involved in the largest association in the figure? How many pairwise associations does this cluster association represent? Does it appear as though the pairwise associations are reasonably homogeneous in terms of their strength?
  • Are there any pairs of clusters with a significant negative association?
  • Do you think that HAllA's approach improved statistical power in this scenario? How would power be different if all X and Y features were compared individually?

Let's try some of the other hallagram options using the pouchitis dataset (gut OTUs and host gene expression):

$ cd pouchitis_output
$ hallagram similarity_table.txt hypotheses_tree.txt associations.txt --outfile hallagram.pdf --outfile hallagram.png --similarity Spearman --axlabels "Microbial OTUs" "Host transcripts" --strongest 50
  • --similarity names the similarity method used in this analysis in the legend.
  • --axlabels adds the X-axis and Y-axis labels.
  • --strongest limits the plot to the given number of strongest associations (50 in the command above), ordered by similarity score; --order-by can be used to order by pvalue or qvalue instead.
[Figure: hallagram_30.png]

  • How would you interpret association number 10?
  • How would your answers regarding the previous hallagram change here?

HAllA extensions

Pairwise association testing

HAllA uses a hierarchical approach for blockwise association testing between clusters at different levels of the two hierarchies. Naive pairwise all-against-all (AllA) association testing can be selected with -a AllA. The default is -a HAllA. Let's try the AllA approach on the synthetic data:

$ halla -X X_16_100.txt -Y Y_16_100.txt --output synthetic_output_alla -m spearman -a AllA

  • What has changed in the output files?
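Conceptually, the naive all-against-all approach scores every X feature against every Y feature, so p features and d features give p × d comparisons (16 × 16 = 256 for the synthetic data). The sketch below illustrates that loop; it uses Pearson correlation as a stand-in similarity and hypothetical toy vectors, and it is not HAllA's implementation.

```python
from itertools import product

def pearson(a, b):
    """Pearson correlation, used here only as a stand-in similarity metric."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def all_against_all(x_features, y_features, similarity):
    """Score every (X feature, Y feature) pair: p * d comparisons in total."""
    return {
        (x_name, y_name): similarity(x_vals, y_vals)
        for (x_name, x_vals), (y_name, y_vals)
        in product(x_features.items(), y_features.items())
    }

# Hypothetical toy data: 2 X features and 3 Y features over 4 samples.
x_features = {"X0": [1.0, 2.0, 3.0, 4.0], "X1": [4.0, 1.0, 3.0, 2.0]}
y_features = {
    "Y0": [2.0, 4.0, 6.0, 8.0],
    "Y1": [1.0, 0.0, 1.0, 0.0],
    "Y2": [8.0, 6.0, 4.0, 2.0],
}

scores = all_against_all(x_features, y_features, pearson)
print(len(scores), scores[("X0", "Y0")], scores[("X0", "Y2")])
```

Every one of these pairwise tests then enters the multiple-testing correction, which is the cost the hierarchical approach tries to reduce.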

Similarity metrics

HAllA is extensible to other similarity metrics. When normalized mutual information (NMI) is used, which is the default when the data include categorical features, HAllA discretizes the datasets first. We recommend the Spearman coefficient if all data are continuous, the number of samples is small, and you are looking for monotonic associations. Adjusted mutual information (AMI), maximal information coefficient (MIC), Pearson, distance correlation (dcor), and discretized maximal information coefficient (DMIC) are the other similarity metrics currently implemented in HAllA. The similarity metric is used both to build the hierarchical clusters and in HAllA's permutation tests. Select it with the -m $SIMILARITY_METRIC option on the HAllA command line. Let's try dcor as the similarity metric:

$ halla -X X_16_100.txt -Y Y_16_100.txt --output synthetic_output_dcor -m dcor

  • What has changed in the output files?
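To give a feel for what an information-based metric measures, here is a minimal pure-Python sketch of normalized mutual information between two discrete label vectors. Several normalizations exist; this sketch assumes division by the geometric mean of the two entropies, which may differ from the one HAllA uses. All function names are illustrative.

```python
from collections import Counter
from math import log, sqrt

def entropy(labels):
    """Shannon entropy (natural log) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two equal-length label sequences."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    # p(x,y) * log(p(x,y) / (p(x) p(y))), written with raw counts.
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )

def nmi(a, b):
    """NMI with geometric-mean normalization (an assumption of this sketch)."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 or h_b == 0:
        return 0.0  # a constant feature carries no information
    return mutual_information(a, b) / sqrt(h_a * h_b)

# Relabeling categories does not change NMI; independent labels score 0.
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]), nmi([0, 0, 1, 1], [0, 1, 0, 1]))
```

Unlike Spearman, this score does not depend on any ordering of the categories, which is why it applies to categorical and discretized data.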

Multiple tests correction (FDR) methods

Benjamini–Hochberg (BH), Benjamini–Yekutieli (BY), and Bonferroni are implemented as multiple-test correction methods in HAllA. BH is the default and can be changed with the --fdr option. Let's try Bonferroni:

$ halla -X X_16_100.txt -Y Y_16_100.txt --output synthetic_output_bonferroni -m spearman --fdr bonferroni

  • What has changed in the output files?
  • How do the results change in the AllA case with Bonferroni?
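To compare the two corrections on paper, the sketch below implements the standard textbook Bonferroni and Benjamini-Hochberg adjustments for a flat list of p-values. It is not HAllA's code, and HAllA's hierarchical testing may apply the correction differently; the p-values are hypothetical.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """BH adjusted p-values (q-values) for a flat list of tests."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values from four tests.
pvalues = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(pvalues))
print(benjamini_hochberg(pvalues))
```

Bonferroni scales every p-value by the full number of tests, while BH scales the k-th smallest by m/k, so BH is never more conservative than Bonferroni on the same list.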

Decomposition (representative) methods

The medoid of a cluster is used as its representative in association testing. Decomposition methods such as principal component analysis (PCA), multiple correspondence analysis (MCA), and independent component analysis (ICA) can be used instead. PCA and ICA can be used only with continuous data and when looking for monotonic relations. With PCA or ICA, the first component is used as the representative of a cluster. MCA can be used for both categorical and continuous data. Let's try PCA:

$ halla -X X_16_100.txt -Y Y_16_100.txt --output synthetic_output_pca -m spearman -d pca
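For comparison with PCA, the medoid representative can be sketched as follows: it is the cluster member with the smallest total distance to the other members, so it is always an actual feature rather than a derived component. The pairwise distances below are hypothetical, and the function is an illustration rather than HAllA's implementation.

```python
def medoid(names, dist):
    """Return the member with the smallest summed distance to the others."""
    return min(
        names,
        key=lambda a: sum(dist[frozenset((a, b))] for b in names if b != a),
    )

# Hypothetical within-cluster distances (for example, 1 - similarity).
cluster = ["X0", "X1", "X2"]
dist = {
    frozenset(("X0", "X1")): 0.2,
    frozenset(("X0", "X2")): 0.6,
    frozenset(("X1", "X2")): 0.3,
}

# Summed distances: X0 = 0.8, X1 = 0.5, X2 = 0.9, so X1 is the medoid.
print(medoid(cluster, dist))
```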

Power evaluation for different data types and similarity metrics

Let's start with mixed data and try different similarity metrics. The following command generates a pair of datasets with mixed (categorical and continuous) data of 32 features and 100 samples, with a uniform distribution and balanced clusters:

$ halladata -f 32 -n 100 -a mixed -d uniform -s balanced -o halla_data_f32_n100_mixed

First, try to run HAllA while specifying the similarity metric:

$ halla -X halla_data_f32_n100_mixed/X_mixed_32_100.txt -Y halla_data_f32_n100_mixed/Y_mixed_32_100.txt -o halla_output_f32_n100_mixed -m spearman

What happens and why?

Let the tool decide the similarity metric:

$ halla -X halla_data_f32_n100_mixed/X_mixed_32_100.txt -Y halla_data_f32_n100_mixed/Y_mixed_32_100.txt -o halla_output_f32_n100_mixed

What is the discretizing method, and what does it do?
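As background for this question, discretization turns a continuous feature into a small number of categories so that it can be compared with categorical features. One common scheme is equal-frequency binning, sketched below; this is an assumption chosen for illustration and may not be the method HAllA reports in its output.

```python
def equal_frequency_bins(values, n_bins):
    """Assign each value a bin label 0..n_bins-1 with roughly equal bin sizes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for position, i in enumerate(order):
        # The position in sorted order decides the bin; tied values may
        # therefore land in different bins in this simple sketch.
        labels[i] = position * n_bins // len(values)
    return labels

# Hypothetical continuous feature split into two bins (low / high).
print(equal_frequency_bins([0.9, 0.1, 0.5, 0.3], 2))
```

After binning, only the bin label is kept, so fine-grained differences within a bin no longer influence the similarity score.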

Now, let's generate synthetic continuous data with linear relations within and between the datasets, and see how the Spearman coefficient and normalized mutual information differ in the significant associations they report.

$ halladata -f 32 -n 50 -a line -d uniform -s balanced -o halla_data_f32_n50_line

Run HAllA with default similarity metric (will use spearman as all data are continuous):

$ halla -X halla_data_f32_n50_line/X_line_32_50.txt -Y halla_data_f32_n50_line/Y_line_32_50.txt -o halla_output_f32_n50_line_spearman

Run HAllA with NMI similarity metric (will discretize data):

$ halla -X halla_data_f32_n50_line/X_line_32_50.txt -Y halla_data_f32_n50_line/Y_line_32_50.txt -o halla_output_f32_n50_line_nmi -m nmi

Open the hallagrams from the last two runs and see how they differ. What do you think causes this?

