HAllA tutorial

HAllA (Hierarchical All-against-All association) is a tool to find multi-resolution associations in high-dimensional, heterogeneous datasets. For a pair of datasets containing measurements that describe the same set of samples, Hierarchical All-against-All Association (HAllA) testing proceeds by 1) discretizing features to a uniform representation, 2) hierarchically clustering each dataset separately to generate two data hierarchies, 3) coupling clusters of equivalent resolution between the two data hierarchies, and 4) iteratively testing coupled clusters of increasing resolution for statistically significant association.

HAllA inputs. Data in scientific studies often come paired in the form of two high-dimensional datasets. The dataset X (with p features/rows and n samples/columns) is assumed to contain p predictor variables (or features) measured on n samples, which give rise to d response variables contained in the dataset Y (with d features/rows and n samples/columns). Note that column i of X is sampled jointly with column i of Y, so that X and Y are aligned.

HAllA output. HAllA reports significant associations between clusters of related features. Each association is characterized by a cluster from the first dataset, a cluster from the second dataset, and measures of the statistical significance and effect size of the association between the clusters (p-value, q-value, and similarity score).



Requirements


Installation

Homebrew

You can install HAllA and other bioBakery tools automatically with Homebrew on macOS or Linuxbrew on Linux.

$ brew tap biobakery/biobakery
$ brew install halla

This will also install all HAllA dependencies.

pip install

You can install HAllA automatically with pip.

$ pip install halla

This will install the latest version of HAllA and all its dependencies.

From Source

Alternatively, you can manually install HAllA from source and then manually install the dependencies.

Step 1: Download HAllA and unpack the software:

$ tar xzvf biobakery-halla-<versionid>.tar.gz
$ cd biobakery-halla-<versionid>/

Step 2: Install HAllA:

$ python setup.py install

Add the --user option if you do not have root install permissions.

Step 3: Install the HAllA dependencies.


How to run

Synthetic data

HAllA requires as input two tab-delimited text files representing two paired datasets describing the same set of samples. Download the set of two files to get started on the tutorial (click on the link then right-click on the "Save as..." option on the preview page to download the files).

These two files contain 16 normally-distributed features for 100 samples (all synthetic data). Cluster structure was spiked into each dataset, and some clusters were forced to be associated (for demonstration purposes).

Next, run HAllA on the two demo input files, placing the output files in your current working directory under synthetic_output:

$ halla -X X_16_100.txt -Y Y_16_100.txt --output synthetic_output -m spearman

The "-m" option selects the similarity measure between features. This measure is used both to identify clusters within each dataset and to identify associations between clusters. Here, "spearman" indicates Spearman's rank correlation. To run on multiple cores, add the --nproc option. The --fdr option selects the false discovery rate (FDR) correction procedure; for example, "bh" refers to Benjamini-Hochberg FDR correction.

The command above creates three primary output files, which can also be downloaded from the links below:

Let's examine these files individually, starting with associations.txt:

$ less -S synthetic_output/associations.txt

This yields:

association_rank  cluster1    cluster1_similarity_score  cluster2    cluster2_similarity_score  pvalue                  qvalue                  similarity_score_between_clusters
1                 X12;X13     0.73301320528211289        Y12;Y13     0.74434573829531803        0.0                     0.0                     0.84086434573829538
2                 X14;X15     0.76941176470588246        Y14;Y15     0.65138055222088831        1.3073481596959805e-19  9.4129067498110597e-19  0.82588235294117651
3                 X9;X10;X11  0.61162064825930362        Y9;Y10;Y11  0.7375270108043217         0.0                     0.0                     0.79140456182472985
4                 X6;X7;X8    0.72830732292917166        Y6;Y7;Y8    0.64638655462184869        1.4285081274091766e-07  8.57104876445506e-07    0.6298679471788714
5                 X0;X1;X2    0.70861944777911168        Y0;Y1;Y2    0.6923889555822329         0.0                     0.0                     0.61767106842737096
6                 X3;X4;X5    0.64427370948379348        Y3;Y4;Y5    0.74597839135654265        0.0                     0.0                     0.60038415366146469
7                 X15         1.0                        Y6          1.0                        0.00028074428246968331  0.0087431790826272802   0.49253301320528209

This file reports associations between clusters, as described above.
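If you want to post-process associations.txt in a script, a minimal sketch along the following lines may help. It assumes the file is tab-delimited with the column names shown in the listing above; the helper name read_associations is hypothetical and not part of HAllA. Two rows from the listing are embedded so the example runs on its own.

```python
import csv
import io

# Two rows copied from the associations.txt listing above, re-typed as
# tab-separated text (assumption: the real file is tab-delimited).
SAMPLE = (
    "association_rank\tcluster1\tcluster1_similarity_score\tcluster2\t"
    "cluster2_similarity_score\tpvalue\tqvalue\t"
    "similarity_score_between_clusters\n"
    "4\tX6;X7;X8\t0.72830732292917166\tY6;Y7;Y8\t0.64638655462184869\t"
    "1.4285081274091766e-07\t8.57104876445506e-07\t0.6298679471788714\n"
    "7\tX15\t1.0\tY6\t1.0\t0.00028074428246968331\t"
    "0.0087431790826272802\t0.49253301320528209\n"
)


def read_associations(handle):
    """Parse an associations table into a list of dicts with typed fields.

    Hypothetical helper for illustration; it is not a HAllA function.
    """
    rows = []
    for rec in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "rank": int(rec["association_rank"]),
            "cluster1": rec["cluster1"].split(";"),
            "cluster2": rec["cluster2"].split(";"),
            "pvalue": float(rec["pvalue"]),
            "qvalue": float(rec["qvalue"]),
            "similarity": float(rec["similarity_score_between_clusters"]),
        })
    return rows


associations = read_associations(io.StringIO(SAMPLE))

# Keep only associations with more than one feature on each side.
blocks = [a for a in associations
          if len(a["cluster1"]) > 1 and len(a["cluster2"]) > 1]
for a in blocks:
    print(a["rank"], a["cluster1"], a["cluster2"], a["qvalue"])
```

To use it on a real run, replace io.StringIO(SAMPLE) with open("synthetic_output/associations.txt").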


Now let's examine similarity_table.txt:

$ less -S synthetic_output/similarity_table.txt

This yields:

#       Y0      Y1      Y2      Y12     Y13     Y14     Y15     Y6      Y7      Y8      Y3      Y4      Y5      Y9      Y10     Y11
X9      -0.298  -0.105  -0.141   0.122  -0.085   0.028  -0.134  -0.161  -0.043  -0.064   0.312   0.341   0.357   0.574   0.403   0.245
X10     -0.186  -0.033   0.004   0.098  -0.087   0.038  -0.097  -0.032  -0.078  -0.006  -0.008   0.051   0.052   0.551   0.791   0.638
X11     -0.109   0.045   0.025   0.065  -0.009   0.008  -0.072  -0.037   0.028   0.084   0.096   0.092  -0.030   0.381   0.559   0.708
X0       0.759   0.475   0.164  -0.315  -0.124  -0.205  -0.192  -0.033  -0.253   0.005   0.129   0.181   0.180  -0.069  -0.129   0.018
X1       0.486   0.618   0.250   0.008  -0.037  -0.259  -0.221  -0.081  -0.097  -0.049   0.284   0.238   0.283   0.150   0.058   0.204
X2       0.522   0.558   0.535   0.030   0.079  -0.088  -0.031  -0.020  -0.070   0.016   0.111   0.064   0.122   0.076   0.020   0.168
X3       0.048   0.034   0.012  -0.057  -0.187  -0.032  -0.189  -0.128  -0.074   0.058   0.637   0.433   0.342   0.101   0.036   0.009
X4       0.219   0.184   0.050   0.046   0.026   0.152   0.045   0.058  -0.064   0.118   0.477   0.600   0.549   0.216   0.144   0.093
X5       0.289   0.341   0.219   0.116   0.063   0.146   0.154   0.209   0.121   0.224   0.309   0.424   0.634   0.294   0.144   0.148
X12     -0.101  -0.009   0.081   0.851   0.614   0.027  -0.154   0.008  -0.062  -0.226   0.195   0.093   0.179   0.260   0.214   0.132
X13      0.021   0.038   0.160   0.660   0.841   0.266   0.061  -0.021  -0.048  -0.196   0.033  -0.020   0.093   0.212   0.096   0.065
X14      0.013  -0.093  -0.015   0.084   0.361   0.815   0.611   0.390   0.143   0.077  -0.333  -0.199  -0.169   0.033  -0.073  -0.103
X15     -0.086  -0.140  -0.175   0.067   0.258   0.581   0.826   0.493   0.316   0.304  -0.287  -0.109   0.014  -0.020  -0.110  -0.146
X6      -0.076  -0.016  -0.048  -0.008   0.011   0.295   0.278   0.659   0.511   0.291  -0.390  -0.099  -0.089   0.136   0.027  -0.011
X7      -0.082   0.064  -0.023  -0.091  -0.016   0.046   0.149   0.385   0.630   0.327  -0.313  -0.100  -0.019   0.228   0.038   0.085
X8      -0.054   0.023   0.006  -0.138  -0.095   0.015   0.205   0.269   0.548   0.544  -0.344  -0.163  -0.049   0.065   0.059   0.104

This file contains pairwise similarity scores for all pairs of features from the first dataset and the second dataset.
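Looking up individual scores by eye in a 16 x 16 table is error-prone. The sketch below shows one way to load the table into a dictionary keyed by (X feature, Y feature). It assumes a tab-delimited file whose first row starts with "#" followed by the Y feature names, as in the listing above; read_similarity_table is a hypothetical helper, and only a 3 x 3 corner of the table is embedded.

```python
import io

# A 3 x 3 corner of the similarity_table.txt listing above, re-typed as
# tab-separated text (assumption: the real file is tab-delimited).
SAMPLE = (
    "#\tY0\tY1\tY2\n"
    "X0\t0.759\t0.475\t0.164\n"
    "X1\t0.486\t0.618\t0.250\n"
    "X2\t0.522\t0.558\t0.535\n"
)


def read_similarity_table(handle):
    """Return {(x_name, y_name): score} from a similarity table.

    Hypothetical helper for illustration; it is not a HAllA function.
    """
    lines = [line.rstrip("\n").split("\t") for line in handle if line.strip()]
    y_names = lines[0][1:]  # the first cell of the header row is "#"
    table = {}
    for fields in lines[1:]:
        x_name, values = fields[0], fields[1:]
        for y_name, value in zip(y_names, values):
            table[(x_name, y_name)] = float(value)
    return table


table = read_similarity_table(io.StringIO(SAMPLE))
print(table[("X0", "Y0")])

# Pair with the largest absolute similarity in this excerpt.
strongest = max(table, key=lambda pair: abs(table[pair]))
print(strongest)
```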


The --write-hypothesis-tree option tells the halla command to write out the hypothesis tree, which is saved in the hypotheses_tree.txt file:

$ less -S synthetic_output/hypotheses_tree.txt

This yields:

Level   Dataset 1       Dataset 2
0       X9;X10;X11;X0;X1;X2;X3;X4;X5;X12;X13;X14;X15;X6;X7;X8           Y0;Y1;Y2;Y12;Y13;Y14;Y15;Y6;Y7;Y8;Y3;Y4;Y5;Y9;Y10;Y11
1       X9;X10;X11      Y0;Y1;Y2
1       X9;X10;X11      Y12;Y13
1       X9;X10;X11      Y9;Y10;Y11
1       X9;X10;X11      Y3;Y4;Y5
1       X9;X10;X11      Y6;Y7;Y8
1       X9;X10;X11      Y14;Y15
1       X12;X13         Y0;Y1;Y2
1       X12;X13         Y12;Y13
1       X12;X13         Y9;Y10;Y11
1       X12;X13         Y3;Y4;Y5
1       X12;X13         Y6;Y7;Y8
1       X12;X13         Y14;Y15
1       X3;X4;X5        Y0;Y1;Y2
1       X3;X4;X5        Y12;Y13
1       X3;X4;X5        Y9;Y10;Y11
1       X3;X4;X5        Y3;Y4;Y5
1       X3;X4;X5        Y6;Y7;Y8
1       X3;X4;X5        Y14;Y15
1       X0;X1;X2        Y0;Y1;Y2
1       X0;X1;X2        Y12;Y13
1       X0;X1;X2        Y9;Y10;Y11
1       X0;X1;X2        Y3;Y4;Y5
1       X0;X1;X2        Y6;Y7;Y8
1       X0;X1;X2        Y14;Y15
1       X6;X7;X8        Y0;Y1;Y2
1       X6;X7;X8        Y12;Y13
1       X6;X7;X8        Y9;Y10;Y11
1       X6;X7;X8        Y3;Y4;Y5
1       X6;X7;X8        Y6;Y7;Y8
1       X6;X7;X8        Y14;Y15
1       X14;X15         Y0;Y1;Y2
1       X14;X15         Y12;Y13
1       X14;X15         Y9;Y10;Y11
1       X14;X15         Y3;Y4;Y5
1       X14;X15         Y6;Y7;Y8
1       X14;X15         Y14;Y15
2       X9              Y0
2       X9              Y2
2       X9              Y1
2       X11             Y0
2       X11             Y2
2       X11             Y1
2       X10             Y0
2       X10             Y2
2       X10             Y1

This file contains a comprehensive report of all testing performed during the HAllA run (not limited to the significant associations reported in associations.txt). Level 0 holds all the features; HAllA starts performing tests from level 1.


  • What is the pairwise similarity score between feature X9 and feature Y11 in association number 3? How does this score compare to the similarity score given for association number 3?
  • What is the pairwise similarity score between X15 and Y6? How does this compare to the strength of association considered above? How is this difference reflected in the p-value and q-value of the test for these two features?

Human gut microbiome versus host transcriptome in ulcerative colitis

Here we will consider subsets of a published dataset (Morgan et al., Genome Biology 2015) that combined 1) 16S rRNA amplicon sequencing of the human gut microbiome (64 taxa) and 2) Affymetrix microarray screens of colonic RNA expression (100 genes) across 204 patients with ulcerative colitis. We will refer to this as the "pouchitis dataset." The purpose of this study was to associate human genes and microbial taxa with the recurrence of inflammation following ileal resection surgery (a surgical procedure in ulcerative colitis that involves removing the large intestine and rectum and attaching the lowest part of the small intestine to a hole made in the abdominal wall to allow waste to leave the body).

Download the paired, subsampled OTU-gene datasets:

Run HAllA on these datasets:

$ halla -X OTU_64.txt -Y Gene_100.txt -o pouchitis_output -m spearman --header -q 0.25

Note the addition of the "-q" flag, which defines the target FDR (here 0.25), i.e. the expected fraction of false positives among the reported significant associations. The --header option uses the header rows of the two datasets to find the common columns (samples) and reorder them so that they match.

The output files generated from the command above can be downloaded from the following links:

Inspect these files as we did above for the synthetic example.
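The --header behavior described above (finding common samples and reordering them) can be pictured with the following sketch. This is only an illustration of the idea on toy data, not HAllA's implementation; the function name align_by_header and the sample and feature names are hypothetical.

```python
def align_by_header(x_header, x_rows, y_header, y_rows):
    """Keep only samples present in both headers, in the order of x_header.

    x_rows and y_rows map a feature name to its list of per-sample values.
    Hypothetical helper for illustration; it is not a HAllA function.
    """
    y_index = {name: i for i, name in enumerate(y_header)}
    common = [name for name in x_header if name in y_index]
    x_cols = [x_header.index(name) for name in common]
    y_cols = [y_index[name] for name in common]
    x_aligned = {f: [vals[i] for i in x_cols] for f, vals in x_rows.items()}
    y_aligned = {f: [vals[i] for i in y_cols] for f, vals in y_rows.items()}
    return common, x_aligned, y_aligned


# Toy data: sample s2 is missing from Y, s4 is missing from X,
# and the shared samples appear in a different order.
x_header = ["s1", "s2", "s3"]
x_rows = {"OTU_a": [0.1, 0.2, 0.3]}
y_header = ["s3", "s1", "s4"]
y_rows = {"gene_a": [30.0, 10.0, 40.0]}

common, x_aligned, y_aligned = align_by_header(x_header, x_rows, y_header, y_rows)
print(common)               # ['s1', 's3']
print(x_aligned["OTU_a"])   # [0.1, 0.3]
print(y_aligned["gene_a"])  # [10.0, 30.0]
```

After alignment, column i of both tables refers to the same sample, which is the property HAllA's inputs are expected to have.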


Visualizing HAllA results

hallagram is a tool included with HAllA for visualizing the three output files that we examined in text form above. Run hallagram as follows (use hallagram -h for help with plot options):

$ cd synthetic_output
$ hallagram similarity_table.txt hypotheses_tree.txt associations.txt --outfile hallagram.png

Please open the file hallagram_strongest_7.png, which should look like this:

hallagram_strongest_7.png
  • How many features are involved in the largest association in the figure? How many pairwise associations does this cluster association represent? Does it appear as though the pairwise associations are reasonably homogeneous in terms of their strength?
  • Are there any pairs of clusters with a significant negative association?
  • Do you think that HAllA's approach improved statistical power in this scenario? How would power be different if all X and Y features were compared individually?

Let's try some of the other hallagram options using the pouchitis dataset (gut OTUs and host gene expression):

$ cd pouchitis_output
$ hallagram similarity_table.txt hypotheses_tree.txt associations.txt --outfile hallagram.pdf --outfile hallagram.png --similarity Spearman --axlabels "Microbial OTUs" "Host transcripts" --strongest 50
  • --similarity names, in the legend, the similarity method used in this analysis.
  • --axlabels adds an X-axis label and a Y-axis label.
  • --strongest 50 uses the 50 strongest associations, ordered by similarity score (--order-by can be used to order by pvalue or qvalue instead).
hallagram.png
  • How would you interpret association number 32?
  • How would your answers regarding the previous hallagram change here?

HAllA extensions

Pairwise association testing

HAllA uses a hierarchical approach for blockwise association testing between clusters at different levels of the hierarchies. Naive pairwise all-against-all (AllA) association testing can be selected with -a AllA. The default is -a HAllA. Let's try the AllA approach on the synthetic data:

$ halla -X X_16_100.txt -Y Y_16_100.txt --output synthetic_output_alla -m spearman -a AllA

  • What has changed in the output files?
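To get a feel for the size of the testing problem, the sketch below counts the feature pairs that a naive all-against-all comparison would consider for the synthetic data, and compares that with the number of level-1 cluster pairs listed in hypotheses_tree.txt earlier. It only counts candidate hypotheses; how many tests HAllA actually performs depends on how far it descends the hierarchies, which is not modeled here.

```python
from itertools import product

# Feature names of the synthetic datasets (16 features each).
x_features = [f"X{i}" for i in range(16)]
y_features = [f"Y{i}" for i in range(16)]

# All-against-all: every X feature is paired with every Y feature.
pairwise_tests = list(product(x_features, y_features))

# Level-1 clusters as they appear in the hypotheses_tree.txt listing.
x_clusters = ["X9;X10;X11", "X12;X13", "X3;X4;X5",
              "X0;X1;X2", "X6;X7;X8", "X14;X15"]
y_clusters = ["Y0;Y1;Y2", "Y12;Y13", "Y9;Y10;Y11",
              "Y3;Y4;Y5", "Y6;Y7;Y8", "Y14;Y15"]
block_tests = list(product(x_clusters, y_clusters))

print(len(pairwise_tests))  # 256
print(len(block_tests))     # 36
```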

Similarity metrics

HAllA supports multiple similarity metrics. By default, HAllA uses normalized mutual information (NMI), discretizing the datasets in order to compute it. We recommend the Spearman coefficient when all data are continuous, the number of samples is small, and you are looking for monotonic associations. Adjusted mutual information (AMI), maximal information coefficient (MIC), Pearson correlation, distance correlation (dcor), and discretized maximal information coefficient (DMIC) are the other similarity metrics currently implemented in HAllA. The chosen similarity metric is used both to build the hierarchical clusters and in HAllA's permutation tests. Select it on the command line with -m $SIMILARITY_METRIC. Let's try dcor as the similarity metric:

$ halla -X X_16_100.txt -Y Y_16_100.txt --output synthetic_output_dcor -m dcor

  • What has changed in the output files?
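The choice of metric matters because different metrics respond to different kinds of relationships. As a small illustration (plain Python, independent of HAllA's own code), the sketch below computes Pearson and Spearman correlation for a relationship that is monotonic but not linear. Spearman is simply Pearson applied to ranks, so it scores any strictly increasing relationship as 1.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# A strictly increasing but strongly non-linear relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [math.exp(v) for v in x]

print(round(spearman(x, y), 3))  # 1.0
print(round(pearson(x, y), 3))   # noticeably below 1
```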

Multiple tests correction (FDR) methods

Benjamini-Hochberg (BH), Benjamini-Yekutieli (BY), and Bonferroni are implemented as multiple testing correction methods in HAllA. BH is the default and can be changed with the --fdr option. Let's try bonferroni:

$ halla -X X_16_100.txt -Y Y_16_100.txt --output synthetic_output_bonferroni -m spearman --fdr bonferroni

  • What has changed in the output files?
  • How do the results change in the AllA case with Bonferroni?
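To see why the correction method affects how many associations are reported, the following sketch applies the textbook Bonferroni and Benjamini-Hochberg adjustments to a small, made-up list of p-values. It illustrates the standard procedures on a flat list of tests; HAllA applies its corrections within its hierarchical testing scheme, so its q-values may be computed differently.

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the rank decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for illustration only.
pvalues = [0.01, 0.04, 0.03, 0.20]
bonf = bonferroni(pvalues)
bh = benjamini_hochberg(pvalues)

threshold = 0.1
print(sum(q <= threshold for q in bonf))  # 1
print(sum(q <= threshold for q in bh))    # 3
```

Bonferroni controls the family-wise error rate and is therefore more conservative than BH, which controls the false discovery rate.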

Decomposition (representative) methods

The medoid of a cluster is used as its representative in association testing. Decomposition methods such as principal component analysis (PCA), multiple correspondence analysis (MCA), and independent component analysis (ICA) can be used instead. PCA and ICA can be used only with continuous data and when looking for monotonic relations. With PCA or ICA, the first component is used as the representative of a cluster. MCA can be used with both categorical and continuous data. Let's try pca:

$ halla -X X_16_100.txt -Y Y_16_100.txt --output synthetic_output_pca -m spearman -d pca
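For intuition about the medoid, the sketch below picks the cluster member with the highest total similarity to the other members. For a similarity-based distance, this is the same as the member with the smallest total distance. It uses made-up similarity values and a hypothetical medoid function; HAllA's own selection code may differ in detail.

```python
def medoid(names, similarity):
    """Return the member with the highest total similarity to the others.

    similarity maps frozenset({a, b}) to the similarity of features a and b.
    Hypothetical helper for illustration; it is not a HAllA function.
    """
    def total_similarity(name):
        return sum(similarity[frozenset((name, other))]
                   for other in names if other != name)
    return max(names, key=total_similarity)


# Made-up within-cluster similarities for a three-feature cluster.
similarity = {
    frozenset(("X0", "X1")): 0.8,
    frozenset(("X0", "X2")): 0.6,
    frozenset(("X1", "X2")): 0.7,
}

representative = medoid(["X0", "X1", "X2"], similarity)
print(representative)  # X1
```

Unlike the first principal component, the medoid is always one of the original features, which keeps the representative directly interpretable.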
