HAllA (Hierarchical All-against-All association) is a tool to find multi-resolution associations in high-dimensional, heterogeneous datasets. For a pair of datasets containing measurements that describe the same set of samples, Hierarchical All-against-All Association (HAllA) testing proceeds by 1) discretizing features to a uniform representation, 2) hierarchically clustering each dataset separately to generate two data hierarchies, 3) coupling clusters of equivalent resolution between the two data hierarchies, and 4) iteratively testing coupled clusters of increasing resolution for statistically significant association.
Gholamali Rahnavard, Eric A. Franzosa, Lauren J. McIver, Emma Schwager, Jason Lloyd-Price, George Weingart, Yo Sup Moon, Xochitl C. Morgan, Levi Waldron, Curtis Huttenhower, High-sensitivity pattern discovery in large multi'omic datasets. huttenhower.sph.harvard.edu/halla.
HAllA inputs. Data in scientific studies often come paired as two high-dimensional datasets: the dataset X (with p features/rows and n samples/columns) is assumed to contain p predictor variables (features) measured on n samples, which give rise to d response variables contained in the dataset Y (with d features/rows and n samples/columns). Note that column i of X is sampled jointly with column i of Y, so that X and Y are aligned.
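To make the expected layout concrete, here is a minimal pandas sketch (not part of HAllA itself) of loading and sample-aligning two such tab-delimited tables; the file names X.txt and Y.txt are placeholders:

import pandas as pd

# Features are rows, samples are columns; the first column holds feature names.
X = pd.read_csv("X.txt", sep="\t", index_col=0)  # p features x n samples
Y = pd.read_csv("Y.txt", sep="\t", index_col=0)  # d features x n samples

# Keep only the samples present in both tables, in the same order, so that
# column i of X is paired with column i of Y.
shared = X.columns.intersection(Y.columns)
X, Y = X[shared], Y[shared]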
HAllA output. HAllA reports significant associations between clusters of related features. Each association is characterized by a cluster from the first dataset, a cluster from the second dataset, and measures of the statistical significance and effect size of the association between the clusters (p-value, q-value, and similarity score).
- Please direct questions to the HAllA Google Group ( email@example.com ) and subscribe to receive HAllA news.
- For additional information, see the HAllA User Manual.
- How to run
- Visualizing HAllA results
- HAllA extensions
- Coming soon!
You can install HAllA and other bioBakery tools automatically with Conda.
$ conda install -c biobakery halla
This will also install all HAllA dependencies.
You can install HAllA automatically with pip.
$ pip install halla
This will install the latest version of HAllA and all its dependencies.
Alternatively, you can manually install HAllA from source and then manually install the dependencies.
Step 1: Download HAllA and unpack the software:
$ tar xzvf biobakery-halla-<versionid>.tar.gz
$ cd biobakery-halla-<versionid>/
Step 2: Install HAllA:
$ python setup.py install
Add the --user option if you do not have root install permissions.
Step 3: Install the HAllA dependencies.
HAllA requires as input two tab-delimited text files representing two paired datasets describing the same set of samples. Download the set of two files to get started on the tutorial (click on the link, then right-click and use the "Save as..." option on the preview page to download the files). In this tutorial, let's assume you save the files in your Downloads directory (usually the default location).
These two files contain 16 normally-distributed features for 100 samples (all synthetic data). Cluster structure was spiked into each dataset, and some clusters were forced to be associated (for demonstration purposes).
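For intuition, "spiking in" an associated pair of clusters just means deriving a few features in each dataset from a shared latent signal; here is a minimal numpy sketch of the idea (not the code used to generate the demo files):

import numpy as np

rng = np.random.default_rng(0)
n = 100
latent = rng.normal(size=n)  # shared signal driving one cluster in each dataset

# Three correlated features per dataset built from the shared signal, plus 13
# independent noise features each: the two 3-feature clusters will associate.
X = np.vstack([latent + rng.normal(scale=0.5, size=n) for _ in range(3)] +
              [rng.normal(size=n) for _ in range(13)])
Y = np.vstack([latent + rng.normal(scale=0.5, size=n) for _ in range(3)] +
              [rng.normal(size=n) for _ in range(13)])
print(X.shape, Y.shape)  # (16, 100) each, matching the demo files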
Next, run HAllA on the two demo input files, placing the output files in your current working directory under synthetic_output:
$ cd ~/Downloads
$ halla -X X_16_100.txt -Y Y_16_100.txt --output synthetic_output
HAllA uses Spearman's rank correlation as the default similarity metric for continuous data; if at least one feature is categorical, it uses normalized mutual information (NMI) as the similarity metric to compare features. If you would like to run with multiple cores, add the --nproc option. The --fdr option can be used to choose the false discovery rate (FDR) procedure; "bh" refers to Benjamini-Hochberg FDR correction.
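As a rough illustration of the two metrics (not HAllA's internal code; assumes scipy and scikit-learn), Spearman operates directly on continuous vectors, while NMI requires discretized values:

import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)  # monotonically related to x

# Spearman: rank-based, suited to continuous data and monotonic relations.
rho, pvalue = spearmanr(x, y)

# NMI: needs categorical input, so continuous data are discretized first
# (quartile binning here; HAllA's own discretization may differ).
x_bins = np.digitize(x, np.quantile(x, [0.25, 0.5, 0.75]))
y_bins = np.digitize(y, np.quantile(y, [0.25, 0.5, 0.75]))
nmi = normalized_mutual_info_score(x_bins, y_bins)

print(f"Spearman rho={rho:.2f} (p={pvalue:.2g}), NMI={nmi:.2f}")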
The halla command above creates three primary output files.
Let's examine these files individually, starting with associations.txt:
$ column -t -s $'\t' synthetic_output/associations.txt | less -S
association_rank  cluster1    cluster1_similarity_score  cluster2    cluster2_similarity_score  pvalue                  qvalue
1                 X12;X13     0.73301320528211289        Y12;Y13     0.74434573829531803        2.1564742078882269e-14  7.7633071483976174e-13
2                 X14;X15     0.76941176470588246        Y14;Y15     0.65138055222088831        1.5622528336072545e-13  2.8120551004930578e-12
3                 X9;X10;X11  0.54002400960384145        Y9;Y10;Y11  0.67788715486194473        7.8628592554387596e-12  9.4354311065265122e-11
4                 X6;X7;X8    0.64297719087635052        Y6;Y7;Y8    0.5583193277310925         9.5425798025534227e-07  8.5883218222980804e-06
5                 X0;X1;X2    0.63030012004801916        Y0;Y1;Y2    0.63087635054021607        1.7661674082593499e-06  1.2716405339467318e-05
6                 X3;X4;X5    0.56230492196878745        Y3;Y4;Y5    0.68139255702280921        4.0472115008008864e-06  2.4283269004805316e-05
7                 X15         1.0                        Y6          1.0                        0.00027968510022185388  0.0087101931211948768
This file reports associations between clusters, as described above.
Now let's examine similarity_table.txt:
$ column -t -s $'\t' synthetic_output/similarity_table.txt | less -S
#    Y12               Y13               Y14              Y15              Y6                Y7               Y8                Y3
X12  0.851140456182    0.613829531813    0.0271308523409  -0.154477791116  0.00840336134454  -0.0623769507803 -0.226410564226   0.194525
X13  0.659927971188    0.840864345738    0.266458583433   0.0606482593037  -0.0209843937575  -0.0480672268908 -0.195774309724   0.033277
X0   -0.315342136855   -0.124321728691   -0.205378151261  -0.192412965186  -0.0327971188475  -0.253301320528  0.00523409363745  0.129219
X1   0.00792316926771  -0.0368307322929  -0.258967587035  -0.221032412965  -0.0811044417767  -0.0965666266507 -0.0492196878752  0.284417
X2   0.0295318127251   0.0791836734694   -0.0875390156062 -0.0306842737095 -0.0199279711885  -0.070156062425  0.0160864345738   0.110876
X3   -0.0569027611044  -0.187034813926   -0.0319327731092 -0.189339735894  -0.127779111645   -0.0739015606242 0.0584393757503   0.636590
X4   0.0462424969988   0.0257863145258   0.151692677071   0.0448019207683  0.0584393757503   -0.0641056422569 0.117599039616    0.477454
X5   0.116254501801    0.0631452581032   0.14612244898    0.154285714286   0.208547418968    0.121152460984   0.224297719088    0.309387
X9   0.121824729892    -0.0852340936375  0.0276110444178  -0.133733493397  -0.161296518607   -0.0427851140456 -0.0642016806723  0.311596
X10  0.0976230492197   -0.086962785114   0.0378871548619  -0.0967587034814 -0.0316446578631  -0.0776470588235 -0.00600240096038 -0.00792
X11  0.0653541416567   -0.00888355342137 0.0084993997599  -0.0721728691477 -0.0365426170468  0.0275150060024  0.0840816326531   0.096086
X14  0.0838895558223   0.361152460984    0.814549819928   0.610852340936   0.389771908764    0.143049219688   0.0766866746699   -0.33262
X15  0.0668907563025   0.258199279712    0.580984393758   0.825882352941   0.492533013205    0.316014405762   0.304009603842    -0.28720
X6   -0.00782713085234 0.0111884753902   0.295270108043   0.278175270108   0.658967587035    0.511356542617   0.291428571429    -0.38977
X7   -0.0905162064826  -0.0163745498199  0.0460504201681  0.148523409364   0.384585834334    0.629867947179   0.327442977191    -0.31303
X8   -0.138151260504   -0.0952220888355  0.0147418967587  0.205378151261   0.269339735894    0.547755102041   0.544489795918    -0.34386
This file contains pairwise similarity scores for all pairs of features from the first dataset and the second dataset.
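Such a table is straightforward to reproduce for a continuous-data metric; here is a minimal sketch of an all-pairs Spearman table (the toy random tables stand in for the real datasets):

import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Toy aligned tables (features x samples); substitute the real datasets.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(4, 100)), index=["X0", "X1", "X2", "X3"])
Y = pd.DataFrame(rng.normal(size=(4, 100)), index=["Y0", "Y1", "Y2", "Y3"])

# One Spearman coefficient per (X feature, Y feature) pair.
table = pd.DataFrame(
    [[spearmanr(X.loc[i], Y.loc[j])[0] for j in Y.index] for i in X.index],
    index=X.index, columns=Y.index,
)
print(table.round(3))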
The --write-hypothesis-tree option can be used to make the halla command write out the hypothesis tree. The tree will be in the hypotheses_tree.txt file:
$ column -t -s $'\t' synthetic_output/hypotheses_tree.txt | less -S
Level  Dataset 1                                               Dataset 2
0      X12;X13;X0;X1;X2;X3;X4;X5;X9;X10;X11;X14;X15;X6;X7;X8   Y12;Y13;Y14;Y15;Y6;Y7;Y8;Y3;Y4;Y5;Y9;Y10;Y11;Y0;Y1;Y2
1      X9;X10;X11   Y3;Y4;Y5
1      X9;X10;X11   Y12;Y13
1      X9;X10;X11   Y0;Y1;Y2
1      X9;X10;X11   Y9;Y10;Y11
1      X9;X10;X11   Y6;Y7;Y8
1      X9;X10;X11   Y14;Y15
1      X12;X13      Y3;Y4;Y5
1      X12;X13      Y12;Y13
1      X12;X13      Y0;Y1;Y2
1      X12;X13      Y9;Y10;Y11
1      X12;X13      Y6;Y7;Y8
1      X12;X13      Y14;Y15
1      X3;X4;X5     Y3;Y4;Y5
1      X3;X4;X5     Y12;Y13
1      X3;X4;X5     Y0;Y1;Y2
1      X3;X4;X5     Y9;Y10;Y11
1      X3;X4;X5     Y6;Y7;Y8
1      X3;X4;X5     Y14;Y15
1      X0;X1;X2     Y3;Y4;Y5
1      X0;X1;X2     Y12;Y13
1      X0;X1;X2     Y0;Y1;Y2
1      X0;X1;X2     Y9;Y10;Y11
1      X0;X1;X2     Y6;Y7;Y8
1      X0;X1;X2     Y14;Y15
1      X6;X7;X8     Y3;Y4;Y5
1      X6;X7;X8     Y12;Y13
1      X6;X7;X8     Y0;Y1;Y2
1      X6;X7;X8     Y9;Y10;Y11
1      X6;X7;X8     Y6;Y7;Y8
1      X6;X7;X8     Y14;Y15
1      X14;X15      Y3;Y4;Y5
1      X14;X15      Y12;Y13
1      X14;X15      Y0;Y1;Y2
1      X14;X15      Y9;Y10;Y11
1      X14;X15      Y6;Y7;Y8
1      X14;X15      Y14;Y15
2      X9           Y3
2      X9           Y5
(and continues)
This file contains a comprehensive report of all tests performed during the HAllA run (not limited to the significant associations reported in associations.txt). Level zero holds all of the features; HAllA starts performing tests from level 1.
- What is the pairwise similarity score between feature X9 and feature Y11 in association number 3? How does this score compare to the similarity score given for association number 3?
- What is the pairwise similarity score between X15 and Y6? How does this compare to the strength of association considered above? How is this difference reflected in the *p*-value and *q*-value of the test for these two features?
Here we will consider subsets of a published dataset (Morgan et al., Genome Biology 2015) that combined 1) 16S rRNA amplicon sequencing of the human gut microbiome (64 taxa) and 2) Affymetrix microarray screens of colonic RNA expression (100 genes) across 204 patients with ulcerative colitis. We will refer to this as the "pouchitis dataset." The purpose of this study was to associate human genes and microbial taxa with the recurrence of inflammation following ileal resection surgery (a surgical procedure in ulcerative colitis that removes the large intestine and rectum and attaches the lowest part of the small intestine to an opening in the abdominal wall so that waste can leave the body).
Download the paired, subsampled OTU-gene datasets:
Run HAllA on these datasets:
$ halla -X otu_299.txt -Y gene_200.txt -o pouchitis_output -m spearman --header -q 0.05
Note the addition of the -q flag: it defines the target FDR (here 0.05), i.e. the expected fraction of false positives among the returned significant associations. The --header flag uses the headers of the two datasets to find the common columns (samples) and reorder them.
hallagram is a tool included with HAllA for visualizing the three output files we looked at in text-form above. Run hallagram as follows (use hallagram -h for help with plot options):
$ cd synthetic_output
$ hallagram similarity_table.txt hypotheses_tree.txt associations.txt --outfile hallagram.png
Please open the file hallagram_strongest_7.png, which should look like this:
- How many features are involved in the largest association in the figure? How many pairwise associations does this cluster association represent? Does it appear as though the pairwise associations are reasonably homogeneous in terms of their strength?
- Are there any pairs of clusters with a significant negative association?
- Do you think that HAllA's approach improved statistical power in this scenario? How would power be different if all X and Y features were compared individually?
Let's try some of the other hallagram options using the pouchitis dataset (gut OTUs and host gene expression):
$ cd pouchitis_output
$ hallagram similarity_table.txt hypotheses_tree.txt associations.txt --outfile hallagram.png --similarity Spearman --axlabels "Microbial OTUs" "Host transcripts" --strongest 50
- The --similarity option names the similarity metric used in this analysis in the figure legend.
- The --axlabels option adds the X-axis and Y-axis labels.
- The --strongest 50 option plots the 50 strongest associations ordered by similarity score (--order-by can be used to order by pvalue or qvalue instead).
- How would you interpret association number 10?
- How would your answers regarding the previous hallagram change here?
HAllA uses a hierarchical approach for blockwise association testing between clusters at different levels of the two hierarchies. Naive all-against-all (AllA) pairwise association testing can be selected with -a AllA; the default is -a HAllA. Let's try the AllA approach on the synthetic data:
$ halla -X X_16_100.txt -Y Y_16_100.txt --output synthetic_output_alla -m spearman -a AllA
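As a rough sketch of what the naive strategy does (illustrative only, not HAllA's code; assumes scipy and statsmodels), every feature pair is tested individually, so the number of hypotheses grows as p x d:

from itertools import product

import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests

# Toy stand-ins for the two aligned feature tables (features x samples).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(16, 100)), index=[f"X{i}" for i in range(16)])
Y = pd.DataFrame(rng.normal(size=(16, 100)), index=[f"Y{i}" for i in range(16)])

# AllA: one Spearman test per feature pair -> 16 * 16 = 256 hypotheses here.
pairs = list(product(X.index, Y.index))
pvals = [spearmanr(X.loc[i], Y.loc[j])[1] for i, j in pairs]

# BH correction over all pairwise tests; HAllA instead tests coupled clusters,
# reducing the number of hypotheses and potentially improving power.
reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(f"{reject.sum()} of {len(pairs)} pairs significant at FDR 0.05")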
- What has changed in the output files?
HAllA is extensible with respect to similarity metrics. By default, HAllA uses normalized mutual information (NMI), discretizing the datasets in order to apply it. We recommend the Spearman coefficient when all data are continuous, the number of samples is small, and you are looking for monotonic associations. Adjusted mutual information (AMI), the maximal information coefficient (MIC), Pearson correlation, distance correlation (dcor), and the discretized maximal information coefficient (DMIC) are the other similarity metrics currently implemented in HAllA. The similarity metric is used both to build the hierarchical clusters and in HAllA's permutation tests. Pass -m $SIMILARITY_METRIC on the halla command line. Let's try dcor as the similarity metric:
$ halla -X X_16_100.txt -Y Y_16_100.txt --output synthetic_output_dcor -m dcor
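To see why the metric choice matters, distance correlation detects non-monotonic dependence that Spearman misses; a minimal sketch assuming the third-party dcor Python package (pip install dcor):

import dcor
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = x**2 + rng.normal(scale=0.05, size=200)  # strong but non-monotonic relation

print("Spearman:", round(spearmanr(x, y)[0], 3))  # near zero: the rank test misses it
print("dcor:", round(dcor.distance_correlation(x, y), 3))  # clearly nonzero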
- What has changed in the output files?
Benjamini-Hochberg (BH), Benjamini-Yekutieli (BY), and Bonferroni are implemented as multiple-test correction methods in HAllA. BH is the default and can be changed with the --fdr option. Let's try Bonferroni:
$ halla -X X_16_100.txt -Y Y_16_100.txt --output synthetic_output_bonferroni -m spearman --fdr bonferroni
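To see why Bonferroni rejects fewer hypotheses than BH, here is a minimal pure-Python sketch of the two procedures' arithmetic (illustrative only, not HAllA's implementation):

import numpy as np

def bonferroni(pvals, alpha=0.05):
    # Reject p_i if p_i <= alpha / m: controls the family-wise error rate.
    m = len(pvals)
    return np.array([p <= alpha / m for p in pvals])

def benjamini_hochberg(pvals, alpha=0.05):
    # Step-up rule: find the largest k with p_(k) <= (k/m) * alpha and
    # reject the k smallest p-values; controls the false discovery rate.
    m = len(pvals)
    order = np.argsort(pvals)
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * alpha:
            k_max = rank
    reject = np.zeros(m, dtype=bool)
    reject[order[:k_max]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.74]
print(bonferroni(pvals).sum(), "rejections with Bonferroni")  # 1: strictest
print(benjamini_hochberg(pvals).sum(), "rejections with BH")  # 2: less strict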
- What has changed in the output files?
- How do the results change in the AllA case with Bonferroni?
By default, the medoid of a cluster is used as its representative in association testing. Decomposition methods such as principal component analysis (PCA), multiple correspondence analysis (MCA), and independent component analysis (ICA) can be used instead. PCA and ICA can be used only with continuous data and when looking for monotonic relations; with PCA or ICA, the first component is used as the representative of a cluster. MCA can be used for both categorical and continuous data. Let's try PCA:
$ halla -X X_16_100.txt -Y Y_16_100.txt --output synthetic_output_pca -m spearman -d pca
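For intuition, here is a minimal sketch of the two kinds of cluster representative (assuming numpy, scipy, and scikit-learn; illustrative, not HAllA's internal code):

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
cluster = rng.normal(size=(3, 100))  # one cluster: 3 features x 100 samples

# Medoid: the actual feature whose total distance to the other members is
# smallest (plain Euclidean distance here; HAllA derives distances from the
# chosen similarity metric).
dist = squareform(pdist(cluster))
medoid = cluster[dist.sum(axis=1).argmin()]

# PCA: the first principal component across the cluster's features acts as a
# synthetic representative (continuous data, monotonic relations only).
pc1 = PCA(n_components=1).fit_transform(cluster.T).ravel()

print(medoid.shape, pc1.shape)  # each is one profile over the 100 samples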
Let's start with mixed data and try different similarity metrics. The following command generates a pair of datasets with mixed (categorical and continuous) data of 32 features and 100 samples, with uniform distributions and balanced clusters:
$ halladata -f 32 -n 100 -a mixed -d uniform -s balanced -o halla_data_f32_n100_mixed
First, try running it with the similarity metric specified:
$ halla -X halla_data_f32_n100_mixed/X_mixed_32_100.txt -Y halla_data_f32_n100_mixed/Y_mixed_32_100.txt -o halla_output_f32_n100_mixed -m spearman
- What happens and why?
Let the tool decide the similarity metric:
$ halla -X halla_data_f32_n100_mixed/X_mixed_32_100.txt -Y halla_data_f32_n100_mixed/Y_mixed_32_100.txt -o halla_output_f32_n100_mixed
- What is the discretizing method and what does it do?
Now, let's generate synthetic continuous data with linear relations between and within the datasets, and see how the Spearman coefficient versus normalized mutual information report the significant associations.
$ halladata -f 32 -n 50 -a line -d uniform -s balanced -o halla_data_f32_n50_line
Run HAllA with the default similarity metric (Spearman will be used, as all data are continuous):
$ halla -X halla_data_f32_n50_line/X_line_32_50.txt -Y halla_data_f32_n50_line/Y_line_32_50.txt -o halla_output_f32_n50_line_spearman
Run HAllA with the NMI similarity metric (the data will be discretized):
$ halla -X halla_data_f32_n50_line/X_line_32_50.txt -Y halla_data_f32_n50_line/Y_line_32_50.txt -o halla_output_f32_n50_line_nmi -m nmi
- Open the hallagrams from the last two runs and see how they differ. What do you think causes this?