# HAllA tutorial

HAllA (Hierarchical All-against-All association) is a tool to find multi-resolution associations in high-dimensional, heterogeneous datasets. For a pair of datasets containing measurements that describe the same set of samples, Hierarchical All-against-All Association (HAllA) testing proceeds by 1) discretizing features to a uniform representation, 2) hierarchically clustering each dataset separately to generate two data hierarchies, 3) coupling clusters of equivalent resolution between the two data hierarchies, and 4) iteratively testing coupled clusters of increasing resolution for statistically significant association.

Citation:

Gholamali Rahnavard, Eric A. Franzosa, Lauren J. McIver, Emma Schwager, Jason Lloyd-Price, George Weingart, Yo Sup Moon, Xochitl C. Morgan, Levi Waldron, Curtis Huttenhower, High-sensitivity pattern discovery in large multi'omic datasets. huttenhower.sph.harvard.edu/halla.

HAllA inputs. Data in scientific studies often come paired in the form of two high-dimensional datasets. The dataset X (with p features/rows and n samples/columns) is assumed to hold p predictor variables (or features) measured on n samples, which give rise to the d response variables contained in the dataset Y (with d features/rows and n samples/columns). Note that column i of X is sampled jointly with column i of Y, so that X and Y are aligned.
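To make this layout concrete, the sketch below parses two small tab-delimited tables with features in rows and samples in columns, then checks that their sample columns line up. The table contents, the `#` corner cell, and the helper `read_table` are hypothetical and chosen only for illustration; this is not HAllA's own loader.

```python
import csv
import io

# Hypothetical toy inputs: features in rows, samples in columns, tab-delimited.
X_TEXT = "#\tS1\tS2\tS3\nX0\t0.1\t0.4\t0.9\nX1\t1.2\t0.8\t0.3\n"
Y_TEXT = "#\tS1\tS2\tS3\nY0\t5\t7\t9\n"

def read_table(text):
    """Return (sample_names, {feature_name: [values]}) from a tab-delimited table."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    samples = rows[0][1:]  # header row minus the corner cell
    features = {r[0]: [float(v) for v in r[1:]] for r in rows[1:]}
    return samples, features

x_samples, x_features = read_table(X_TEXT)
y_samples, y_features = read_table(Y_TEXT)

# Column i of X must describe the same sample as column i of Y.
aligned = x_samples == y_samples
print(len(x_features), "X features,", len(y_features), "Y features, aligned:", aligned)
```

Here p = 2, d = 1, and n = 3; the only requirement that couples the two tables is the shared, identically ordered set of sample columns.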

HAllA output. HAllA reports significant associations between clusters of related features. Each association is characterized by a cluster from the first dataset, a cluster from the second dataset, and measures of the statistical significance and effect size of the association between the clusters (p-value, q-value, and similarity score).

## Installation

### Conda

You can install HAllA and other bioBakery tools automatically with Conda.

$ conda install -c biobakery halla

This will also install all HAllA dependencies.

### pip install

You can install HAllA automatically with pip.

$ pip install halla


This will install the latest version of HAllA and all its dependencies.

### From Source

Alternatively, you can manually install HAllA from source and then manually install the dependencies.

Step 1: Unpack the HAllA source archive:

$ tar xzvf biobakery-halla-<versionid>.tar.gz

$ cd biobakery-halla-<versionid>/


Step 2: Install HAllA:

To run HAllA on the paired synthetic datasets:

$ halla -X X_16_100.txt -Y Y_16_100.txt --output synthetic_output

HAllA uses Spearman's rank correlation as the default similarity metric for continuous data. If at least one feature is categorical, it uses normalized mutual information (NMI) as the similarity metric to compare features. If you would like to run with multiple cores, add the option --nproc. The --fdr option can be used to define the false discovery rate (FDR) procedure; "bh" refers to Benjamini-Hochberg FDR correction.

The command above creates three primary output files:

• associations.txt
• similarity_table.txt
• hypotheses_tree.txt

Let's examine these files individually, starting with associations.txt:

$ column -t -s $'\t' synthetic_output/associations.txt | less -S

This yields:

| association_rank | cluster1 | cluster1_similarity_score | cluster2 | cluster2_similarity_score | pvalue | qvalue |
|---|---|---|---|---|---|---|
| 1 | X12;X13 | 0.73301320528211289 | Y12;Y13 | 0.74434573829531803 | 2.1564742078882269e-14 | 7.7633071483976174e-13 |
| 2 | X14;X15 | 0.76941176470588246 | Y14;Y15 | 0.65138055222088831 | 1.5622528336072545e-13 | 2.8120551004930578e-12 |
| 3 | X9;X10;X11 | 0.54002400960384145 | Y9;Y10;Y11 | 0.67788715486194473 | 7.8628592554387596e-12 | 9.4354311065265122e-11 |
| 4 | X6;X7;X8 | 0.64297719087635052 | Y6;Y7;Y8 | 0.5583193277310925 | 9.5425798025534227e-07 | 8.5883218222980804e-06 |
| 5 | X0;X1;X2 | 0.63030012004801916 | Y0;Y1;Y2 | 0.63087635054021607 | 1.7661674082593499e-06 | 1.2716405339467318e-05 |
| 6 | X3;X4;X5 | 0.56230492196878745 | Y3;Y4;Y5 | 0.68139255702280921 | 4.0472115008008864e-06 | 2.4283269004805316e-05 |
| 7 | X15 | 1.0 | Y6 | 1.0 | 0.00027968510022185388 | 0.0087101931211948768 |

This file reports associations between clusters, as described above.
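Because associations.txt is a tab-delimited table, it is easy to post-process. The sketch below is a minimal illustration that assumes the column names shown above; it embeds three of the seven rows from that output, and the helper name `significant_pairs` and the threshold are hypothetical.

```python
import csv
import io

# Three of the rows shown above, tab-delimited as in associations.txt.
ASSOCIATIONS = (
    "association_rank\tcluster1\tcluster1_similarity_score\tcluster2\t"
    "cluster2_similarity_score\tpvalue\tqvalue\n"
    "1\tX12;X13\t0.73301320528211289\tY12;Y13\t0.74434573829531803\t"
    "2.1564742078882269e-14\t7.7633071483976174e-13\n"
    "6\tX3;X4;X5\t0.56230492196878745\tY3;Y4;Y5\t0.68139255702280921\t"
    "4.0472115008008864e-06\t2.4283269004805316e-05\n"
    "7\tX15\t1.0\tY6\t1.0\t0.00027968510022185388\t0.0087101931211948768\n"
)

def significant_pairs(text, max_q):
    """Yield (rank, X features, Y features) for rows with qvalue <= max_q."""
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        if float(row["qvalue"]) <= max_q:
            yield (
                int(row["association_rank"]),
                row["cluster1"].split(";"),  # cluster members are ';'-separated
                row["cluster2"].split(";"),
            )

hits = list(significant_pairs(ASSOCIATIONS, max_q=1e-3))
for rank, xs, ys in hits:
    print(rank, xs, ys)
```

With the threshold of 1e-3 used here, associations 1 and 6 are kept and association 7 is dropped. To read a real file, replace `io.StringIO(text)` with an opened file handle.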
Now let's examine similarity_table.txt:

$ column -t -s $'\t' synthetic_output/similarity_table.txt | less -S

This yields (the view is cut off at the right edge of the terminal, so only the first eight Y columns appear and the last one is shortened):

| # | Y12 | Y13 | Y14 | Y15 | Y6 | Y7 | Y8 | Y3 |
|---|---|---|---|---|---|---|---|---|
| X12 | 0.851140456182 | 0.613829531813 | 0.0271308523409 | -0.154477791116 | 0.00840336134454 | -0.0623769507803 | -0.226410564226 | 0.194525 |
| X13 | 0.659927971188 | 0.840864345738 | 0.266458583433 | 0.0606482593037 | -0.0209843937575 | -0.0480672268908 | -0.195774309724 | 0.033277 |
| X0 | -0.315342136855 | -0.124321728691 | -0.205378151261 | -0.192412965186 | -0.0327971188475 | -0.253301320528 | 0.00523409363745 | 0.129219 |
| X1 | 0.00792316926771 | -0.0368307322929 | -0.258967587035 | -0.221032412965 | -0.0811044417767 | -0.0965666266507 | -0.0492196878752 | 0.284417 |
| X2 | 0.0295318127251 | 0.0791836734694 | -0.0875390156062 | -0.0306842737095 | -0.0199279711885 | -0.070156062425 | 0.0160864345738 | 0.110876 |
| X3 | -0.0569027611044 | -0.187034813926 | -0.0319327731092 | -0.189339735894 | -0.127779111645 | -0.0739015606242 | 0.0584393757503 | 0.636590 |
| X4 | 0.0462424969988 | 0.0257863145258 | 0.151692677071 | 0.0448019207683 | 0.0584393757503 | -0.0641056422569 | 0.117599039616 | 0.477454 |
| X5 | 0.116254501801 | 0.0631452581032 | 0.14612244898 | 0.154285714286 | 0.208547418968 | 0.121152460984 | 0.224297719088 | 0.309387 |
| X9 | 0.121824729892 | -0.0852340936375 | 0.0276110444178 | -0.133733493397 | -0.161296518607 | -0.0427851140456 | -0.0642016806723 | 0.311596 |
| X10 | 0.0976230492197 | -0.086962785114 | 0.0378871548619 | -0.0967587034814 | -0.0316446578631 | -0.0776470588235 | -0.00600240096038 | -0.00792 |
| X11 | 0.0653541416567 | -0.00888355342137 | 0.0084993997599 | -0.0721728691477 | -0.0365426170468 | 0.0275150060024 | 0.0840816326531 | 0.096086 |
| X14 | 0.0838895558223 | 0.361152460984 | 0.814549819928 | 0.610852340936 | 0.389771908764 | 0.143049219688 | 0.0766866746699 | -0.33262 |
| X15 | 0.0668907563025 | 0.258199279712 | 0.580984393758 | 0.825882352941 | 0.492533013205 | 0.316014405762 | 0.304009603842 | -0.28720 |
| X6 | -0.00782713085234 | 0.0111884753902 | 0.295270108043 | 0.278175270108 | 0.658967587035 | 0.511356542617 | 0.291428571429 | -0.38977 |
| X7 | -0.0905162064826 | -0.0163745498199 | 0.0460504201681 | 0.148523409364 | 0.384585834334 | 0.629867947179 | 0.327442977191 | -0.31303 |
| X8 | -0.138151260504 | -0.0952220888355 | 0.0147418967587 | 0.205378151261 | 0.269339735894 | 0.547755102041 | 0.544489795918 | -0.34386 |

This file contains pairwise similarity scores for all pairs of features from the first dataset and the second dataset.

The --write-hypothesis-tree option of the halla command writes the hypothesis tree to the hypotheses_tree.txt file:

$ column -t -s $'\t' synthetic_output/hypotheses_tree.txt | less -S

This yields:

| Level | Dataset 1 | Dataset 2 |
|---|---|---|
| 0 | X12;X13;X0;X1;X2;X3;X4;X5;X9;X10;X11;X14;X15;X6;X7;X8 | Y12;Y13;Y14;Y15;Y6;Y7;Y8;Y3;Y4;Y5;Y9;Y10;Y11;Y0;Y1;Y2 |
| 1 | X9;X10;X11 | Y3;Y4;Y5 |
| 1 | X9;X10;X11 | Y12;Y13 |
| 1 | X9;X10;X11 | Y0;Y1;Y2 |
| 1 | X9;X10;X11 | Y9;Y10;Y11 |
| 1 | X9;X10;X11 | Y6;Y7;Y8 |
| 1 | X9;X10;X11 | Y14;Y15 |
| 1 | X12;X13 | Y3;Y4;Y5 |
| 1 | X12;X13 | Y12;Y13 |
| 1 | X12;X13 | Y0;Y1;Y2 |
| 1 | X12;X13 | Y9;Y10;Y11 |
| 1 | X12;X13 | Y6;Y7;Y8 |
| 1 | X12;X13 | Y14;Y15 |
| 1 | X3;X4;X5 | Y3;Y4;Y5 |
| 1 | X3;X4;X5 | Y12;Y13 |
| 1 | X3;X4;X5 | Y0;Y1;Y2 |
| 1 | X3;X4;X5 | Y9;Y10;Y11 |
| 1 | X3;X4;X5 | Y6;Y7;Y8 |
| 1 | X3;X4;X5 | Y14;Y15 |
| 1 | X0;X1;X2 | Y3;Y4;Y5 |
| 1 | X0;X1;X2 | Y12;Y13 |
| 1 | X0;X1;X2 | Y0;Y1;Y2 |
| 1 | X0;X1;X2 | Y9;Y10;Y11 |
| 1 | X0;X1;X2 | Y6;Y7;Y8 |
| 1 | X0;X1;X2 | Y14;Y15 |
| 1 | X6;X7;X8 | Y3;Y4;Y5 |
| 1 | X6;X7;X8 | Y12;Y13 |
| 1 | X6;X7;X8 | Y0;Y1;Y2 |
| 1 | X6;X7;X8 | Y9;Y10;Y11 |
| 1 | X6;X7;X8 | Y6;Y7;Y8 |
| 1 | X6;X7;X8 | Y14;Y15 |
| 1 | X14;X15 | Y3;Y4;Y5 |
| 1 | X14;X15 | Y12;Y13 |
| 1 | X14;X15 | Y0;Y1;Y2 |
| 1 | X14;X15 | Y9;Y10;Y11 |
| 1 | X14;X15 | Y6;Y7;Y8 |
| 1 | X14;X15 | Y14;Y15 |
| 2 | X9 | Y3 |
| 2 | X9 | Y5 |

The file continues beyond these rows.

This file contains a comprehensive report of all testing performed during the HAllA run (not limited to the significant associations reported in associations.txt). Level 0 holds all the features; HAllA starts performing tests from level 1.

• What is the pairwise similarity score between feature X9 and feature Y11 in association number 3? How does this score compare to the similarity score given for association number 3?
• What is the pairwise similarity score between X15 and Y6? How does this compare to the strength of association considered above?
How is this difference reflected in the *p*-value and *q*-value of the test for these two features?

### Human gut microbiome versus host transcriptome in ulcerative colitis

Here we will consider subsets of a published dataset (Morgan et al., Genome Biology 2015) that combined 1) 16S rRNA amplicon sequencing of the human gut microbiome (64 taxa) and 2) Affymetrix microarray screens of colonic RNA expression (100 genes) across 204 patients with ulcerative colitis. We will refer to this as the "pouchitis dataset." The purpose of this study was to associate human genes and microbial taxa with the recurrence of inflammation following ileal resection surgery (a surgical procedure in ulcerative colitis that involves removing the large intestine and rectum and attaching the lowest part of the small intestine to a hole made in the abdominal wall to allow waste to leave the body).

Download the paired, subsampled OTU-gene datasets:

Run HAllA on these datasets:

$ halla -X otu_299.txt -Y gene_200.txt -o pouchitis_output -m spearman --header -q 0.05


Note the addition of the "-q" flag: this flag defines the target FDR, here 0.05, i.e. the expected fraction of false positive reports among returned significant associations. --header uses the header of the two datasets to find common columns (samples) and reorder them.
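The reordering described for --header can be pictured with the sketch below. It is an illustration only, not HAllA's implementation: the function `align_by_header`, the sample names, and the feature names are all hypothetical.

```python
def align_by_header(x_samples, x_rows, y_samples, y_rows):
    """Keep only samples present in both tables, in X's order.

    x_rows / y_rows map a feature name to its list of values, one per sample.
    """
    x_index = {name: i for i, name in enumerate(x_samples)}
    y_index = {name: i for i, name in enumerate(y_samples)}
    common = [s for s in x_samples if s in y_index]
    x_out = {f: [vals[x_index[s]] for s in common] for f, vals in x_rows.items()}
    y_out = {f: [vals[y_index[s]] for s in common] for f, vals in y_rows.items()}
    return common, x_out, y_out

# Hypothetical tables whose sample columns overlap only partly and differ in order.
common, x_aligned, y_aligned = align_by_header(
    ["S1", "S2", "S3"], {"OTU_a": [10, 20, 30]},
    ["S3", "S1", "S4"], {"gene_b": [0.3, 0.1, 0.4]},
)
print(common, x_aligned, y_aligned)
```

After alignment, column i of both tables refers to the same sample, which is the precondition stated in the inputs description.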

## Visualizing HAllA results

hallagram is a tool included with HAllA for visualizing the three output files we looked at in text-form above. Run hallagram as follows (use hallagram -h for help with plot options):

$ cd synthetic_output

$ hallagram similarity_table.txt hypotheses_tree.txt associations.txt --outfile hallagram.png


Please open the file hallagram_strongest_7.png, which should look like this:

• How many features are involved in the largest association in the figure? How many pairwise associations does this cluster association represent? Does it appear as though the pairwise associations are reasonably homogeneous in terms of their strength?
• Are there any pairs of clusters with a significant negative association?
• Do you think that HAllA's approach improved statistical power in this scenario? How would power be different if all X and Y features were compared individually?

Let's try some of the other hallagram options using the pouchitis dataset (gut OTUs and host gene expression):

$ cd pouchitis_output

$ hallagram similarity_table.txt hypotheses_tree.txt associations.txt --outfile hallagram.pdf --outfile hallagram.png --similarity Spearman --axlabels "Microbial OTUs" "Host transcripts" --strongest 50

• The --similarity option names, in the legend, the similarity metric that was used in this analysis.
• The --axlabels option adds the X-axis and Y-axis labels.
• The --strongest 50 option uses the 50 strongest associations, ordered by similarity score (--order-by can be used to order by pvalue or qvalue instead of similarity score).

• How would you interpret association number 10?

## HAllA extensions

### Pairwise association testing

HAllA uses a hierarchical approach for blockwise association testing between clusters at different levels of the hierarchies. Naive pairwise all-against-all (AllA) association testing can be selected with -a AllA; the default is -a HAllA. Let's try the AllA approach on the synthetic data:

$ halla -X X_16_100.txt -Y Y_16_100.txt --output synthetic_output_dcor -m dcor

• What has changed in the output files?

### Multiple tests correction (FDR) methods

Benjamini–Hochberg (BH), Benjamini–Yekutieli (BY), and Bonferroni are implemented as multiple test correction methods in HAllA. BH is the default and can be changed with the --fdr option. Let's try Bonferroni:

$ halla -X X_16_100.txt -Y Y_16_100.txt --output synthetic_output_bonferroni -m spearman --fdr bonferroni


• What has changed in the output files?
• How do the results change in the AllA case with Bonferroni?
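For intuition about why the choice of correction matters, the sketch below applies textbook Bonferroni and Benjamini–Hochberg adjustments to a small list of p-values. The p-values are hypothetical, and the functions are plain reference implementations, not HAllA's internal code (which applies its corrections within the hypothesis tree).

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: p * m, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """BH-adjusted p-values (q-values), returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.001, 0.008, 0.012, 0.041, 0.20]  # hypothetical p-values
threshold = 0.05
n_bonferroni = sum(q <= threshold for q in bonferroni(pvals))
n_bh = sum(q <= threshold for q in benjamini_hochberg(pvals))
print("Bonferroni keeps", n_bonferroni, "tests; BH keeps", n_bh)
```

On this toy input Bonferroni keeps two tests and BH keeps three at a 0.05 threshold, reflecting that Bonferroni controls the family-wise error rate and is therefore the more conservative of the two.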

### Decomposition (representative) methods

The medoid of a cluster is used as its representative in association testing. Decomposition methods such as principal component analysis (PCA), multiple correspondence analysis (MCA), and independent component analysis (ICA) can also be used. PCA and ICA can be used only with continuous data and when looking for monotonic relations. With PCA or ICA, the first component is used as the representative of a cluster. MCA can be used for both categorical and continuous data. Let's try PCA:

$ halla -X X_16_100.txt -Y Y_16_100.txt --output synthetic_output_pca -m spearman -d pca
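As a point of comparison with the decomposition methods, the medoid representative described above can be sketched as follows: it is the cluster member with the smallest total distance to the other members. The distance matrix is hypothetical, and how HAllA derives distances from its similarity metric is not shown here.

```python
def medoid(names, dist):
    """Return the member with the smallest total distance to the others.

    dist is a symmetric matrix (list of lists) aligned with names.
    """
    totals = [sum(row) for row in dist]
    return names[totals.index(min(totals))]

# Hypothetical distances among three features of one cluster
# (for example 1 - |similarity|; that choice is an assumption).
names = ["X6", "X7", "X8"]
dist = [
    [0.0, 0.3, 0.6],
    [0.3, 0.0, 0.2],
    [0.6, 0.2, 0.0],
]
print(medoid(names, dist))
```

Unlike a first principal component, the medoid is always one of the original features, so it works for categorical as well as continuous data.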


### Power evaluation for different data types and similarity metrics

Let's start with mixed data and try different similarity metrics. The following command generates a pair of datasets with mixed (categorical and continuous) data of 32 features and 100 samples, with a uniform distribution and balanced clusters:

halladata -f 32 -n 100 -a mixed -d uniform -s balanced -o halla_data_f32_n100_mixed

First, try to run it by specifying the similarity metric:

halla -X halla_data_f32_n100_mixed/X_mixed_32_100.txt -Y halla_data_f32_n100_mixed/Y_mixed_32_100.txt -o halla_output_f32_n100_mixed -m spearman

What happens and why?

Let the tool decide the similarity metric:

halla -X halla_data_f32_n100_mixed/X_mixed_32_100.txt -Y halla_data_f32_n100_mixed/Y_mixed_32_100.txt -o halla_output_f32_n100_mixed

What is the discretization method, and what does it do?

Now, let's generate synthetic continuous data with linear relationships between and within the datasets, and see how the Spearman coefficient and normalized mutual information differ in the significant associations they report.

halladata -f 32 -n 50 -a line -d uniform -s balanced -o halla_data_f32_n50_line

Run HAllA with default similarity metric (will use spearman as all data are continuous):

halla -X halla_data_f32_n50_line/X_line_32_50.txt -Y halla_data_f32_n50_line/Y_line_32_50.txt -o halla_output_f32_n50_line_spearman
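To recall what the Spearman metric measures, the sketch below computes it from scratch as the Pearson correlation of the ranks. The data vectors are hypothetical, and this is a textbook implementation rather than the routine HAllA calls.

```python
from math import sqrt

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1
        for k in range(start, end + 1):
            out[order[k]] = avg
        start = end + 1
    return out

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def spearman(a, b):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 8.0, 27.0, 64.0, 125.0]  # monotonic in x but not linear
print(spearman(x, y), pearson(x, y))
```

Because only ranks enter the calculation, any strictly increasing relationship scores the maximum, which is why Spearman suits monotonic relations in continuous data.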

Run HAllA with NMI similarity metric (will discretize data):

halla -X halla_data_f32_n50_line/X_line_32_50.txt -Y halla_data_f32_n50_line/Y_line_32_50.txt -o halla_output_f32_n50_line_nmi -m nmi
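The sketch below shows one way continuous values can be discretized and then compared with NMI. Both choices are assumptions made for illustration: equal-frequency binning into a fixed number of bins, and normalization of the mutual information by the geometric mean of the two entropies. HAllA's own binning rule and normalization may differ, and the data are hypothetical.

```python
from collections import Counter
from math import log, sqrt

def discretize(values, bins):
    """Equal-frequency binning: map each value to a bin index 0..bins-1 by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for position, i in enumerate(order):
        labels[i] = position * bins // len(values)
    return labels

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())

def nmi(a, b):
    """Mutual information divided by sqrt(H(a) * H(b)) (one common normalization)."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(c / n * log(c * n / (pa[x] * pb[y])) for (x, y), c in pab.items())
    denom = sqrt(entropy(a) * entropy(b))
    return mi / denom if denom > 0 else 0.0

x = [0.1, 0.5, 0.9, 1.3, 1.7, 2.1, 2.5, 2.9]
y = [2 * v + 1 for v in x]      # perfectly linear in x
z = [1, 5, 2, 6, 3, 7, 4, 8]    # alternates low/high, unrelated to x after binning
print(nmi(discretize(x, 2), discretize(y, 2)))
print(nmi(discretize(x, 2), discretize(z, 2)))
```

Binning discards the ordering of values within a bin, so NMI sees less of a linear trend than a rank-based metric does; keep that in mind when comparing the two runs.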

Open the hallagrams from the last two runs and see how they differ. What do you think causes this?
