# CCREPE Tutorial

The CCREPE (Compositionality Corrected by REnormalization and PErmutation) package is designed to assess the significance of general similarity measures in compositional datasets.

CCREPE can be downloaded from this link. CCREPE is also available as a Bitbucket repository. For additional information, please refer to the CCREPE paper (in progress) and the documentation.

If you use this package, please cite as below:
Emma Schwager and Colleagues. Detecting statistically significant associations between sparse and high dimensional compositional data. In Progress

## 1. Installation

CCREPE can be installed using either of the following two options.

### 1.1 Pre-requisites

• R needs to be installed on your computer.

Linux users on Debian-based distributions can install R with apt-get install r-base. Users on other platforms should refer to the R website for installation instructions.

• R package infotheo needs to be installed in R.

Once R is installed, run the following command in R to install the package.

> install.packages('infotheo')
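
If you re-run the setup often, you may prefer to install a prerequisite only when it is missing. The following is a minimal sketch of that idea; missing_packages is a hypothetical helper written for this tutorial (it is not part of ccrepe or infotheo), and it relies only on the base R functions requireNamespace and install.packages.

```r
# Hypothetical helper (not part of ccrepe): returns the subset of `pkgs`
# that cannot be loaded in the current R library paths.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# In an interactive session, install only what is missing:
# to_install <- missing_packages(c("infotheo"))
# if (length(to_install) > 0) install.packages(to_install)
```

The install call is left commented out so that sourcing the snippet never triggers a download by itself.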


### 1.2 Installing CCREPE

• Download: You may download the ccrepe package from the list.

• For Linux users, you can just run the following command to install the downloaded ccrepe package in R:

$ R CMD INSTALL --build <insert-download-name.tar.gz>

• For users with other platforms, please refer to the R documentation to see how to source external packages.

OR

• Clone the repository: You may clone the repository by running the following command from a terminal.

$ hg clone http://bitbucket.org/biobakery/ccrepe ccrepe

Once the package has been downloaded and installed in R, you may run the following command to load the ccrepe package.

> library(ccrepe)


## 2. Running CCREPE

Once the ccrepe package is installed, you may proceed to use it. Please ensure that R is installed on your computer. For instructions on installing R, please refer to the R website.

The package contains two functions: (i) ccrepe and (ii) nc.score. For instructions on each, please see below.

### 2.2 ccrepe function

ccrepe calculates compositionality-corrected p-values and q-values for compositional data using an arbitrary distance metric. For details about the input arguments, please refer to the detailed documentation.

For the purpose of this tutorial we will run ccrepe on two datasets with nc.score as the similarity score.
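
Because ccrepe is intended for compositional data, it can be useful to confirm before running it that each sample (row) contains no negative values and sums to 1. The sketch below shows one way to do this in base R. check_composition and its tol parameter are hypothetical names introduced for illustration and are not part of the ccrepe package; the three demo rows are the first three samples of the first tutorial dataset. Whether ccrepe itself enforces these conditions is not stated in this tutorial.

```r
# Hypothetical helper (not part of ccrepe): reports samples (rows) that
# contain negative values or whose values do not sum to 1 within `tol`.
check_composition <- function(m, tol = 1e-6) {
  m <- as.matrix(m)
  list(
    negative_rows = which(rowSums(m < 0) > 0),
    off_sum_rows  = which(abs(rowSums(m) - 1) > tol)
  )
}

# First three samples of the first tutorial dataset.
demo <- rbind(
  c(0.09913084, 0.12746072, 0.53385029, 0.239558154),
  c(0.39666736, 0.19993817, 0.02417398, 0.379220490),
  c(0.24119443, 0.08419378, 0.32709373, 0.347518058)
)
rownames(demo) <- paste("Sample", 1:3)
colnames(demo) <- paste("Feature", 1:4)

# Both elements are empty when every row passes the checks.
issues <- check_composition(demo)
str(issues)
```

Any row index returned in negative_rows or off_sum_rows points to a sample worth inspecting before it is passed to ccrepe.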

• Open R
• Run the following command to import the library ccrepe
> library(ccrepe)

• The input datasets are shown below for your reference:

test.input

           Feature 1  Feature 2  Feature 3   Feature 4
Sample 1  0.09913084 0.12746072 0.53385029 0.239558154
Sample 2  0.39666736 0.19993817 0.02417398 0.379220490
Sample 3  0.24119443 0.08419378 0.32709373 0.347518058
Sample 4  0.39670572 0.20889021 0.17157276 0.222831316
Sample 5  0.46209528 0.22016053 0.30927015 0.008474046
Sample 6  0.25553284 0.14904298 0.56854622 0.026877963
Sample 7  0.47681832 0.20330031 0.04027400 0.279607363
Sample 8  0.16694612 0.17131849 0.42224798 0.239487416
Sample 9  0.48773148 0.37592572 0.12448270 0.011860096
Sample 10 0.51668975 0.28593023 0.12065695 0.076723068


test.input.2

             Feature 1    Feature 2    Feature 3    Feature 4    Feature 5    Feature 6    Feature 7
Sample 1   0.458561155  0.008092532   0.07722429  0.061862506  0.141716599  0.160429523  0.092113392
Sample 2   0.115176017  0.215269857   0.33960857  0.127598647  0.111312569  0.006027953  0.085006387
Sample 3   0.549371433  0.019962964   0.01227265  0.051829919  0.074611054  0.048762656  0.243189326
Sample 4   0.284740019  0.190046266   0.02880524  0.142821805  0.028813184  0.272138724  0.052634764
Sample 11  0.005447614  0.080074742   0.01086816  0.009454749  0.002404633  0.883554158  0.008195943
Sample 5   0.576470738  0.042814009   0.04274546  0.067392553  0.029867829  0.209886768  0.030822642
Sample 6     0.4530424  0.044092102   0.04207554  0.347114356  0.031553487  0.034537133   0.04758498
Sample 7  -0.088121495  0.114319848   0.38703157  0.107000574   0.24974684  0.204100466  0.025922193
Sample 8   0.146175965  0.517055805   0.13548013  0.119349245  0.020930469  0.030382319  0.030626065
Sample 12  0.004911492  0.414258262   0.07665803  0.008781068  0.026323325  0.396293546  0.072774276
Sample 9   0.220817751  0.054693589   0.11161043  0.229245931  0.153135574  0.108948339  0.121548389
Sample 10  0.179461466  0.148850896   0.07424187   0.28602251  0.048613054  0.091058307  0.171751894
Sample 13   0.24914881  0.171957509   0.13331199  0.043893814  0.027837292  0.243969848  0.129880733
Sample 14  0.619468187  0.175914305   0.01021288  0.050524383  0.018911969  0.109652865  0.015315407
Sample 15  0.024403392  0.118639502    0.1164575  0.196565283  0.299012684   0.02810215  0.216819491

• Run the following command to run CCREPE on the two datasets test.input and test.input.2, with the NC-score as the similarity scoring method (provided via the sim.score argument).
> out <- ccrepe(x = test.input, y = test.input.2, sim.score = nc.score, iterations = 20, min.subj = 10)

• The out variable will contain the following:
  • p.values
  • z.stat
  • q.values
  • sim.score

The output is shown below for your reference:

$p.values
          Feature 1 Feature 2 Feature 3 Feature 4 Feature 5 Feature 6 Feature 7
Feature 1 0.5263075 0.8077156 0.8100249 0.1969555 0.6349808 0.2337343 0.3693508
Feature 2 0.8735528 0.9570482 0.9706203 0.4088109 0.7971789 0.6908775 0.6616025
Feature 3 0.1999377 0.4515583 0.3964722 0.4689658 0.2959280 0.5885919 0.5062330
Feature 4 0.6267964 0.5877425 0.4874545 0.1158420 0.1532040 0.8464301 0.8808116

$z.stat
Feature 1   Feature 2   Feature 3  Feature 4  Feature 5  Feature 6
Feature 1 -0.6336527 -0.24337417 -0.24039385  1.2902741  0.4747281  1.1907944
Feature 2 -0.1591473 -0.05385807 -0.03683033  0.8259880  0.2569998  0.3976645
Feature 3  1.2817292 -0.75281958 -0.84793845 -0.7241628 -1.0452055 -0.5408777
Feature 4 -0.4862408  0.54211040  0.69436309 -1.5724682  1.4283052 -0.1936754
Feature 7
Feature 1  0.8976900
Feature 2  0.4377017
Feature 3 -0.6647146
Feature 4  0.1499404

$q.values
          Feature 1 Feature 2 Feature 3 Feature 4 Feature 5 Feature 6 Feature 7
Feature 1  4.115114  4.018890  3.855147  7.186497  3.861521  5.117088  5.775791
Feature 2  3.824895  3.880078  3.794563  4.972220  4.155343  3.781303  3.811658
Feature 3  5.471483  4.942928  5.424918  4.666796  5.398899  4.026842  4.262629
Feature 4  4.035971  4.289100  4.446551 12.680502  8.385145  3.860559  3.708345

$sim.score
Feature 1  Feature 2  Feature 3  Feature 4  Feature 5  Feature 6
Feature 1 -0.28571429  0.1038961  0.1515152  0.1601732  0.1515152  0.3809524
Feature 2 -0.31168831  0.3593074 -0.0952381  0.3593074 -0.0952381  0.3809524
Feature 3  0.58874459 -0.5887446 -0.5887446 -0.3333333 -0.5887446 -0.3116883
Feature 4 -0.07792208 -0.0952381  0.1515152 -0.5324675  0.3809524 -0.0952381
Feature 7
Feature 1  0.20779221
Feature 2  0.16017316
Feature 3 -0.07792208
Feature 4  0.16017316
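
Each element of the result is a matrix, so ordinary matrix indexing can be used to find the feature pair of interest. The sketch below copies the p-values from the output above into a matrix and locates the smallest one; with a real run you would use out$p.values in place of p. It assumes, based on the 4 × 7 shape of the output, that rows correspond to the features of test.input and columns to the features of test.input.2. The names p, best and best_pair are illustrative only.

```r
# p-values copied from the $p.values output above (assumed layout:
# rows = features of test.input, columns = features of test.input.2).
p <- matrix(
  c(0.5263075, 0.8077156, 0.8100249, 0.1969555, 0.6349808, 0.2337343, 0.3693508,
    0.8735528, 0.9570482, 0.9706203, 0.4088109, 0.7971789, 0.6908775, 0.6616025,
    0.1999377, 0.4515583, 0.3964722, 0.4689658, 0.2959280, 0.5885919, 0.5062330,
    0.6267964, 0.5877425, 0.4874545, 0.1158420, 0.1532040, 0.8464301, 0.8808116),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste("Feature", 1:4), paste("Feature", 1:7))
)

# Locate the smallest p-value and look up the feature names of that cell.
best <- which(p == min(p), arr.ind = TRUE)
best_pair <- c(x = rownames(p)[best[1, "row"]],
               y = colnames(p)[best[1, "col"]])
print(best_pair)
print(p[best[1, "row"], best[1, "col"]])
```

The same indexing applies to the z.stat, q.values and sim.score matrices, since they share the same row and column layout in the output above.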


For more examples, please refer to the documentation.

### 2.3 nc.score function

nc.score provides a novel similarity measure (the N-dimensional checkerboard score: NC-score), particularly appropriate to compositions derived from microbial community sequencing data. For details about the input arguments, please refer to the detailed documentation.

For the purpose of this tutorial we will run nc.score on two datasets.

• Open R
• Run the following command to import the library ccrepe
> library(ccrepe)

• For your reference the input datasets are below:

test.input

            Feature 1  Feature 2  Feature 3  Feature 4
Sample 1   0.53098625 0.24945178 0.16516569 0.05439628
Sample 2   0.11334774 0.32356694 0.38591054 0.17717477
Sample 3   0.22339983 0.12784189 0.24400540 0.40475287
Sample 4  -0.56292940 0.72177457 0.45731308 0.38384175
Sample 5   0.06740686 0.01687197 0.79829941 0.11742176
Sample 6  -0.39967644 0.11066224 0.38134556 0.90766864
Sample 7   0.52663095 0.29204997 0.03995832 0.14136075
Sample 8   0.63055974 0.31210092 0.03521166 0.02212769
Sample 9   0.08308327 0.05428329 0.84239772 0.02023572
Sample 10  0.55629625 0.30391172 0.11810698 0.02168505


test.input.2

          Feature 1  Feature 2   Feature 3  Feature 4
Sample 1  0.4856505 0.25517410 0.004001302 0.25517410
Sample 2  0.4009346 0.21260883 0.173847734 0.21260883
Sample 3  0.2622234 0.19282519 0.352126182 0.19282519
Sample 4  0.2156559 0.12046793 0.543408230 0.12046793
Sample 5  0.2821932 0.07223021 0.573346348 0.07223021
Sample 6  0.4216509 0.24159265 0.095163833 0.24159265
Sample 7  0.3919592 0.21705939 0.173921984 0.21705939
Sample 8  0.5283681 0.22512167 0.021388590 0.22512167
Sample 9  0.5373268 0.17106386 0.120545436 0.17106386
Sample 10 0.2697604 0.28053863 0.169162368 0.28053863

• Run the following command to calculate the NC-score for the two datasets test.input and test.input.2.
> out2 <- nc.score(x = test.input, y = test.input.2)

• The out2 variable will contain the NC-scores for the two datasets.

The output is shown below for your reference:

           Feature 1   Feature 2  Feature 3   Feature 4
Feature 1         NA  0.38095238 -0.7489177 -0.58874459
Feature 2  0.3809524          NA -0.3809524 -0.07792208
Feature 3 -0.7489177 -0.38095238         NA  0.38095238
Feature 4 -0.5887446 -0.07792208  0.3809524          NA
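
The matrix shown above is symmetric with NA on the diagonal, so each feature pair appears twice. A minimal sketch of how one might pick out the strongest association (largest absolute NC-score) while counting each pair once is given below. The values are copied from the output above; with a real run you would use out2 in place of nc. The names nc, upper and strongest are illustrative only.

```r
# NC-scores copied from the output above.
nc <- matrix(
  c(        NA,  0.38095238, -0.7489177, -0.58874459,
     0.3809524,          NA, -0.3809524, -0.07792208,
    -0.7489177, -0.38095238,         NA,  0.38095238,
    -0.5887446, -0.07792208,  0.3809524,          NA),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste("Feature", 1:4), paste("Feature", 1:4))
)

# Keep only the upper triangle so that each pair is considered once.
upper <- nc
upper[!upper.tri(upper)] <- NA

# Find the cell with the largest absolute score, ignoring NA entries.
strongest <- which(abs(upper) == max(abs(upper), na.rm = TRUE), arr.ind = TRUE)
print(strongest)
print(upper[strongest[1, "row"], strongest[1, "col"]])
```

Taking the absolute value treats strong negative and strong positive scores alike; drop abs() if only positive associations are of interest.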


For more examples, please refer to the documentation.