Improvements to the genome-wide chromatin conformation analysis techniques enabled scientists explore the chromatin structure at an unprecedent scale. labs from different backgrounds are starting to adopt the high-resolution Hi-C protocols in their studies, which led to an increasing amount of available Hi-C data.
With more data available, scientists would like perform comparative analysis between the available Hi-C samples, i.e: finding differential chromatin interactions (DCI).
Previously developed differential chromatin interaction detection techniques were designed with a low-resolution Hi-C in mind (40kb or more). Thus, they based their models on the assumption of the independence between the different interacting pairs (Shown in Fig.1a), which is logical as the low-resolution Hi-C maps capture interactions between relatively distant chromatin fibers.
However, when dealing with high-resolution Hi-C (less than 10kb or so) the independence assumption between the adjacent “sequares” in the Hi-C matrix does not hold, due to the polymer nature of the chromatin that imply the dependence between adjacent chromatin fibers (Fig.1b).
To overcome this issue, we developed a new differential chromatin interaction detection technique that takes into consideration the dependency between neighboring loci. As depicted in this Fig.1b, when there is a change in the interaction frequency between the bins \((i,j)\) we expect this movement to be reflected on the interaction frequencies in the bins around it.
FIND
This part maybe boring, but I put it as summary for people who want to try the method but didn't go through the paper yet.
FIND
is designed for problems in which we have two biological conditions \(c\in \{1,2\}\), each with a certain number of Hi-C replicates \(\{n_c\}\).
FIND
is based on the simple idea that the establishment of a chromatin loop between two loci \(i\) and \(j\), a “mountain shape” structure should be formed (due to the spatial dependency) in the 3D space defined by the 2D coordinates of the loci and their interaction frequency. If there is a change in the interaction frequency, the height of the tip of the mountain (shown in Fig.2 as \(\mu_1\)) will change.
This change, will also influence the height of the points located in the neighborhood of \(\mu_1\) to change also. If this change is significance, the density of the points around \(\mu_1\) should change between the two biological conditions.
To estimate this change, we first define a window of size \(W\) around our point of interest \((i,j)\). For example if the window is of size 3, it will include all the \(W^2\) points located in the coordinates defined by the Cartesian product \(\{i-1,i,i+1\} \times \{j-1,j,j+1\}\). In this annimation generated using conv_arithmetic we show how FIND scans through the Hi-C matrix using a window of size 3.
For the pairwise interactions located at the borders of the matrix, only the neighbours located in the matrix. So for the interaction (1,1) only 4 neighbours will be considered.
Then, we compare the content of this window between the two conditions.
To do this comparison, we fist rank the bins in each window and in each replicate according to their 3D distance to the pairwise interaction \((i,j,\mu_1)\), such as \(\mu_1\) is mean interaction frequency of \((i,j)\) in the first condition.
Let \(x_{n,k}^{(c )}\) be the distance from the \(k^{th}\) nearest neighbor to \(\mu_1\) in the \(n^{th}\) replicate of condition \(c\). The probability to observe the k-nearest neighbour at a distance \(x_{n,k}^{(c )}\) is euqal to (Burguet et al. 2009, 2011; Burguet and Andrey 2014): \[f(x_{n,k}^{(c )})= \frac{(4\lambda^{(c )}\pi)^k}{3^{k-1}(k-1)!} (x_{n,k}^{(c )})^{k-1} exp(\lambda^{(c )}\frac{4\pi}{3}x_{n,k}^{(c )})\]
The equation, looks a bit confusing, but if you read it carefully, you can notice that the only unkown is \(\lambda^{(c )}\), which represents the estimated density of the \(k^{th}\) nearest neighbors around \(\mu_1\) in the replicates of the condition \(c\). To estimate \(\lambda^{(c )}\), we just need to calculate the maximum likelihood given by this equation: \[\lambda^{( c )}(\mu_1)=\frac{n_ck-1}{\frac{4\pi}{3}\sum_{n=1}^{n_c}(x_{n,k}^{(c )})^3}\]
Using this formula, we can then estimate the density of the k-nearest neighbours (KNN) in the 1st condition around \(\mu_1\), \(\lambda^{(1)}(\mu_1)\), and the density of the KNN of the 2nd condition around \(\mu_1\), \(\lambda^{(2)}(\mu_1)\). The ratio between these two densities \(R_k(\mu_1)\) will tell us, if the KNNs change their distribution around \(\mu_1\) or not. It can be shown that this ratio follows a Fisher distribution with \(2n_1k\) and \(2n_2k\) degree of freedome: \[R_k(\mu_1) \sim FS(2n_1k,2n_2k)\]
Thus, we can calculate for each KKN a p-value that indicate it shows a differential distribution around \(\mu_1\) or not. To avoid problems related to the variability between replicates, we also use \(\mu_2\) (the mean frequency of \((i,j)\) in the second condition), and do the same calculations. Hence, at the end we get \(2W^2\) p-values (\(W^2\) p-values using \(\mu_1\) as a reference and \(W^2\) p-values using \(\mu_2\) as reference). We need then to combine these p-value to calculate a final combined p-value.
Different p-value combination methods exist; however, each p-value combination method has a different hypothesis setting. Some are very stringent and require all the p-values to be significant, others are very loose and require one or more p-values to be significant in order for the combined p-value to be significant. For a short review, you can check (Lun-Ching Chang et al, 2013, BMC Bioinformatics). In our case, we use the r-th ordered p -value (rOP) which requires that at least \(r\) out the \(2W^2\) p-values to be significant in order for the combined p-value to be significant. This statistic will give us more flexibility as the user can select the level of stringency he wants to apply.
to accelerate data processing and use as less memory as possible, FIND
requires the Matrix
package. For visualization, in this vignette we prefer to use the rasterVis
package. The mvtnorm
is needed only in the simulation. We also need the gridExtra
to arrange the plots.
require(FIND)
require(Matrix)
require(mvtnorm)
library(raster)
require(rasterVis)
require(gridExtra)
require(HiTC)
require(edgeR)
require(ggsci)
To demonstrate the functionalities of FIND
, we use this simulation example in
which we load a part of the K562 matrix generated by Rao et al, 2014 and use it
as a base to generate two Hi-C matrices.
In this example, we will generate 2 replicates in each condition. The simulated pairwise interactions will have a 0.1% chance to be diffrential, with 40% of them will be showing a intrease of 5 fold-change
in the interaction frequency in the second condition. We remove the noisy interactions that have a dispersion larger than 0.1. This can be done using the generateSimulation
function
data(K562_matrix_small)
simRes <- generateSimulation(K562_matrix_small[1:200, 1:200], pDiff = 0.01, pUp = 0.6,
foldDiff = 3, extreme.dispersion = 0.1, nrep1 = 2, nrep2 = 2)
## Preparing dataset.
## Sampling.
## Calculating differential expression.
## Simulating data.
The generateSimulation
function returns a list that contains: the simulated Hi-C samples and a binarry matrix that contains the position of the DCIs.
simRes
## SimulatedHiC object
## Condition 1 with 2 replicates
## Condition 2 with 2 replicates
## Simulated matrix size (200,200)
For example, we can use the code bellow to display the simulated data
p1 <- levelplot(as.matrix(simRes@Hic1[[1]]), par.settings = YlOrRdTheme, main = "Cond1 Rep1")
p2 <- levelplot(as.matrix(simRes@Hic1[[2]]), par.settings = YlOrRdTheme, main = "Cond1 Rep2")
p3 <- levelplot(as.matrix(simRes@Hic2[[1]]), par.settings = YlOrRdTheme, main = "Cond2 Rep1")
p4 <- levelplot(as.matrix(simRes@Hic2[[2]]), par.settings = YlOrRdTheme, main = "Cond2 Rep2")
p5 <- levelplot(as.matrix(simRes@trueStates), par.settings = YlOrRdTheme, main = "DCIs position")
# Here we use ncol=1 due to the limite width of page, if you have a large
# screen you can put 2 plots in a column
grid.arrange(p1, p2, p3, p4, p5, ncol = 2)
Instead you can just display the log2 fold-change between the samples
## Plot the log2 Fold-change between the two conditions
m1 <- simRes@Hic1[[1]] + simRes@Hic1[[2]]
m2 <- simRes@Hic2[[1]] + simRes@Hic2[[2]]
levelplot(as.matrix(log2((m1 + 0.01)/(m2 + 0.01))), par.settings = BuRdTheme, main = "log2(Cond1/cond2)")
In many cases you want to try some newly published Hi-C normalization methods or you developped your own Hi-C processing methods, then you'll probably load it as an R matrix. In this case you can directly call the functin getDiff
.
In this example, we will continue with the simulated Hi-C data. Here, we pass two lists of Sparce-matrices and we specifiy that we will be using a window of 3 and we want to calculate the p-value on the
DCis <- getDCIs_fromSim(x = simRes, windowSize = 3, alpha = 0.7, method = "hardCutoff",
qvalue = 0.01, isrOP_qval = FALSE)
## Warning in getDCIs_fromMat(x@Hic1, x@Hic2, windowSize, method, alpha, qvalue, :
## Quantile regression can be very slow using hard clusting instead
equivantly, if we have two list of matrices we could have loaded them using the getDCIs_fromMat
DCis <- getDCIs_fromMat(Hic1 = simRes@Hic1, Hic2 = simRes@Hic2, windowSize = 3, alpha = 0.7,
method = "hardCutof", qvalue = 0.01, isrOP_qval = FALSE)
We can plot the q-values
p1 <- levelplot(as.matrix(-log10(DCis@qvals)), par.settings = YlOrRdTheme, main = "-log10(qvalue) after cutoff")
p2 <- levelplot(as.matrix(DCis@DCIPos), par.settings = YlOrRdTheme, main = "Predicted DCIs")
p3 = levelplot(as.matrix(log2((m1 + 0.01)/(m2 + 0.01))), par.settings = BuRdTheme,
main = "log2(Cond1/cond2)")
grid.arrange(p3, p1, p2, ncol = 2)
After p-value correction diagonal-wise, we select only the regions that cluster together, to avoid any noisy regions. Probably it is not a very statistical way that doing this, as we are not really sure how it controls for false-positives, but it tends to select regions that show a big structural change. Further down-stream analysis should be done.
We can see that it is a hard to select a cutoff, thus, it is better to use a kind of a semi-automatic way to select the q-value. One solution is to use quantile regression to select the best cutoff given the distance.
Bellow we show the distribution of the q-values given the different distances (THIS IS DEPRECATED AS IT IS VERY SLOW APROACH)
smry <- summary(DCis@qvals)
smry$x <- -log10(smry$x)
smry$dist <- smry$j - smry$i
## remove very low qvals as they are the majority smry <- subset(smry , x>
## -log10(pbeta(0.1,10,18-10+1)) )
o <- order(smry$dist)
smry <- smry[o, ]
X <- model.matrix(smry$x ~ splines::bs(smry$dist, df = 9))
smoothScatter(smry$dist, smry$x, xlab = "Distance", ylab = "-log10(qvalue)")
quants = seq(0.9, 0.99, 0.01)
cols = (ggsci::pal_d3())(length(quants))
for (j in 1:length(quants)) {
thr <- quantreg::rq(x ~ splines::bs(dist, df = 9), data = smry, tau = quants[j])
thr.fit <- X %*% thr$coef
smry$threshold <- thr.fit
lines(smry$dist, smry$threshold, col = cols[j], lwd = 2)
}
legend(max(smry$dist) - 20, max(smry$x) - 20, legend = quants, col = cols, lty = 1,
cex = 0.5)
As in this simulated examples, the DCIs were generated independently from the interaction span, we notice that the different quantiles show a kind off uniform distribution, indicating the independence.
If the quantile regression is selected (method = "quantreg"
), the 99th percentile is selected by defaults.
We can select a lower cutoff using the quantile
parameter.
DCis <- getDCIs_fromSim(x = simRes, windowSize = 3, alpha = 0.7, method = "quantreg",
quantile = 0.99)
## Warning in getDCIs_fromMat(x@Hic1, x@Hic2, windowSize, method, alpha, qvalue, :
## Quantile regression can be very slow using hard clusting instead
We can compare our results to true location of DCIs.
compareWithTrueDCI(DCis@DCIPos, simRes@trueStates)
DCis_par = getDCIs_parallel(Hic1 = simRes@Hic1,
Hic2 = simRes@Hic2,
windowSize = 3,
alpha = 0.7,
chunk_size=100,
method = "quantreg",
quantile = 0.99,
qvalue = 1e-6,
isrOP_qval = F,
nbProcessors = 4,
verbose = TRUE)
Here we want to make sure that the qvalues calculated in the paraallel version are similar to the ones obtaines in the sequential version.
smry_par = summary(DCis_par@qvals)
smry_seq = summary(DCis@qvals)
colnames(smry_par)[3] = "qval_par"
colnames(smry_seq)[3] = "qval_seq"
mrg = merge(smry_par, smry_seq, all=TRUE)
# a few sites will show some difference, generally very slight at the decimal positions.
nrow(subset(mrg, qval_par != qval_seq))
FIND
also provides a function to perform genomewide diffrential chromatin
interaction detection. The internals of the function depend on the HiTC
package.
For the moment we support matrix files that are in the same format as the one
generated by HiC-Pro
.
Two types of files are needed:
151 151 68.372895
151 152 51.179648
151 153 37.575510
151 154 17.432925
151 155 8.201022
151 156 18.941581
chr1 0 20000 1
chr1 20000 40000 2
chr1 40000 60000 3
chr1 60000 80000 4
chr1 80000 100000 5
chr1 100000 120000 6
Suppose we have a folder with 4 Hi-C matrices, we can first load them using
the loadHicExperiment
function:
matrices_lst <- as.list(list.files("HiC_results/",pattern = "*iced.matrix"))
binIndices = "MM9_5000_abs.bed"
gw_interactions <- loadHicExperiment(matrices_lst,binIndices = binIndices,
groups =c(1,1,0,0),type = "HicPro",
genome = "mm9")
Next, we can run the differential analysis. Chromosomes are processed sequenciantially,
but the interaction matrix of each chromosome is processed sepraratly.
This choise is to euilibrate the tradeoff between speed and memory,
we can parallalize everything, but the number of processes will explode.
If you are running of a cluster though, you can generate a HicExperiment
object
for each interaction, and processes them in different nodes.
Generally a chunk_size
of 300~500 bins, gives a good trade off. You fist test
on the smaller chromosomes and pick the paramters that suits you.
gw_interactions <- getDCIs_HicExpr(gw_interactions,
windowSize = 3,
alpha = 0.7,
chunk_size=500,
method = "quantreg",
quantile = 0.99,qvalue = 1e-6,isrOP_qval = F,
nbProcessors = 8,verbose = TRUE)
the .hic
format is one of the widely adopted chromatin interaction storage
format. Hopefully, Aiden's group have developed a wrapper function
straw available in their github repository.
We took advantage of the available C++ wrapper and interated it into FIND
,
unfortunatly due to the way C++ long
type are codded between windows and linux,
the code does not work in windows and may lead to a system crash.
Bellow is an example showing how deal with such data using the readJuiceBoxFormat
function provided by FIND
.
```r
# We need to know the chromosomes lengths
require(BSgenome.Hsapiens.UCSC.hg19)
seqlen = seqlengths(BSgenome.Hsapiens.UCSC.hg19)[1:22]
# We load two K562 replicates and GM12878
hic_mats = list(K562_reps1 = "GSM1551620_HIC071_30.hic", K562_reps2 = "GSM1551623_HIC074_30.hic",
GM12878_reps1 = "GSM1551574_HIC025_30.hic", GM12878_reps2 = "GSM1551575_HIC026_30.hic")
As it have been reported in many diffrential chromatin interaction analysis,
such as in diffHic
, regioins with Copy Number Variation are hard to interprete
and the observed different might be just due to CNV. One can use some methods
to correct for this bias, or can ignore these regions directly.
Here we will just remove these regions. We will use the already calculated CNV regions for the K562 downloaded from GSM999287
## We use this function as a proxy to call
fct = function(chrom,seqlen, hicfile, resolution){
intdata = readJuiceBoxFormat(hicfile,
chromosome=chrom,
normMethod="VC_SQRT",
resolution=resolution)
smry <- summary(intdata)
bins <- tileGenome(seqlen[chrom],tilewidth=resolution, cut.last.tile.in.chrom=TRUE)
bins$name <- 1:length(bins)
# correct size
intdata <- sparseMatrix(i=smry$i,j=smry$j,x=smry$x, dims=c(length(bins),length(bins)))
HTCexp(intdata, bins, bins)
}
## Read the list of .hic matrices
hic_matrices = list()
for(hic in names(hic_mats)){
# here we are using chromosom 22, just to test
chroms =names(seqlen)[22]
hic_lst = mclapply(chroms,fct,
hicfile= hic_mats[[hic]],
resolution=5000,
seqlen=seqlen,mc.cores=4
)
hic_matrices[[hic]] = HTClist(hic_lst)
}
## We can use the MA-plot technique to do a joint normalization between replicates
# define experiment design
groups=c(0,0,1,1)
# Create a HicExperiment object
hicCol <- new("HicExperiment",
hic_matrices= hic_matrices,
groups=groups,
DCIs=list(),
genome="hg19")
Ideally, we should do inter-replicates normalization
otherwise, many highly variable regions will be reported as DCI, some packages
such as HiCcompare
already implement the inter-replicates normalization using
the M-Distance plot for normalization. Another good strategy used by HiCcompare
,
is the weighted normalization of the M values (log2(f1/f2)
) using the A values
(log2((f1+f2)/2)
) as low frequency values tend to be very variable and noisy.
# lunch parallel processing.
hicCol <- getDCIs_HicExpr(HicExpr=hicCol,
windowSize = 3,
alpha = 0.7,
chunk_size=500,
method = "quantreg",
quantile = 0.99,
qvalue = 1e-6,
isrOP_qval = F,
nbProcessors = 8,
verbose = TRUE)
One concern when dealing with Hi-C replicates is the replicability of the results. In our manuscript we used the MA-normalization, which is ok, but it was not designed for the Hi-C context. Hopefully, some new strategies are getting developped to deal with this concern, for example we can refer the user to methods such as :
For more details you can the preprint by Nobel's group (https://www.biorxiv.org/content/early/2018/02/05/188755)
Copy number variation can be a source of bias in Cancer cell samples and can introduce many false positive differential interactions. One option is to try to correct for this bias using tools shuch as :
Hi-C matrices are very large, their processing can take time and space. When
developping FIND
, we tried to uses different tricks to improve the processing
seed. The usage of a the sparse matrix class dgCMatrix
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Red Hat Enterprise Linux
##
## Matrix products: default
## BLAS/LAPACK: /research/rgs01/home/clusterHome/mdjekide/miniconda3/envs/r_4.1.1/lib/libopenblasp-r0.3.18.so
##
## locale:
## [1] C
##
## attached base packages:
## [1] stats4 parallel grid stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] RColorBrewer_1.1-3 dplyr_1.0.10 ggsci_2.9
## [4] edgeR_3.36.0 limma_3.50.0 gridExtra_2.3
## [7] rasterVis_0.51.2 lattice_0.20-45 raster_3.4-13
## [10] sp_1.5-0 mvtnorm_1.1-3 FIND_1.0.3
## [13] HiTC_1.36.0 GenomicRanges_1.46.0 GenomeInfoDb_1.30.0
## [16] IRanges_2.28.0 S4Vectors_0.32.0 BiocGenerics_0.40.0
## [19] doParallel_1.0.17 iterators_1.0.14 foreach_1.5.2
## [22] data.table_1.14.4 zlibbioc_1.40.0 bigmemory_4.6.1
## [25] Matrix_1.5-1 png_0.1-7 BiocStyle_2.22.0
##
## loaded via a namespace (and not attached):
## [1] bigmemory.sri_0.1.6 colorspace_2.0-3
## [3] RcppEigen_0.3.3.9.2 rjson_0.2.21
## [5] class_7.3-19 XVector_0.34.0
## [7] hexbin_1.28.2 listenv_0.8.0
## [9] MatrixModels_0.5-0 prodlim_2019.11.13
## [11] fansi_1.0.3 lubridate_1.8.0
## [13] codetools_0.2-18 splines_4.1.1
## [15] knitr_1.36 readbitmap_0.1.5
## [17] pROC_1.18.0 Rsamtools_2.10.0
## [19] caret_6.0-92 BiocManager_1.30.16
## [21] compiler_4.1.1 assertthat_0.2.1
## [23] bmp_0.3 fastmap_1.1.0
## [25] cli_3.4.1 formatR_1.12
## [27] htmltools_0.5.3 quantreg_5.86
## [29] tools_4.1.1 igraph_1.3.5
## [31] gtable_0.3.1 glue_1.6.2
## [33] GenomeInfoDbData_1.2.7 reshape2_1.4.4
## [35] Rcpp_1.0.9 Biobase_2.54.0
## [37] vctrs_0.5.0 Biostrings_2.62.0
## [39] nlme_3.1-153 conquer_1.2.1
## [41] rtracklayer_1.54.0 timeDate_4021.104
## [43] gower_1.0.0 xfun_0.28
## [45] stringr_1.4.1 globals_0.16.1
## [47] lifecycle_1.0.3 restfulr_0.0.13
## [49] XML_3.99-0.12 terra_1.5-21
## [51] future_1.28.0 zoo_1.8-11
## [53] MASS_7.3-54 scales_1.2.1
## [55] ipred_0.9-13 MatrixGenerics_1.6.0
## [57] SummarizedExperiment_1.24.0 SparseM_1.81
## [59] yaml_2.3.6 ggplot2_3.3.6
## [61] rpart_4.1-15 latticeExtra_0.6-29
## [63] stringi_1.7.6 imager_0.42.13
## [65] highr_0.9 BiocIO_1.4.0
## [67] tiff_0.1-10 hardhat_1.2.0
## [69] BiocParallel_1.28.0 lava_1.6.10
## [71] rlang_1.0.6 pkgconfig_2.0.3
## [73] matrixStats_0.62.0 bitops_1.0-7
## [75] evaluate_0.14 purrr_0.3.5
## [77] GenomicAlignments_1.30.0 recipes_1.0.1
## [79] tidyselect_1.2.0 parallelly_1.32.1
## [81] plyr_1.8.7 magrittr_2.0.3
## [83] R6_2.5.1 generics_0.1.3
## [85] DelayedArray_0.20.0 DBI_1.1.3
## [87] pillar_1.8.1 withr_2.5.0
## [89] survival_3.2-13 RCurl_1.98-1.9
## [91] nnet_7.3-16 tibble_3.1.8
## [93] future.apply_1.9.1 crayon_1.5.2
## [95] uuid_1.0-3 utf8_1.2.2
## [97] rmarkdown_2.11 jpeg_0.1-9
## [99] locfit_1.5-9.4 ModelMetrics_1.2.2.2
## [101] digest_0.6.30 munsell_0.5.0
## [103] viridisLite_0.4.1