# Prior Knowledge Driven Granger causality Analysis on Gene Regulatory Network Discovery#

Shun Yao1,2, Shinjae Yoo2, Dantong Yu2

1Department of Biochemistry and Cell Biology, Stony Brook University, Stony Brook, NY 11794, USA

2Computational Science Center, Brookhaven National Laboratory, Upton, NY 11793, USA

Data and Source Code link: https://bitbucket.org/dtyu/granger-causality

Our study focuses on gene regulatory network discovery from time series gene expression data. Among the different categories of methodologies, Granger causality (GC) modeling has become popular due to its efficiency and good performance. Yet traditional GC modeling methods cannot easily be applied to real biological datasets. How to improve GC modeling strategies for real biological datasets is therefore our main research direction. In this research, we propose a new method, CGC-2SPR, which addresses the lack of information in the time dimension by incorporating prior biological knowledge.

Gene regulatory network prediction methods

One of the popular approaches for predicting gene regulatory networks from time series data is the Dynamic Bayesian Network (DBN). Early studies using DBN methods were mainly based on Boolean network theories. Existing BN inference approaches include the REVEAL algorithm, the MDL algorithm, and other approaches that incorporate prior knowledge. Later studies on DBN models also use Gaussian distributions and BIC to model continuous expression values (Zou and Conzen, 2005; Zhu, 2010), which can achieve better results at a higher computational cost. DBN methods perform well, but the number of genes (network size) they can handle is quite limited due to their combinatorial time complexity.

Mutual information (MI) methods (Meyer, 2008), which are based on information theory, have also been used in a few studies. The networks resulting from MI methods are generally non-directional, although recent studies have used MI methods to generate directional networks. MI is generally a pairwise model and is therefore susceptible to noise.

Due to the increase in data size in recent years, another family of methods has come into focus: vector autoregressive (VAR) methods (Tam, 2012; Zou, 2009). One of the most popular VAR methods is Granger causal modeling, which was originally applied in economics and is now used for gene regulatory network inference. Recent studies have compared the Granger causality approach with DBN methods under different models. They found that Granger causal inference performs similarly to DBN methods at a much lower time complexity. With the growing number of genes to analyze, Granger causal inference becomes preferable due to its computational efficiency.

Introduction to Granger causality modeling

Proposed in the 1960s (Granger, 1969), Granger causality is an operational notion of causality (rather than true causality) for time-series data analysis. If there is a causal relationship between two random variables, the past values of the "cause" variable help predict the present and future values of the "effect" variable. It belongs to the family of vector autoregressive (VAR) models, which are computationally efficient even when dealing with a large number of genes.
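The idea can be illustrated with a minimal sketch. One common way to operationalize it is an F-test on nested regressions: predict the "effect" series from its own past, then add the past of the "cause" series and check how much the residual error drops. The helper names `lagged` and `granger_f_stat`, the lag order, and the toy data below are illustrative assumptions, not part of the original study.

```python
import numpy as np

def lagged(series, lag_order):
    """Columns series[t-1], ..., series[t-lag_order] for t = lag_order..T-1."""
    T = len(series)
    return np.column_stack(
        [series[lag_order - k : T - k] for k in range(1, lag_order + 1)]
    )

def granger_f_stat(cause, effect, lag_order=1):
    """F statistic comparing 'own past only' against 'own past + cause past'."""
    y = effect[lag_order:]
    ones = np.ones((len(y), 1))
    X_restricted = np.hstack([ones, lagged(effect, lag_order)])
    X_full = np.hstack([X_restricted, lagged(cause, lag_order)])
    rss = []
    for X in (X_restricted, X_full):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss.append(np.sum((y - X @ beta) ** 2))
    rss_r, rss_f = rss
    dof = len(y) - X_full.shape[1]
    return ((rss_r - rss_f) / lag_order) / (rss_f / dof)

# Toy data (assumed): y is driven by the previous value of x, not vice versa.
rng = np.random.default_rng(0)
T = 200
x = rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.8 * x[t - 1] + 0.1 * rng.normal()

f_xy = granger_f_stat(x, y)   # large: past x helps predict y
f_yx = granger_f_stat(y, x)   # small: past y does not help predict x
```

A large statistic in one direction and a small one in the other is what a directed edge looks like under this operational definition.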

Limitations on existing GC approaches

In practice, two types of Granger models are widely used: the bivariate GC model and the multivariate GC model. Both have limitations in different respects (Tam, 2012).

The key problem is the lack of data in the time dimension (the n>>T problem, where the number of genes n far exceeds the number of time points T), which is typical of real biological datasets. We therefore propose using prior knowledge to assist Granger causality modeling.
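To see why the n>>T setting is problematic, note that a lag-1 multivariate model regresses each gene on all n genes using only T-1 samples. The small sketch below uses random toy data (an assumption, not the study's dataset) to show that the design matrix cannot have rank above T-1, so ordinary least squares has no unique solution without regularization or additional information.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_time = 50, 20                     # toy sizes with n > T (assumed)
expr = rng.normal(size=(n_time, n_genes))    # rows: time points, columns: genes

X = expr[:-1]                                # lagged predictors, shape (T-1, n)
rank = np.linalg.matrix_rank(X)              # X'X has the same rank as X
n_unknowns = n_genes                         # coefficients per target gene

print(rank, n_unknowns)                      # rank is at most T-1, below n
```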

Using prior knowledge to assist Granger causality analysis

We reformulate the CGC modeling problem as an optimization problem and use regularization terms to incorporate prior knowledge into Granger causality analysis.

The prior knowledge is encoded in the W matrix, which guides the Granger causality analysis. The formulation is similar to Ridge regularization (Hoerl, 1970), so the above optimization problem has an efficient solution:

To incorporate the W matrix correctly, we propose a two-step process for GC modeling, named CGC-2SPR (CGC 2-Step Prior-knowledge-assisted Ridge). First, a standard Ridge regression is performed. Then the signs of the entries in W are set according to the Ridge regression results. Finally, the sign-corrected W is plugged into the aforementioned formula to perform the CGC-2SPR analysis.
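The two-step procedure can be sketched as follows. The exact objective is not reproduced in this text, so the sketch assumes a Ridge-like penalty of the form lam * ||B - W||^2, which shrinks coefficients toward the prior instead of toward zero and has the closed-form minimizer (X'X + lam*I)^-1 (X'Y + lam*W). That penalty, the function names, the lag-1 setup, and the toy data are all assumptions made for illustration; they should not be read as the authors' implementation.

```python
import numpy as np

def ridge(X, Y, lam):
    """Standard Ridge solution (X'X + lam*I)^-1 X'Y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

def two_step_prior_ridge(X, Y, W, lam1=1.0, lam2=1.0):
    """Hypothetical two-step sketch; the penalty lam2*||B - W||^2 is assumed."""
    # Step 1: plain Ridge regression, used only to estimate coefficient signs.
    B0 = ridge(X, Y, lam1)
    # Step 2: keep the prior magnitudes, take the signs from the Ridge estimate.
    W_signed = np.sign(B0) * np.abs(W)
    # Final solve: minimizer of ||Y - XB||^2 + lam2 * ||B - W_signed||^2.
    p = X.shape[1]
    B = np.linalg.solve(X.T @ X + lam2 * np.eye(p), X.T @ Y + lam2 * W_signed)
    return B, W_signed

# Toy lag-1 example: rows are time points, columns are genes.
rng = np.random.default_rng(0)
n_genes, n_time = 5, 30
expr = rng.normal(size=(n_time, n_genes))
X, Y = expr[:-1], expr[1:]           # predict time t from time t-1
W = np.zeros((n_genes, n_genes))
W[0, 1] = 1.0                        # assumed prior: gene 0 regulates gene 1
B, W_signed = two_step_prior_ridge(X, Y, W, lam1=1.0, lam2=2.0)
```

Here `B[j, k]` is the estimated influence of gene j at time t-1 on gene k at time t. With an all-zero W the procedure reduces to ordinary Ridge regression, which is a convenient sanity check.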

Simulation experiments

Previous studies (Zou, 2009; Tam, 2012) have used the classic five-variable model to investigate the performance of different gene regulatory network inference methods (shown in figure 3). That structure does not reflect real biological network structure, which has been described as a hierarchical network with only a few regulators and many downstream effectors.

We generated a model consisting of modularized hierarchical networks to simulate realistic conditions. The model is built on a simple 1=>3=>9 regulatory module (1 + 3 + 9 = 13 nodes), the basic unit of a three-layer hierarchical regulatory network. The basic module is repeated 60 times to form an initial network with 780 nodes. Random perturbation edges are then added to the initial network to connect different modules. A simpler network with five repeats is shown to illustrate the resulting layered structure. In addition, to simulate a real biological experiment, only half of the top regulators are randomly activated in the network. Finally, independent isolated nodes are added to the network to increase the learning difficulty. In this way, a more realistic gold standard regulatory network is generated for evaluating different gene regulatory network inference methods.
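The construction above can be sketched in a few lines. The sketch reads 1=>3=>9 as one top regulator driving three mid-level genes, each of which drives three effectors. The number of random cross-module edges (50) is an assumption, and the number of isolated nodes (220) is inferred from the stated total of 1000 genes; the function names are hypothetical.

```python
import numpy as np

MODULE_SIZE = 13  # 1 top regulator + 3 mid-level genes + 9 effectors

def build_module(offset):
    """Edges of one 1=>3=>9 module whose node ids start at `offset`."""
    top = offset
    mids = [offset + 1 + i for i in range(3)]
    edges = [(top, m) for m in mids]
    for i, m in enumerate(mids):
        edges += [(m, offset + 4 + 3 * i + j) for j in range(3)]
    return edges

def build_network(n_modules=60, n_random_edges=50, n_isolated=220, seed=0):
    rng = np.random.default_rng(seed)
    n_core = n_modules * MODULE_SIZE
    edges = set()
    for m in range(n_modules):
        edges.update(build_module(m * MODULE_SIZE))
    n_module_edges = len(edges)
    # Random perturbation edges connect nodes from different modules
    # (this simple sketch does not prevent cycles).
    while len(edges) < n_module_edges + n_random_edges:
        src, dst = rng.integers(0, n_core, size=2)
        if src // MODULE_SIZE != dst // MODULE_SIZE:
            edges.add((int(src), int(dst)))
    # Only half of the top regulators are marked as active.
    tops = np.arange(n_modules) * MODULE_SIZE
    active = rng.choice(tops, size=n_modules // 2, replace=False)
    # Isolated nodes get ids n_core .. n_core + n_isolated - 1 and no edges.
    n_nodes = n_core + n_isolated
    return n_nodes, sorted(edges), sorted(int(t) for t in active)

n_nodes, edges, active_tops = build_network()
```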

After generating the gold standard network, the expression values of genes are generated in one of two ways. For genes that receive no regulation from others in the network, the expression values are generated from a periodic AR(2) model.

Genes that are regulated by others, on the other hand, are generated by the following equation:

If gene j regulates gene k, r(j=>k)~U(-1, 1); otherwise, r(j=>k) = 0. p(j=>k) is the model order (lag) of the regulation from gene j to gene k, drawn randomly from 1 to 3. e(k, t) is random noise following the Gaussian distribution N(0, 1).

The expression values of all genes at the first three time points are drawn from the Gaussian distribution N(0, 1). The expression values at the remaining time points follow the above rules. Altogether, a simulation dataset with 1000 genes and 20 time points is generated. To apply the CGC-2SPR method, a prior knowledge graph with cliques in each module is generated to provide group information to the GC modeling (shown in the Figure).
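A minimal sketch of this data generation is given below. Because the generating equations are not reproduced in this text, two assumptions are made and should be treated as such: regulated genes follow x_k(t) = sum_j r(j=>k) * x_j(t - p(j=>k)) + e(k, t), and unregulated genes follow an AR(2) model with complex roots, x(t) = 2*rho*cos(omega)*x(t-1) - rho^2*x(t-2) + e(t), which produces damped oscillations. The values of `rho` and `omega` and the function name are illustrative.

```python
import numpy as np

def simulate_expression(n_genes, edges, n_time=20, seed=0,
                        rho=0.9, omega=np.pi / 4):
    rng = np.random.default_rng(seed)
    r = np.zeros((n_genes, n_genes))             # r[j, k]: strength of j => k
    p = np.zeros((n_genes, n_genes), dtype=int)  # p[j, k]: lag of j => k
    regulators = {}
    for j, k in edges:
        r[j, k] = rng.uniform(-1.0, 1.0)
        p[j, k] = rng.integers(1, 4)             # lag drawn from {1, 2, 3}
        regulators.setdefault(k, []).append(j)

    x = np.zeros((n_genes, n_time))
    x[:, :3] = rng.normal(size=(n_genes, 3))     # first three time points ~ N(0, 1)
    a1, a2 = 2 * rho * np.cos(omega), -rho ** 2  # assumed periodic AR(2) coefficients
    for t in range(3, n_time):
        for k in range(n_genes):
            noise = rng.normal()                 # e(k, t) ~ N(0, 1)
            if k in regulators:
                x[k, t] = sum(r[j, k] * x[j, t - p[j, k]]
                              for j in regulators[k]) + noise
            else:
                x[k, t] = a1 * x[k, t - 1] + a2 * x[k, t - 2] + noise
    return x, r, p

# Tiny example: gene 0 regulates gene 1, gene 1 regulates gene 2, gene 3 is isolated.
toy_edges = [(0, 1), (1, 2)]
x, r, p = simulate_expression(n_genes=4, edges=toy_edges)
```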

We compared the new method CGC-2SPR with other gene regulatory network inference methods, including the PGC model, Ridge regularization (Hoerl, 1970), Lasso regularization (Tibshirani, 1996), Elastic net regularization (Zou and Hastie, 2005), and the MI methods MRNET and ARACNE (Meyer, 2008). The resulting precision-recall curves are shown in the Figure.

In practice, biologists usually focus on the most significant edges; in other words, the high-precision region is of most interest. The top 1082 causal relationships are therefore selected from each model to calculate precision P, recall R, and the F1 score. The results are shown in the Table. Given that the accuracy of the prior knowledge alone is 0.075, we observed a "1+1>2" effect in the simulation experiments (0.150 > 0.046 + 0.075 = 0.121).
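The top-k evaluation can be sketched as below. Ranking by absolute score and ignoring self-loops are assumptions, as are the function name and the three-gene toy example.

```python
import numpy as np

def top_k_prf(scores, truth, k):
    """Precision, recall and F1 of the k highest-scoring directed edges.

    scores: (n, n) array, scores[j, k] is the evidence for edge j => k.
    truth:  (n, n) boolean array of gold standard edges.
    """
    n = scores.shape[0]
    off_diag = ~np.eye(n, dtype=bool)            # ignore self-loops (assumed)
    flat_scores = np.abs(scores[off_diag])
    flat_truth = truth[off_diag].astype(bool)
    top = np.argsort(-flat_scores, kind="stable")[:k]
    tp = int(flat_truth[top].sum())
    precision = tp / k
    recall = tp / int(flat_truth.sum())
    f1 = 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example with three true edges: 0 => 1, 1 => 2, 2 => 0.
scores = np.array([[0.00, 0.90, 0.10],
                   [0.20, 0.00, 0.80],
                   [0.05, 0.30, 0.00]])
truth = np.zeros((3, 3), dtype=bool)
truth[0, 1] = truth[1, 2] = truth[2, 0] = True

precision, recall, f1 = top_k_prf(scores, truth, k=2)
```

In the toy example the two highest-scoring edges are both true, so precision is 1, recall is 2/3, and F1 is 0.8.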

Experiments on yeast metabolic cycle data

We then apply our prior knowledge assisted Granger modeling to a real biological dataset with the n>>T property: the “yeast metabolic cycle” time series gene expression dataset, collected from experiments on the well-studied organism Saccharomyces cerevisiae, a.k.a. baker’s yeast (Tu et al., 2005). The dataset contains 2935 genes and 36 time points after periodic filtering. It is z-normalized before the different network inference strategies are applied.
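As a small illustration of the normalization step, the sketch below z-normalizes each gene across its time points. Normalizing per gene (rather than per time point or globally) is an assumption, and the random matrix merely stands in for the real expression data with the same shape.

```python
import numpy as np

def z_normalize(expr):
    """Z-normalize each gene (row) across time points (columns)."""
    mean = expr.mean(axis=1, keepdims=True)
    std = expr.std(axis=1, keepdims=True)
    std[std == 0] = 1.0                      # leave constant genes at zero
    return (expr - mean) / std

rng = np.random.default_rng(0)
expr = rng.normal(loc=5.0, scale=2.0, size=(2935, 36))  # placeholder data
expr_z = z_normalize(expr)
```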

Two types of prior knowledge for yeast are considered independently to assist Granger causality analysis in our study. One is “YeastNet” (Lee et al., 2007), a general functional gene association network. The other is the transcription factor (TF) binding profiles of yeast genes (Zhu et al., 2009), which are more specific to predicting gene regulatory networks. To compare the effectiveness of different methods, a gold standard is built from the functional transcriptional regulatory network generated by genome-wide knockout (KO) experiments (Hu et al., 2007).

As shown in the table, the new method CGC-2SPR has a clear advantage over all the other methods. CGC-2SPR also shows a "1+1>2" effect in the results (109 > 87 + 7 = 94). Moreover, different prior knowledge may contribute differently to the performance of CGC-2SPR. The TF binding score prior performed better on this dataset, suggesting that domain-specific prior knowledge may be better suited to CGC-2SPR.

Experimental data download link

Experimental data and results will be shared under the same Bitbucket repository.

References

Granger, C.W. (1969). Investigating causal relations by econometric models and cross-spectral methods. Econometrica: Journal of the Econometric Society, pages 424–438.

Tu, B. P., Kudlicki, A., Rowicka, M., and McKnight, S. L. (2005). Logic of the yeast metabolic cycle: temporal compartmentalization of cellular processes. Science, 310(5751), 1152–1158.

Hu, Z., Killion, P. J., and Iyer, V. R. (2007). Genetic reconstruction of a functional transcriptional regulatory network. Nature genetics, 39(5), 683–687.

Lee, I., Li, Z., and Marcotte, E. M. (2007). An improved, bias-reduced probabilistic functional gene network of baker’s yeast, Saccharomyces cerevisiae. PloS one, 2(10), e988.

Zhu, C., Byers, K. J., McCord, R. P., Shi, Z., Berger, M. F., Newburger, D. E., Saulrieta, K., Smith, Z., Shah, M. V., Radhakrishnan, M., et al. (2009). High-resolution DNA-binding specificity analysis of yeast transcription factors. Genome research, 19(4), 556–566.

Zou, C. and Feng, J. (2009). Granger causality vs. dynamic Bayesian network inference: a comparative study. BMC bioinformatics, 10(1), 122.

Tam, G. H. F., Chang, C., and Hung, Y. S. (2012). Application of Granger causality to gene regulatory network discovery. In 2012 IEEE 6th International Conference on Systems Biology (ISB), pages 232–239.

Meyer, P. E., Lafitte, F., and Bontempi, G. (2008). minet: A R/Bioconductor package for inferring large transcriptional networks using mutual information. BMC bioinformatics, 9(1), 461.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267–288.

Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67.

Zou, M. and Conzen, S. D. (2005). A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics, 21(1), 71–79.

Zhu, J., Chen, Y., Leonardson, A. S., Wang, K., Lamb, J. R., Emilsson, V., and Schadt, E. E. (2010). Characterizing dynamic changes in the human blood transcriptional network. PLoS computational biology, 6(2), e1000671.