orange-bioinformatics / docs / reference / obiGsea.htm


<title>obiGsea: Gene Set Enrichment Analysis</title>
<link rel=stylesheet href="style.css" type="text/css">
<link rel=stylesheet href="style-print.css" type="text/css" media=print>

<h1>obiGsea: Gene Set Enrichment Analysis</h1>
<index name="modules/gene set enrichment analysis">

<p>Gene Set Enrichment Analysis (GSEA) is a method which tries to identify groups of genes that are
regulated together. It is implemented in module obiGsea, which is included in Orange for Functional Genomics package.
To use obiGsea you need to install Orange for Functional Genomics.</p>

<p>GSEA takes gene
expression data for multiple samples with their phenotypes and computes
gene set enrichment for given gene sets. To use it run
<code>runGSEA</code> method with the following arguments:</p>

<index name="gsea">
<index name="GSEA">
<index name="gene+set">

<p class=section>Arguments</p>
<dl class=arguments>

  <dd>An <A href="ExampleTable.htm"><CODE>ExampleTable</CODE></A> with gene expression data.</dd>

  <dd>An initialized gene matcher.</dd> 

  <dd>A pair of class values describing phenotypes that are chosen as two distinct phenotypes on which gene correlations
  are computed. Only examples with one of chosen class values are considered for analysis. If not specified, all in <code>classVar</code> attribute descriptor are used.</dd>

  <dd>A python dictionary of gene sets, where key is a gene set name which points to a list of gene aliases for genes
  in the gene set. Default: gene sets in your collection directory.</dd>

  <dd>GSEA computes gene set significance by permutation tests. This parameter specifies the number
  of permutations. Default: 100.</dd>

  <dd>Type of permutation. If <code>"class"</code>, class values (phenotypes) are permuted. This is the default. 
  However, if number of samples is small (less than 10), it is advisable to use <code>"gene"</code> permutations even
  though they ignore gene-gene interactions.</dd>

  <dt>minSize, maxSize</dt>
  <dd>Minimum and maximum number of genes from gene set also present in the data set for that gene set to be analyzed.
  Defaults: 3 and 1000.</dd>

  <dd>Minimum fraction of genes from the gene set also present in the data set for that gene set to be analyzed. Default: 0.1.</dd> 

  <dd>Number of valid attribute values for each class value needed for a certain attribute (gene) to be considered in analysis. Attributes failing to meet this criterium are ignored. Default: 3.</dd>

  <dd>Explains marking of different phenotypes. By default the <CODE>data.domain.classVar</CODE> is used to distinguish them, if it exists. If the argument is set to <code>False</code>, GSEA presumes that the genes are already ranked. If the value of <code>phenVar</code> is a string, then an entry from <code>attributes</code> dictionary of individual attributes specifies the phenotype. In latter case, each attribute represents one sample.</dd>

  <dd>Specifies locations of gene names. If <code>True</code>, gene names are attribute names. If it is a string, then they are taken from the corresponding entry from <code>attributes</code> dictionary of individual attributes. If each attribute specifies a sample, then the user should pass the meta variable containing the gene names. Defaults to attribute names if each example specifies one sample.</dd>


<!-- es, nes, pval, fdr, os, ts, genes -->

Method <code>runGSEA</code> returns a dictionary where key is a gene set label and its value a dictionary
<li> enrichment score (key <code>es</code>),
<li> normalized enrichment score (key <code>nes</code>),
<li> P-value (key <code>p</code>),
<li> FDR (key <code>fdr</code>),
<li> whole gene set size (key <code>size</code>),
<li> number of matched genes from the gene set (key <code>matched_size</code>),
<li> gene names from the data set for matched genes from the gene set (key <code>genes</code>).

<p>A note on gene name matching. Gene name matching is performed with the help of KEGG database. 
A gene from a gene set is tried to be matched with a gene from the data set. If an alias for a gene from the
gene set is the same as an alias for a gene in the data set, then those aliases are matched. If not,
it is checked if gene alias from the gene set and gene alias from the data set are both gene
aliases of the same gene according to KEGG database for a given organism. If they are, we have a match.</p>

<h3>Example 1</h3>

<p>We present a simple usage example. Data used here are not gene expression
data. For the method to work we had to specify our own sets of attributes that seem to "belong together".</p>

<p class="header"><a href=""></a> (uses <a href=

<xmp class=code>import orange, obiGsea, obiGene

data = orange.ExampleTable("iris")

gen1 = dict([
    ("sepal",["sepal length", "sepal width"]),
    ("petal",["petal length", "petal width", "petal color"])

res = obiGsea.runGSEA(data, matcher=obiGene.matcher([]), minSize=2, geneSets=gen1)
print "%5s  %6s %6s %s" % ("LABEL", "NES", "P-VAL", "GENES")
for name,resu in res.items():
    print "%5s  %6.3f %6.3f %s" % (name, resu["nes"], resu["p"], str(resu["genes"]))

<p>Corresponding output:</p>

<xmp class=code>LABEL     NES  P-VAL GENES
petal  -1.117  0.771 ['petal length', 'petal width']
sepal   1.087  0.630 ['sepal length', 'sepal width']

<p>We can see that a "gene" labeled "petal color" was not used, because it couldn't be matched to any attribute in the data set.</p>

<h3>Example 2: using correlation data directly</h3>

GSEA can also directly use correlation data between individual genes and a phenotype. Two kinds of input data trigger this functionality: (1) input data with only one example (attribute names are gene names) or (2) there is only one continuous attribute in the given data set's domain. In latter case gene correlations are read from the continuous attribute and the gene names are taken from the first <code>StringVariable</code> in the domain. We have choosen the latter representation for this example, since this is the formatting returned by <code>obiDicty</code> module. Only results for ten pathways are listed.

<p class="header"><a href=""></a></p>

<xmp class=code>import obiDicty
import obiGeneSets
import obiGsea
import orange
import obiGene

dbc = obiDicty.DatabaseConnection()
data = dbc.getData(sample='pkaC-', time="8")[0] #get first chip

print "First 10 examples"
for ex in data[:10]:
    print ex


genesets =  obiGeneSets.collections([":kegg:ddi"])
res = obiGsea.runGSEA(data, matcher=matcher, minPart=0.05, geneSets=genesets,

print "GSEA results"
print "%-40s %6s %6s %6s %7s" % ("LABEL", "NES", "P-VAL", "SIZE", "MATCHED")
for name,resu in res.items()[:10]:
    print "%-40s %6.3f %6.3f %6d %7d" % (name[:30], resu["nes"], resu["p"],
        resu["size"], resu["matched_size"])

<p>Corresponding output:</p>

<xmp class=code>First 10 examples
[-0.055], {"DDB":'#*'}
[0.003], {"DDB":'#17S_ribosomal_gene'}
[-0.168], {"DDB":'#B-elongation_factor'}
[-0.134], {"DDB":'DDB_G0272032'}
[0.330], {"DDB":'DDB_G0291982'}
[0.229], {"DDB":'DDB_G0268264'}
[0.108], {"DDB":'DDB_G0293286'}
[-0.130], {"DDB":'DDB_G0284295'}
[0.189], {"DDB":'DDB_G0285435'}
[0.082], {"DDB":'DDB_G0287595'}
GSEA results
LABEL                                       NES  P-VAL   SIZE MATCHED
[KEGG] Alanine and aspartate m            1.367  0.146     21      12
[KEGG] Cysteine metabolism                1.328  0.098      8       3
[KEGG] Folate biosynthesis                0.703  0.862     12       4
[KEGG] 3-Chloroacrylic acid de           -0.734  0.750      4       3
[KEGG] Porphyrin and chlorophy           -0.990  0.517     14       5
[KEGG] Drug metabolism - other            0.951  0.545     15      12
[KEGG] Sulfur metabolism                 -0.538  0.980      6       3
[KEGG] N-Glycan biosynthesis              1.384  0.111     22       6
[KEGG] Terpenoid biosynthesis            -1.316  0.125      7       3
[KEGG] Aminoacyl-tRNA biosynth           -1.105  0.292     42      12

<h3>Example 3: attributes as samples (from Gene Expression Omnibus)</h3>

Data sets where attributes represents samples can also be used, if <code>phenVar</code> and <code>geneVar</code> parameters are specified. 
Only results for top ten pathways are listed. To obtain valid results, we would have to increase the number of data set permutations (parameter <code>n</code>).

<p class="header"><a href=""></a></p>

<xmp class=code>import obiGeneSets
import obiGsea
import orange
import obiGene
import obiGEO

import obiGEO
gds = obiGEO.GDS("GDS10")
data = gds.getdata() 

print "Possible phenotype descriptors:"
print map(lambda x: x[0], obiGsea.allgroups(data).items())


phenVar = "tissue"
geneVar = "gene" #use gene meta variable for gene names

genesets =  obiGeneSets.collections([":kegg:hsa"])
res = obiGsea.runGSEA(data, matcher=matcher, minPart=0.05, geneSets=genesets, 
    permutation="class", n=10, phenVar=phenVar, geneVar=geneVar)

print "GSEA results (choosen descriptor: tissue)"
print "%-40s %6s %6s %6s %7s" % ("LABEL", "NES", "FDR", "SIZE", "MATCHED") 
for name,resu in sorted(res.items(), key=lambda x: x[1]["fdr"])[:10]: 
    print "%-40s %6.3f %6.3f %6d %7d" % (name[:30], resu["nes"], resu["fdr"], 
        resu["size"], resu["matched_size"]) 

<p>Corresponding output:</p>

<xmp class=code>Possible phenotype descriptors:
['disease state', 'strain', 'tissue']

GSEA results (choosen descriptor: tissue)
LABEL                                       NES    FDR   SIZE MATCHED
[KEGG] DNA replication                    1.783  0.000     36      35
[KEGG] Cell cycle                         1.832  0.000    119      96
[KEGG] T cell receptor signali            2.006  0.000     94      70
[KEGG] Non-homologous end-join            1.795  0.000     14      13
[KEGG] Glycine, serine and thr           -1.828  0.022     42      33
[KEGG] Porphyrin and chlorophy           -1.829  0.028     41      24
[KEGG] Thyroid cancer                     1.666  0.036     29      24
[KEGG] Adipocytokine signaling           -1.880  0.037     67      53
[KEGG] Ubiquitin mediated prot            1.597  0.051    137     112
[KEGG] Glycosaminoglycan degra           -1.899  0.055     18      16


<P>Subramanian, Aravind   and Tamayo, Pablo   and Mootha, Vamsi  K.  and Mukherjee, Sayan   and Ebert, Benjamin  L.  and Gillette, Michael  A.  and Paulovich, Amanda   and Pomeroy, Scott  L.  and Golub, Todd  R.  and Lander, Eric  S.  and Mesirov, Jill  P. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS, 2005.</P>