1. biolab
  2. Untitled project
  3. orange-bioinformatics


orange-bioinformatics / docs / reference / obiGEO.htm

<LINK REL=StyleSheet HREF="style.css" TYPE="text/css">
<LINK REL=StyleSheet HREF="../style-print.css" TYPE="text/css" MEDIA=print></LINK>

<h1>obiGEO: an interface to NCBI's Gene Expression Omnibus</h1>

<index name="NCBI">
<index name="Gene Expression Omnibus">
<index name="microarray data sets">

<p>obiGEO provides an interface
to <a href="http://www.ncbi.nlm.nih.gov/">NCBI</a>'s 
<a href="http://www.ncbi.nlm.nih.gov/geo/">Gene Expression Omnibus</a>
repository. Currently, it only supports
<a href="http://www.ncbi.nlm.nih.gov/sites/GDSbrowser">GEO
DataSets</a> information querying and retreival.</p>


<p><INDEX name="classes/GDSInfo (in obiGEO)">GDSInfo is the class that
    can be used to retreive the infomation about
    <a href=http://www.ncbi.nlm.nih.gov/sites/GDSbrowser>GEO Data
    Sets</a>. The class accesses the Orange server file
    that either resides on the local computer or is
    automatically retreived from Orange server. Notice that the call
    of this class does not access any NCBI's servers directly.</p>

<p class=section>Methods</p>
<dl class=attributes>
<dd><p>Constructor returning the object with GEO DataSets
  information. If <code>force_update</code> is set
  to <code>True</code>, the constructor will download GEO DataSets
  information file (gds_info.pickled) from Orange server, otherwise,
  it will first check if the local copy exists. The object returned
  behaves like a dictionary: the keys are GEO DataSets IDs, and the
  dictionary values for is a dictionary providing various information
  about the particular data set.</p>

<xmp class=code>>>> import obiGEO
>>> info = obiGEO.GDSInfo()
>>> info.keys()[:5]
>>> ['GDS2526', 'GDS2524', 'GDS2525', 'GDS2522', 'GDS1618']
>>> info['GDS2526']['title']
'c-MYC depletion effect on carcinoma cell lines'
>>> info['GDS2526']['platform_organism']
'Homo sapiens'


<p><INDEX name="classes/GDSInfo (in obiGEO)">GDS is a class that
    provides methods for retreival of a specific GEO DataSet. The data
    is provided as Orange's ExampleTable.

<p class=section>Methods</p>
<dl class=attributes>
<dt>GDS(gdsname, verbose=False, force_download=False)</dt>
<dd>Constructor returning the object to be used to retreive GEO
  DataSet table (samples and gene expressions). <code>gdsname</code>
  is an NCBI's ID for the data set in the form "GDSn" where "n" is a
  GDS ID number. Construct checks a local cache directory if the
  particular data file is loaded locally, else it downloads it from
  <a href="ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/GDS/">NCBI's GEO
  FTP site</a>. The download is forced
  if <code>force_download=True</code>. The compressed data file
  resides in the cache directory after the call of the constructor
  (call to <code>orngServerFiles.localpath("GEO")</code> reveals the
  path of this directory).</p>

<xmp class=code>>>> import obiGEO
>>> gds = obiGEO.GDS("GDS1676")
>>> print print ", ".join(gds.genes[:10])
>>> gds.info["title"]
'T cell leukemia cell response to human herpesvirus 6 infection: time course'
>>> print gds
GDS1676 (Homo sapiens), samples=8, features=2100, genes=667, subsets=8

<dt>getdata(report_genes=True, transpose=False,
merge_function=variableMean, sample_type=None,
<dd><p>The call of this method returns the data from GEO DataSet in
  Orange format. Micorarray spots reported in the GEO data set can
  either be merged according to their gene id's
  (<code>report_genes=True</code>) or can be left as spots. The data
  matrix can have spots/genes in rows and samples in columns
  (default, <code>transpose=False</code>) or samples in rows and
  spots/genes in columns
  (<code>transpose=True</code>). Argument <code>sample_type</code>
  defines the type of annotation, or (if <code>transpose=True</code>)
  the type of class labels to be included in the data set. Namely,
  with <code>sample_type</code>, the entire annotation of samples will
  be included either in the class value or in
  the <code>.attributes</code> field of each data set
  attributes. Spots with sample profiles that include unknown values
  are retained by default (<code>remove_unknown=None</code>). They are
  removed if the proportion of samples with unknown values
  is above the threshold set by <code>remove_unknown</code>.</p>

<p>The following illustrates how <code>getdata</code> is used to
  construct a data set with genes in rows and samples in
  columns. Notice that the annotation about each sample is retained
  in <code>.attributes</code>. 

<xmp class=code>>>> import obiGEO
>>> gds = obiGEO.GDS("GDS1676") 
>>> data = gds.getdata()
>>> len(data)
>>> data[0]
[?, ?, -0.803, 0.128, 0.110, -2.000, -1.000, -0.358], {"gene":'EXO1'}
>>> data.domain.attributes[0]
FloatVariable 'GSM63816'
>>> data.domain.attributes[0].attributes
Out[191]: {'dose': '20 U/ml IL-2', 'infection': 'acute ', 'time': '1 d'}



<p>The following script prints out some information about a specific data set. It does not download the data set, just uses the (local) GEO data sets information file.</p>

<p class="header"><a href="geo_gds1.py">geo_gds1.py</a></p>
<xmp class=code>import obiGEO
import textwrap

gdsinfo = obiGEO.GDSInfo()
gds = gdsinfo["GDS10"]

print "ID:", gds["dataset_id"]
print "Features:", gds["feature_count"]
print "Genes:", gds["gene_count"]
print "Organism:", gds["platform_organism"]
print "PubMed ID:", gds["pubmed_id"]
print "Sample types:"
for sampletype in set([sinfo["type"] for sinfo in gds["subsets"]]):
    ss = [sinfo["description"] for sinfo in gds["subsets"] if sinfo["type"]==sampletype]
    print "  %s (%s)" % (sampletype, ", ".join(ss))
print "Description:"
print "\n".join(textwrap.wrap(gds["description"], 70))

<p>The output of this script is:</p>

<xmp class=code>ID: GDS10
Features: 39114
Genes: 20094
Organism: Mus musculus
PubMed ID: 11827943
Sample types:
  disease state (diabetic, diabetic-resistant, nondiabetic)
  strain (NOD, Idd3, Idd5, Idd3+Idd5, Idd9, B10.H2g7, B10.H2g7 Idd3)
  tissue (spleen, thymus)

Examination of spleen and thymus of type 1 diabetes nonobese diabetic
(NOD) mouse, four NOD-derived diabetes-resistant congenic strains and
two nondiabetic control strains.

<p>GEO data sets provide a sort of mini ontology for sample labeling. Samples belong to sample subsets, which in turn belong to specific types. Like above GDS10, which has three sample types, of which the subsets for the tissue type are spleen and thymus. If you are into using data sets for supervised data mining, then it would be useful to find out which of the data sets provide enough samples for each label. It is (semantically) convenient to perform classification within sample subsets of the same type. We therefore need a script that go through the entire set of data sets and finds those for which, for a specific type, there are enough samples within each of the subsets. The following script does the work. The function <code>valid</code> is passed the information about the data set and determines which subset types (if any) satisfy the "validity" criteria. The number of requested samples in the subset is by default set to <code>n=40</code>.</p>

<p class="header"><a href="geo_gds5.py">geo_gds5.py</a></p>
<xmp class=code>import obiGEO

def valid(info, n=40):
    """Return a set of subset types containing more than n samples in every subset"""
    invalid = set()
    subsets = set([sinfo["type"] for sinfo in info["subsets"]])
    for sampleinfo in info["subsets"]:
        if len(sampleinfo["sample_id"]) < n:
    return subsets.difference(invalid)

def report(stypes, info):
    """Pretty-print GDS and valid susbset types"""
    for id, sts in stypes:
        print id
        for st in sts:
            print "  %s:" % st,
            gds = info[id]
            print ", ".join(["%s/%d" % (sinfo["description"], len(sinfo["sample_id"])) \
                             for sinfo in gds["subsets"] if sinfo["type"]==st])

gdsinfo = obiGEO.GDSInfo()
valid_subset_types = [(id, valid(info)) for id, info in gdsinfo.items() if valid(info)]
report(valid_subset_types, gdsinfo)

<p>The requested number of samples, <code>n=40</code>, seems to be a quite a stringent criteria met - at the time of writing of this documentation - by only a few data sets (you may try to lower this threshold):</p>

<xmp class="code">GDS1611
  genotype/variation: wild type/48, upf1 null mutant/48
  agent: none/57, UV/57, IR/57
  other: non-neural/50, neural/100
  gender: male/82, female/48
  tissue: raphe magnus/40, somatomotor cortex/41
  disease state: control/41, Marfan syndrome/60
  tissue: raphe magnus/40, somatomotor cortex/43
  protocol: no treatment/47, hormone replacement therapy/42

<p>Let us now pick one data file from the above (GDS2960) and see if we can predict the disease state. We will use LinearLearner, a fast variant of support vector machines with linear kernel, and within 10-fold cross validation measure AUC, the area under ROC. AUC is the probably for correctly distinguishing between two classes if picking the sample from target (e.g., the disease) and non-target class (e.g., control).</p>

<p class="header"><a href="geo_gds6.py">geo_gds6.py</a></p>
<xmp class="code">import obiGEO
import orange
import orngTest
import orngStat

gds = obiGEO.GDS("GDS2960")
data = gds.getdata(sample_type="disease state", transpose=True)
print "Samples: %d, Genes: %d" % (len(data), len(data.domain.attributes))

learners = [orange.LinearLearner]
results = orngTest.crossValidation(learners, data, folds=10)
print "AUC = %.3f" % orngStat.AUC(results)[0]

<p>The output of this script is:</p>

<xmp class="code">Samples: 101, Genes: 3979
AUC = 0.985</xmp>

<p>The AUC for this data set is very high, indicating that using this particular gene expression data it is almost trivial to separate the two classes.</p>