
<!-- README.md is generated from README.Rmd. Please edit that file -->

ctlgeo <img src="man/figures/ctlverse-sticker-01.png" align="right" width="200"/>

The goal of ctlgeo is to ease loading and parsing of Gene Expression Omnibus (GEO) data. All you need are the identifiers of the desired GSE (experiment-level) records (“GSEXXXXX”). From there, the data can be downloaded, loaded, and parsed in a tidy fashion for easier analysis. ctlgeo was originally designed for the analysis of Cytotoxic T Lymphocyte (CTL) data, but its functions generalize to any GEO-based data. It is part of the ctlverse.

Example

To download a given file, you’ll first want to get the GSE identifier. Online tools such as GEOracle are handy for broader mining, but usually you’ll have some in mind from select publications.

Once you have the GSE number(s), download the data locally. The easiest file to work with is the soft family matrix file, which contains both metadata and expression data for downstream analyses. You can grab this data from the web via ctlgeo::soft_grab_data():

## ctlgeo::soft_build_url("GSE561") # if you want to see url yourself
ctlgeo::soft_grab_data(gse = c("GSE561", "GSE100807"), dir = "path/to/soft_data")
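
For orientation, here is a minimal sketch of what such a download URL might look like. It assumes the SOFT family file under GEO's public FTP layout; `geo_soft_url()` is a hypothetical helper written for illustration and is not necessarily what `ctlgeo::soft_build_url()` does internally.

```r
# Hypothetical helper, not part of ctlgeo: build the HTTPS URL of a
# GSE SOFT family file, assuming GEO's public FTP directory layout.
geo_soft_url <- function(gse) {
  stopifnot(grepl("^GSE[0-9]+$", gse))
  # GEO groups series into folders by replacing the last (up to) three
  # digits with "nnn", e.g. GSE100807 -> GSE100nnn and GSE561 -> GSEnnn.
  stub <- sub("[0-9]{1,3}$", "nnn", gse)
  sprintf(
    "https://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/soft/%s_family.soft.gz",
    stub, gse, gse
  )
}

# Vectorised over identifiers, like the gse argument above.
urls <- geo_soft_url(c("GSE561", "GSE100807"))
```

Printing `urls` lets you check the addresses in a browser before downloading anything.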

Then load the downloaded data via one of the ctlgeo::soft_load_data*() family of functions: either provide the gse and dir arguments again (ctlgeo::soft_load_data()), load from specific paths (ctlgeo::soft_load_data_file()), or load an entire directory of GSE soft files (ctlgeo::soft_load_data_dir()). Each returns a list of S4 objects derived from the matrix files, parsed by GEOquery::getGEO().

## The family of functions is `ctlgeo::soft_load_data*()`
soft_list <- ctlgeo::soft_load_data_dir(dir = "path/to/soft_data")
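
If you are curious what directory loading amounts to, the sketch below does something similar with GEOquery directly. It is an approximation under assumptions, not ctlgeo's actual implementation: the file-name pattern and the way list names are derived are guesses, and `soft_dir` is the same placeholder path as above.

```r
library(GEOquery)  # Bioconductor package that provides getGEO()

# Placeholder directory from the example above.
soft_dir <- "path/to/soft_data"

# Collect SOFT files, compressed or not (assumed naming pattern).
soft_files <- list.files(soft_dir, pattern = "\\.soft(\\.gz)?$",
                         full.names = TRUE)

# getGEO(filename = ...) parses a local file instead of downloading it.
soft_list_manual <- lapply(soft_files, function(f) getGEO(filename = f))

# Name each element by its GSE identifier, assuming files are named
# like "GSE561_family.soft.gz".
names(soft_list_manual) <- sub("_family.*$", "", basename(soft_files))
```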

Then you can start getting to the exciting part of parsing metadata. First convert the list to a tibble:

soft_tbl <- ctlgeo::soft_list_to_tbl(soft_list)

Then parse the metadata of each soft S4 object in the tibble at both the experiment (GSE) and sample (GSM) level:

library(dplyr)  # mutate() and %>%
library(purrr)  # map()

soft_tbl <- soft_tbl %>%
    mutate(meta_gse = map(S4, ctlgeo::soft_parse_meta_gse),
           meta_gsm = map(S4, ctlgeo::soft_parse_meta_gsm))

tidyr::unnest(soft_tbl, meta_gse)[1:2, c(1, 5:6)]
#> # A tibble: 2 x 3
#>   GSE       contact_city contact_country
#>   <chr>     <chr>        <chr>          
#> 1 GSE100807 Houston      USA            
#> 2 GSE25846  Iowa City    USA
tidyr::unnest(soft_tbl, meta_gsm)[1:2, c("GSE", "GSM", "title")]
#> # A tibble: 2 x 3
#>   GSE       GSM        title           
#>   <chr>     <chr>      <chr>           
#> 1 GSE100807 GSM2693617 CD4-ICOSpos-B430
#> 2 GSE100807 GSM2693618 CD4-ICOSpos-B431
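
The parse-then-unnest pattern above relies on list-columns. The toy sketch below shows the same mechanics without any GEO data: the identifiers, the plain lists standing in for S4 objects, and `parse_samples()` are all made up for illustration.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Toy stand-ins for parsed S4 objects: plain lists, purely illustrative.
toy_tbl <- tibble(
  GSE = c("GSE_A", "GSE_B"),
  S4  = list(list(samples = c("s1", "s2")), list(samples = "s3"))
)

# Hypothetical parser returning one row per sample.
parse_samples <- function(x) tibble(GSM = x$samples)

# map() stores one small tibble per row in a list-column ...
toy_tbl <- toy_tbl %>%
  mutate(meta_gsm = map(S4, parse_samples))

# ... and unnest() expands that list-column into regular rows.
toy_long <- unnest(toy_tbl, meta_gsm)
```

Keeping each level of metadata in its own list-column means you can unnest only the level you need for a given analysis.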

Source code

Code can be found at Bitbucket. Feedback, comments, and discussion are always welcome via issues.