
<!-- README.md is generated from README.Rmd. Please edit that file -->

ctlcon <img src="man/figures/ctlverse-sticker-01.png" align="right" width="200"/>

The goal of ctlcon is to organize high-throughput sequencing data derived from CD8+ T cells (aka Cytotoxic T Lymphocytes) into a consensus R package for data portability, accessibility, and reproducibility. This is the first package of the ctlverse, which is envisioned to become a resource tailored to the analysis of CD8+ T cells. The ctlverse strives to use modern R idioms, and is particularly inspired by the tidyverse in its development style.

Installation

You can install the development version of ctlcon from Bitbucket with:

devtools::install_bitbucket("robert_amezquita/ctlcon")

Additionally, I recommend installing the tidyverse family of packages (install.packages('tidyverse')) before loading ctlcon.

Scope

The package currently focuses on CTL data derived from mice in the context of acute viral infection (generally LCMV) and includes data from three laboratories: Pereira, Kaech, and Goldrath. Data is generally derived from naive (N), bulk effector (E), and bulk memory (M) cells, as well as from two effector cell subsets, memory precursors (MP) and terminal effectors (TE), as defined by IL7R and KLRG1 expression.

Future development may involve adding datasets or expanding into human CTL immunology. To keep added datasets comparable, I believe it is imperative to process all data through the same pipeline (see seqsnake). I would be happy to collaborate on expanding the datasets included in this package.

Data Preprocessing

Preprocessing procedures are outside the scope of this package, but generally include quality control, alignment, and peak calling. For more information, see seqsnake and/or contact me. Raw data is not included in the package due to file size limitations; please contact me to coordinate data sharing. Note, however, that all data is accessible via the Gene Expression Omnibus and/or Sequence Read Archive.

Package Organization

All final user-facing outputs are stored in the data/ folder, and once the package is loaded, the objects therein are accessible by name. Available datasets are most easily viewed via the online reference at ctlcon.netlify.com or the package manual.
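As a complement to the online reference, the datasets shipped with an installed package can also be listed from within R. The snippet below is a minimal sketch using base R's data(package = ...); the helper name list_datasets is hypothetical and not part of ctlcon, and the call returns an empty vector if the package is not installed.

```r
# Hypothetical helper (not part of ctlcon): list the dataset names that an
# installed package ships in its data/ folder.
list_datasets <- function(pkg) {
  # Return an empty vector rather than failing if the package is missing
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(character(0))
  }
  # data(package = ) returns an object whose $results matrix has an
  # "Item" column holding the dataset names
  unname(data(package = pkg)$results[, "Item"])
}

# Names of the objects available once ctlcon is installed
list_datasets("ctlcon")
```

Any name returned this way can then be looked up with ? for its documentation.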

To learn more about how individual objects were crafted, see the data-raw folder for relevant R scripts.

Data Organization

Data can be accessed via the data objects stored in the package. Each object has its own documentation, which can be accessed via the help operator ? (for example, ?genes_mm10). Generally, the data falls into the categories below.

Genome Annotation

A consistent genome annotation is necessary for the integration of data from various -omics types. Ensembl-based annotations are the primary reference used throughout the package, with associated Entrez IDs and gene symbols (for human readability) as secondary annotations. See genes_mm10 and transcripts_mm10 for gene and transcript annotations, respectively. transcripts_mm10 is the primary resource from which all other genome annotations are derived (including those below), and is defined based on the latest Ensembl build, GRCm38.p5 (release 91).

The tss_mm10 object is intended for annotating genomic regions (for example, from ChIP-seq) to their nearest genes. Some care should be taken when using it, as TSSs are defined by transcript start sites, so a gene with several transcripts can have several entries. It may be desirable to collapse the object into gene-level annotations.
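One way to collapse transcript-level TSSs to the gene level is to keep the most upstream TSS per gene. The sketch below uses a toy table in base R; the column names (gene_id, transcript_id, chr, strand, tss) and the helper collapse_tss are assumptions for illustration and may not match the actual layout of tss_mm10.

```r
# Toy transcript-level TSS table; column names and values are made up and
# are NOT the actual layout of tss_mm10.
tss_tx <- data.frame(
  gene_id       = c("geneA", "geneA", "geneB", "geneB"),
  transcript_id = c("txA1", "txA2", "txB1", "txB2"),
  chr           = c("chr1", "chr1", "chr2", "chr2"),
  strand        = c("+", "+", "-", "-"),
  tss           = c(1000L, 1500L, 9000L, 8200L),
  stringsAsFactors = FALSE
)

# Keep one TSS per gene: the most upstream one, i.e. the smallest
# coordinate on the + strand and the largest on the - strand.
collapse_tss <- function(tx) {
  per_gene <- split(tx, tx$gene_id)
  rows <- lapply(per_gene, function(g) {
    pick <- if (g$strand[1] == "+") which.min(g$tss) else which.max(g$tss)
    g[pick, c("gene_id", "chr", "strand", "tss")]
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

tss_gene <- collapse_tss(tss_tx)
print(tss_gene)
```

Other choices, such as keeping the TSS of the most highly expressed transcript, may suit some analyses better; the most upstream rule is only one option.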

For some annotations, it may be desirable instead to annotate to an entire genomic range as opposed to simply the start site. For such uses, precalculated ranges_transcripts_mm10 and ranges_genes_mm10 are provided as well.

homologenes can be used to map between mouse and human genes using Entrez IDs.
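The mapping itself amounts to a lookup between two ID columns. The following is a minimal base R sketch on a toy table; the column names entrez_mouse and entrez_human, the helper map_to_human, and the IDs are all made up for illustration and may differ from the actual homologenes object.

```r
# Toy homolog table with made-up IDs; column names are hypothetical and
# may not match the homologenes object.
homologs <- data.frame(
  entrez_mouse = c("1001", "1002", "1003"),
  entrez_human = c("2001", "2002", "2003"),
  stringsAsFactors = FALSE
)

# Look up the human ID for each mouse ID; IDs without a homolog map to NA.
# match() returns only the first hit, so one-to-many relationships would
# need merge() or a similar join instead.
map_to_human <- function(mouse_ids, homologs) {
  homologs$entrez_human[match(mouse_ids, homologs$entrez_mouse)]
}

mapped <- map_to_human(c("1002", "9999"), homologs)
print(mapped)
```

Entrez IDs are handled as character strings here so that they are never treated as numbers by accident.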

Genesets

Genesets from MSigDB are tidied for use in downstream analyses. The original human genesets are provided as msigdb_hg38, and a version mapped to mouse via homologenes is provided as msigdb_mm10.

Sample Annotation

The mapping object contains annotation data for each sample used throughout the package for data described below.

Expression Data

The txi_db object contains raw count expression data derived from Kallisto. Normalized downstream transforms of the data are included for ease of use, including the DESeq2 objects dds and rlog, as well as tibbles derived from these objects, results_tbl and rlog_tbl.

Region Data

Results derived from ATAC-seq or ChIP-seq via MACS2 are organized into region files, generally in BED-compatible format, which at minimum includes chromosome, start, and end columns, plus additional columns describing characteristics of the region (generally annotation information and peak call statistics). Raw outputs can be viewed via narrowpeak, broadpeak, and summit.

For replicated samples, it is desirable to calculate consensus regions common to shared conditions or cell types. These can be accessed by inspecting consensuspeak and assaypeak.
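To illustrate the general idea of a consensus region, the sketch below keeps only the portions of peaks that overlap between two replicates. It is a toy base R example with made-up coordinates and a hypothetical helper, intersect_peaks; it is not necessarily how consensuspeak or assaypeak were computed.

```r
# Toy peak tables for two replicates (BED-like: chr, start, end).
# Coordinates are made up for illustration.
rep1 <- data.frame(chr = c("chr1", "chr1", "chr2"),
                   start = c(100L, 500L, 50L),
                   end = c(200L, 600L, 150L),
                   stringsAsFactors = FALSE)
rep2 <- data.frame(chr = c("chr1", "chr2", "chr2"),
                   start = c(150L, 120L, 400L),
                   end = c(260L, 180L, 450L),
                   stringsAsFactors = FALSE)

# Hypothetical helper: return the overlapping portion of every pair of
# peaks that share a chromosome and overlap by at least one base.
intersect_peaks <- function(a, b) {
  # All pairs of peaks on the same chromosome
  pairs <- merge(a, b, by = "chr", suffixes = c("_a", "_b"))
  start <- pmax(pairs$start_a, pairs$start_b)
  end <- pmin(pairs$end_a, pairs$end_b)
  keep <- start < end
  out <- data.frame(chr = as.character(pairs$chr)[keep],
                    start = start[keep],
                    end = end[keep],
                    stringsAsFactors = FALSE)
  out <- out[order(out$chr, out$start), ]
  rownames(out) <- NULL
  out
}

consensus <- intersect_peaks(rep1, rep2)
print(consensus)
```

The all-pairs merge is fine for a toy example but scales poorly; for genome-wide peak sets, an interval library such as GenomicRanges is the more usual choice.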

Differential Region Analyses

Raw counts were derived for the various regions and input into DESeq2 for differential analysis of regions across conditions; these are accessible via assaypeak_htseq and assaypeak_deseq, respectively.

Source Code

All source code is viewable on Bitbucket at https://bitbucket.com/robert_amezquita/ctlcon. Please feel free to submit issues to provide feedback, make requests, or start a conversation about development, and to follow up by contributing pull requests to the repo.

Contact

For questions and feedback, please email me at robert.amezquita@fredhutch.org or submit issues to the Bitbucket repo.