Wiki

Clone wiki

rPHG / Home

Getting Started with rPHG

Author: Brandon Monier

This page will soon be moving!

This wiki will be moving to a new location on February 10, 2023. Be on the lookout for https://maize-genetics.github.io/rPHG/

Table of Contents

NOTE: If you are using Java 16+ or Java 8 (1.8), you may run into errors when trying pull reference ranges and table data from the BrAPI client! This software has been built and tested on Java 11.

Introduction

Overview

Thanks for checking out rPHG! In this document, we will go over the functionalities used to work with the practical haplotype graph (PHG) API via R.

The PHG is a trellis graph based representation of genic and intergenic regions (called reference ranges or reference intervals) which represent diversity across and between taxa. It can be used to: create custom genomes for alignment, call rare alleles, impute genotypes, and efficiently store genomic data from many lines (i.e. reference, assemblies, and other lines). Skim sequences generated for a given taxon are aligned to consensus sequences in the PHG to identify the haplotype node at a given anchor. All the anchors for a given taxon are processed through a Hidden Markov Model (HMM) to identify the most likely path through the graph. Path information is used to identify the variants (SNPs). Low cost sequencing technologies, coupled with the PHG, facilitate the genotyping of large number of samples to increase the size of training populations for genomic selection models. This can in turn increase predictive accuracy and selection intensity in a breeding program.

Detailed documentation and source code can be found on our website:

https://bitbucket.org/bucklerlab/practicalhaplotypegraph/wiki/Home

Motivation

The main goal of developing this package is to construct an R-based front-end to build and interact with the PHG API that implements commonly used Biocondcutor classes, data structures, and accessor methods for downstream analysis and integration with other packages.

Installation and Preliminary Steps

Which version of R should I be using?

It is highly recommended that you have R (v3.6.0 or higher) installed to avoid any potential installation problems with package dependencies.

Prerequisites - installing rJava

Since the PHG is written in Java, Java JDK will need to be installed on your machine. Additionally, for R to communicate with Java, the R package rJava will need to be installed. Detailed information can be found using the following links, depending on your OS:

Problems with rJava if you have upgraded Java

macOS - you have upgraded Java and rJava no longer works...

If you previously had rJava working through RStudio, then you upgraded your Java and it now longer works, try the following:

At the command line type:

R CMD javareconf

Then check for a left over symbolic link via:

ls -ltr /usr/local/lib/libjvm.dylib

If the link exists, remove it, then create it fresh via these commands:

rm /usr/local/lib/libjvm.dylib
sudo ln -s $(/usr/libexec/java_home)/lib/server/libjvm.dylib /usr/local/lib

You should now be able to enter RStudio and setup rJava.

macOS - clang: error: unsupported option '-fopenmp'

Change clang paths to gcc (via Homebrew). Add the following to your .bash_profile file:

export LD=/usr/local/bin/gcc-9
export CC=/usr/local/bin/gcc-9
export CXX=/usr/local/bin/g++-9
export CPP=/usr/local/bin/cpp-9

alias c++=/usr/local/bin/c++-9
alias g++=/usr/local/bin/g++-9
alias gcc=/usr/local/bin/gcc-9
alias cpp=/usr/local/bin/cpp-9
alias ld=/usr/local/bin/gcc-9
alias cc=/usr/local/bin/gcc-9
NOTE: Make sure your gcc version that you have is 9.

Installing rPHG source from BitBucket

After you have rJava up and running on your machine, install rPHG by obtaining the source code from our BitBucket repository:

if (!require("devtools")) install.packages("devtools")
devtools::install_bitbucket(
    repo = "bucklerlab/rphg",
    ref = "master"
)

Preliminary steps

Setting Memory

Since building the PHG may use an exceptional amount of computational resources, memory allocation to rPHG can be modified. To change the amount of memory, use the base options() function and modify the following parameter:

options(java.parameters = c("-Xmx<memory>", "-Xms<memory>"))

NOTE: Setting Java memory options for rPHG and any rJava-related packages needs to be set before loading the rPHG package!

Replace <memory> with a specified unit of memory. For example, if I want to allocate a maximum of 6 GB of memory for my operations, I would use the input "-Xmx6g", where g stands for gigabyte (GB). More information about memory allocation can be found here.

Loading rPHG

After source code has been compiled and optional Java memory options have been set, the package can be loaded into your environment using:

library(rPHG)

Or, if you want to use a function without violating your environment you can use rPHG::<function>, where <function> is an rPHG function.

The importance of logging your progress

Before we begin analyzing data, optional parameters can be set up to make rPHG more efficient in terms of debugging purposes. To prevent your R console from being overloaded with PHG logging information, it is highly recommended that you start a logging file. This file will house all of PHG’s loggingcoutput which is beneficial for debugging and tracking the progress of your analytical workflow. To start a logging file, use the following command:

rPHG::startLogger(fullPath = NULL, fileName = NULL)

If the rPHG::startLogger() parameters are set to NULL, the logging file will be created in your current working directory. If you are unsure of what your working directory is in R, use the base getwd() command.

Additionally, since this is a general walkthrough, certain intricacies of each function may glossed over. If you would like to study a function in full, refer to the R documentation by using ?<function> in the console, where <function> is an rPHG-based function.

Databases and Configuration Files

Overview

The idea behind the PHG is that in a given breeding program, all parental genotypes can be sequenced at high coverage, and loaded as parental haplotypes in a relational database. Progeny can then be sequenced at low coverage and used to infer which parental haplotypes/genotypes from the database are most likely present in a given progeny.

Database types

Currently, the PHG can use SQLite or PostgreSQL to store data for the pan-genomic graph. For more information about how data is stored within the database schema, please refer to our wiki.

Configuration files

Access to the PHG database, regardless of database type, requires a configuration file. This file contains various metadata needed to access the PHG db and calculate optimal graph paths:

  • host:port
  • username
  • password
  • database name
  • database type (sqlite or postgres)

An example database configuration can be found below:

SQLite example

host=localHost
user=sqlite
password=sqlite
DB=/tempFileDir/outputDir/phgTestDB_mapq48.db
DBtype=sqlite

PostgreSQL example

host=172.17.0.2:5432
user=postgres
password=mySecretPassword
DB=phgdb
DBtype=postgres

Database Access and Graph Building

Method calling

Once you have a PHG database and configuration file, you can proceed to the following steps. First, you can access all available PHG methods from the database using a path parameter to the database configuration file:

# Example path location (not run)
configPath <- "/home/bm646/Temporary/phg_tests/configSQLite.txt"

phgMethods <- rPHG::showPHGMethods(configFile = configPath)
phgMethods
## # A tibble: 7 x 5
##   method_id method_type type_name        method_name      description       
##       <int>       <int> <chr>            <chr>            <chr>             
## 1         1           1 ANCHOR_HAPLOTYP~ B73Ref_method    Test version for ~
## 2         2           7 REF_RANGE_GROUP  refRegionGroup   Group consists of~
## 3         3           7 REF_RANGE_GROUP  refInterRegionG~ Group consists of~
## 4         4           2 ASSEMBLY_HAPLOT~ mummer4          Assembly aligned ~
## 5         5           1 ANCHOR_HAPLOTYP~ GATK_PIPELINE    GATK_PIPELINE cre~
## 6         6           1 ANCHOR_HAPLOTYP~ CONSENSUS        CONSENSUS for cre~
## 7         7           1 ANCHOR_HAPLOTYP~ consensusTest    consensusTest;det~

The above object will produce a tibble-based data frame that will contain return method data from the PHG database. Method IDs and descriptions can be viewed for additional information detailing the methods available to use when building the PHG graph object.

Graph building

Next, we can build the optimal graph object. A single function, called graphBuilder, will be used with subsequent parameters:

phgObj <- graphBuilder(
    configFile = config_path,
    methods = "GATK_PIPELINE"
)

Where:

  • configFile: the path to our PHG database configuration file;
  • methods: one of the prior methods in the showMethods() function call, specifically, the method_name column (see prior output).

When the graph has finished building, we can then inspect the object:

phgObj
## class: PHGDataSet 
## dim: 20 6 
## metadata(1): jObj
## assays(1): hapID
## rownames(20): R1 R11 ... R10 R20
## rowData names(1): refRange_id
## colnames(6): LineA LineA1 ... Ref RefA1
## colData names(0):

When built, the object that is generated is of a PHGDataSet class. In the next section, we will discuss how to extract specific data from this object.

Accessing data

Overview

When called, the PHGDataSet shows the similar output to a SummarizedExperiment or RangedSummarizedExperiment class. More information about this commonly used Bioconductor class can be found here. This is due to the PHGDataSet class inheriting all methods and slot information from a SummarizedExperiment class. Therefore, we can use general accessor methods from either the SummarizedExperiment or S4Vectors packages. Some examples are as follows:

  • assay()
  • colData()
  • rowData()
  • rowRanges()
  • metadata()

A more detailed example of how these methods can be used in a PHGDataSet are shown below:

The layout of a PHGDataSet class.

The three main data types that can be extracted is:

  • rowRanges(); rowData(): reference range data
  • colData(); colnames(): taxa (i.e. genotype) data
  • assay(): haplotype or path ID data in matrix form
  • metadata(): access to the PHG API Java object (for advanced use only)

Getting reference range data

Reference range data can be accessed via the rowRanges() method. To use this, simply call this function on the prior PHG object (e.g. phgObj from our prior examples):

# Make a `GRanges` object
rr <- SummarizedExperiment::rowRanges(phgObj)
rr
## GRanges object with 20 ranges and 1 metadata column:
##       seqnames      ranges strand | refRange_id
##          <Rle>   <IRanges>  <Rle> |    <factor>
##    R1        1      1-3500      * |          R1
##   R11        1   3501-7500      * |         R11
##    R2        1  7501-11000      * |          R2
##   R12        1 11001-15000      * |         R12
##    R3        1 15001-18500      * |          R3
##   ...      ...         ...    ... .         ...
##   R18        1 56001-60000      * |         R18
##    R9        1 60001-63500      * |          R9
##   R19        1 63501-67500      * |         R19
##   R10        1 67501-71000      * |         R10
##   R20        1 71001-75000      * |         R20
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

What has been extracted is an object of GRanges class. This is another commonly used class in the “Bioconductor universe” to represent genomic intervals. More information about this package can be found here. Within this objecct, we get specified reference range information:

  • seqnames: the chromosome from which this reference range is found on
  • ranges: an IRanges object that contains reference range coordinates
  • strand: which strand for each reference range (+, -, *)
  • refRange_id: given reference range ID within the PHG database

From this object, we can then use additional methods to extract more specific details such as start/stop coordinates and width of each reference range:

GenomicRanges::ranges(rr)
IRanges object with 20 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
   R1         1      3500      3500
  R11      3501      7500      4000
   R2      7501     11000      3500
  R12     11001     15000      4000
   R3     15001     18500      3500
  ...       ...       ...       ...
  R18     56001     60000      4000
   R9     60001     63500      3500
  R19     63501     67500      4000
  R10     67501     71000      3500
  R20     71001     75000      4000

Due to the wonderful world of “inheritance”, we can also call this function directly on SummarizedExperiment objects and therefore our main PHGDataSet object:

SummarizedExperiment::ranges(phgObj)
IRanges object with 20 ranges and 0 metadata columns:
          start       end     width
      <integer> <integer> <integer>
   R1         1      3500      3500
  R11      3501      7500      4000
   R2      7501     11000      3500
  R12     11001     15000      4000
   R3     15001     18500      3500
  ...       ...       ...       ...
  R18     56001     60000      4000
   R9     60001     63500      3500
  R19     63501     67500      4000
  R10     67501     71000      3500
  R20     71001     75000      4000

Getting haplotype data

Additionally, haplotype identifiers can also be acessed from this data object. This is obtained using the assays() method, calling specifically, the haplotype ID matrix ($hapID):

SummarizedExperiment::assays(phgObj)$hapID
##     LineA LineA1 LineB LineB1 Ref RefA1
## R1     41    101    61    121  21    81
## R11    51    111    71    131  31    91
## R2     42    102    62    122  22    82
## R12    52    112    72    132  32    92
## R3     43    103    63    123  23    83
## R13    53    113    73    133  33    93
## R4     44    104    64    124  24    84
## R14    54    114    74    134  34    94
## R5     45    105    65    125  25    85
## R15    55    115    75    135  35    95
## R6     46    106    66    126  26    86
## R16    56    116    76    136  36    96
## R7     47    107    67    127  27    87
## R17    57    117    77    137  37    97
## R8     48    108    68    128  28    88
## R18    58    118    78    138  38    98
## R9     49    109    69    129  29    89
## R19    59    119    79    139  39    99
## R10    50    110    70    130  30    90
## R20    60    120    80    140  40   100

The above matrix represents a haplotype ID for each taxa (column names) within the PHG database at each given reference range (row names).

Access Java data

(NOTE: for advanced use only!) To access Java object data for the PHG API, the metadata() method from S4Vectors can be used. Doing so will return a Java reference object that can be used for defining you own potential functions and methods that do not exist within the rPHG package:

S4Vectors::metadata(phgObj)$jObj
## [1] "Java-Object{net.maizegenetics.pangenome.api.HaplotypeGraph@f381794}"

Summary Functions

Overview

Additionally, rPHG currently has several summary- and visualization-based functions. These include ways to determine levels of similarity between reference ranges and taxa.

Retrieve read mapping information from a PHG database

# Example path location (not run)
configPath <- "/home/bm646/Temporary/phg_tests/configSQLite.txt"

# Retrieve data object
rmDF <- rPHG::readMappingsForLineName(
    configFile = configPath,
    lineName = "RefA1_gbs",
    readMappingMethodName = "HAP_COUNT_METHOD",
    haplotypeMethodName = "CONSENSUS"
)

# Inspect `DataFrame` object
rmDF
## DataFrame with 30 rows and 7 columns
##     ref_range_id    chrom     start     hapid         taxa taxa_count range_count
##     <integer>    <factor> <integer> <integer>     <factor>  <integer>   <integer>
## 1           8           1     45501        81    RefA1,Ref          0           0
## 2           8           1     45501        82 LineB1,LineB          0           0
## 3           8           1     45501        83 LineA,LineA1          0           0
## 4           3           1     13001        84 LineB1,LineB          0           0
## 5           3           1     13001        85    RefA1,Ref          0           0
## ...       ...         ...       ...       ...          ...        ...         ...
## 26          5           1     26001       106 LineB1,LineB          0           3
## 27          5           1     26001       107    RefA1,Ref          3           3
## 28          4           1     19501       108    RefA1,Ref          1           1
## 29          4           1     19501       109 LineA,LineA1          0           1
## 30          4           1     19501       110 LineB1,LineB          0           1

This object will return two significant columns (range_count and taxa_count). range_count is the number of reads mapping to a range while taxa_count is the number of those reads mapping to a specific taxon in the range. taxa_count <= range_count.

Return a matrix for all the paths for pathMethod

# Example path location (not run)
configPath <- "/home/bm646/Temporary/phg_tests/configSQLite.txt"

# Get path matrix
pathMet <- rPHG::pathsForMethod(
    configFile = configPath,
    pathMethod = "PATH_METHOD"
)

# Inspect head of matrix
head(pathMet)
##              1  2  3   4   5  6  7  8  9  10
## LineA1_gbs 103 95 86 109 105 96 88 83 91  99
## LineA1_wgs 103 95 86 109 105 96 88 83 91  99
## LineA_gbs  103 95 86 109 105 96 88 83 91  99
## LineA_wgs  103 95 86 109 105 96 88 83 91  99
## LineB1_gbs 104 94 84 110 106 97 87 82 90 101
## LineB1_wgs 104 94 84 110 106 97 87 82 90 101

Determine numbers of haplotypes per reference range

To see the total number of haplotypes at each given reference range, the function, numHaploPerRange() can be used:

rPHG::numHaploPerRange(phgObject = phgObj)
## DataFrame with 20 rows and 5 columns
##           names     start       end     width numHaplotypes
##     <character> <integer> <integer> <integer>     <integer>
## 1            R1         1      3500      3500             6
## 2           R11      3501      7500      4000             6
## 3            R2      7501     11000      3500             6
## 4           R12     11001     15000      4000             6
## 5            R3     15001     18500      3500             6
## ...         ...       ...       ...       ...           ...
## 16          R18     56001     60000      4000             6
## 17           R9     60001     63500      3500             6
## 18          R19     63501     67500      4000             6
## 19          R10     67501     71000      3500             6
## 20          R20     71001     75000      4000             6

The data that is returned will contain each reference range coordinate, IDs, and the number of haplotypes per range (numHaplotypes). This data can also be passed to a plotting function, plotNumHaplo() which will display the prior data as a linear genomic plot. If you are familiar with pipe (%>%) operators in R via the magrittr package, this pipeline from PHGDataSet to visualization can be used as well:

library(magrittr)

# (1) Non-pipe example
haploPlot <- plotNumHaplo(numHaploPerRange(phgObject = phgObj))

# (2) Pipe example. Need to load `magrittr` package first!
haploPlot <- phgObj %>%
    rPHG::numHaploPerRange() %>%
    rPHG::plotNumHaplo()

# Return visualization
haploPlot

Visualize haplotype numbers per reference range.

Each reference range is displayed as an alternating blue-colored rectangle based on its position within the genome while the y-axis denotes the total number of unique haplotype IDs per reference range.

Visualize mutual information between pairs of ranges

plotMutualInfo() plots the “amount of mutual information” obtained about one random variable through observing the other random variable. The higher the numeric value, the more related taxa for a given reference range.

For this function, a PHGDataSet and reference range IDs in the form of a vector (e.g. c("R1", "R2", "R3")) are needed. This function can also be piped using the %>% operator from magrittr.

library(magrittr)

phgObj %>%
    rPHG::plotMutualInfo(
      refRanges = assays(phgObj)$hapID %>% 
        rownames()
    )

Visualize reference range “relatedness”.

BrAPI access

Overview

As of October 2021, maize-related PHG data can be accessed via the web using the Breeding Application Interface (BrAPI). In this section, we will go over how you can retrieve such data.

The BrapiCon object

To interface with the given BrAPI endpoints, a BrAPI connection object (BrapiCon) must be made. This object will the central component to how we retrieve data:

myCon <- BrapiCon("cbsudc01.biohpc.cornell.edu"); myCon
A BrAPI connection object
  Server...........: cbsudc01.biohpc.cornell.edu 
  Port.............: 80 
  Server status....: 200 
  BrAPI version....: v2

This object can take several parameters. The primary argument is the host (i.e. the web address) to where the PHG data is located. Currently it is located at cbsudc01.biohpc.cornell.edu. If you have custom port access and/or specific protocols (e.g. http or https), this will need to be specified with the port and protocol arguments, respectively. In the next few sections, we will see how this object is used with other rPHG methods using the %>% operator from magrittr.

Retrieve available PHG methods

To see what available methods are present in the PHG web service, we can use the method availablePHGMethods(). Simply "pipe" (e.g. %>%) the BrapiCon object into this method:

myCon %>% availablePHGMethods()

This will return a simple dataframe of method names and the dimensionality of each data set:

# A tibble: 11 x 4
   variantTableDbId                           numVariants numSamples additionalInfo
   <chr>                                            <int>      <int> <lgl>         
 1 v5_gffGenes                                      57790          1 NA            
 2 Ames_NonMergedReadMapping_AllLines_Haploid       57725       5532 NA            
 3 mummer4_released_assemblies                      57671          4 NA            
 4 mummer4_stitched_assemblies                      57775         12 NA            
 5 NonMergedReadMapping_AllNamParents_Haploid       57598       6264 NA            
 6 mummer4_NAM_released_assemblies                  57790         26 NA            
 7 ALL_MAIZE_1_COLLAPSED                            57790         53 NA            
 8 mummer4_gap_filled_assemblies                    57775         11 NA            
 9 MergedReadMapping_AllNamParents_Haploid          57598       4984 NA            
10 Ames_MergedReadMapping_AllLines_Haploid          57725       3327 NA            
11 mummer4_PATH                                     57790         53 NA

Reading in data to R memory

Now that we have an idea what PHG methods are available, we can specify a method argument to another rPHG method: PHGMethod(). This object is an intermediate "pointer" between the core BrAPI interface and a given PHG method. This will dictate how our read* methods will work. A PHG data set is made up of three primary components:

  • samples (e.g. taxa)
  • reference ranges (e.g. positions)
  • feature data (e.g. a haplotype ID or index)

Reading samples

To read samples, we can use the readSamples() method. This will point PHG method URL to the samples endpoint and retrieve and parse data into R as a data.frame object:

myMethod <- "NonMergedReadMapping_AllNamParents_Haploid"
myCon %>% 
    PHGMethod(myMethod) %>%
    readSamples()
# A tibble: 6,264 x 4
   sampleName            sampleDbId sampleDescription            additionalInfo
   <chr>                 <chr>      <chr>                        <lgl>         
 1 Z001E0001-628NHAAXX_1 5039       genotype for adding the path NA            
 2 Z001E0001-D10RTACXX_5 5040       genotype for adding the path NA            
 3 Z001E0002-628NHAAXX_1 5041       genotype for adding the path NA            
 4 Z001E0002-D10RTACXX_5 5042       genotype for adding the path NA            
 5 Z001E0003-628NHAAXX_1 5043       genotype for adding the path NA            
 6 Z001E0003-D10RTACXX_5 5044       genotype for adding the path NA            
 7 Z001E0004-628NHAAXX_2 5045       genotype for adding the path NA            
 8 Z001E0004-D10RTACXX_5 5046       genotype for adding the path NA            
 9 Z001E0005-628NHAAXX_2 5047       genotype for adding the path NA            
10 Z001E0005-D10RTACXX_5 5048       genotype for adding the path NA            
# ... with 6,254 more rows

Reading reference ranges

To read reference range data, we can use the readRefRanges() method. This will point PHG method URL to the variants endpoint and retrieve and parse data into R as a GRanges object:

myCon %>% 
    PHGMethod(myMethod) %>%
    readRefRanges()
GRanges object with 57598 ranges and 1 metadata column:
          seqnames              ranges strand | variantDbId
             <Rle>           <IRanges>  <Rle> |   <numeric>
      [1]        1             1-34616      * |           1
      [2]        1         34617-40204      * |           2
      [3]        1         40205-41213      * |           3
      [4]        1         41214-46762      * |           4
      [5]        1        46763-108553      * |           5
      ...      ...                 ...    ... .         ...
  [57594]       10 152012585-152012892      * |       57786
  [57595]       10 152012893-152015823      * |       57787
  [57596]       10 152015824-152200924      * |       57788
  [57597]       10 152200925-152201511      * |       57789
  [57598]       10 152201512-152435371      * |       57790
  -------
  seqinfo: 10 sequences from an unspecified genome; no seqlengths

From this GRanges object, we return the usual sequence name (e.g. chromosome) info along with the range start and stop locations that are 1-based. Additionally, one metadata column will be returned that indicates the reference range database ID (e.g. variantDbId) that can used to link this data up with feature information described below.

Read feature information

In the last two sections, we have shown how you can retrieve row and column data respectively. To read in the actual feature data (e.g. the cell information of the table), we can use the readTable() method. This function has one argument which specifies if you would want to return the actual haplotype ID information stored in the PHG database or the indexes of the given haplotype data for a given reference range. This can be specified with the index parameter:

## Read in haplotype indexes ----
myCon %>% 
    PHGMethod(myMethod) %>%
    readTable(index = TRUE) -> phgTableIndex

phgTableIndex[1:10, 1:4]
                      23141 23142 23143 23144
Z001E0007-628NHAAXX_2    26    26    26    26
Z001E0007-D10RTACXX_5    26    26    26    26
Z001E0010-D10RTACXX_5     1     1     1     1
Z001E0011-D10RTACXX_5    26    26    26    26
Z001E0012-D10RTACXX_5    26    26    26    26
Z001E0014-D10RTACXX_5     1     1     1     1
Z001E0016-D10RTACXX_5    26    26    26    26
Z001E0018-D10RTACXX_5    26    26    26    26
Z001E0019-D10RTACXX_5    26    26    26    26
Z001E0020-628NJAAXX_5    26    26    26    26
## Read in haplotype IDs ----
myCon %>% 
    PHGMethod(myMethod) %>%
    readTable() -> phgTableIds

phgTableIds[1:10, 1:4]
                        23141   23142   23143   23144
Z001E0007-628NHAAXX_2 1726472 1726473 1726474 1726475
Z001E0007-D10RTACXX_5 1726472 1726473 1726474 1726475
Z001E0010-D10RTACXX_5   65540   65541   65542   65543
Z001E0011-D10RTACXX_5 1726472 1726473 1726474 1726475
Z001E0012-D10RTACXX_5 1726472 1726473 1726474 1726475
Z001E0014-D10RTACXX_5   65540   65541   65542   65543
Z001E0016-D10RTACXX_5 1726472 1726473 1726474 1726475
Z001E0018-D10RTACXX_5 1726472 1726473 1726474 1726475
Z001E0019-D10RTACXX_5 1726472 1726473 1726474 1726475
Z001E0020-628NJAAXX_5 1726472 1726473 1726474 1726475

From the prior tables, a matrix object will be returned. Row names are taxa and column names are reference range database IDs. This data can be linked up with prior sample and reference range information.

Note: While the PHG is focused on reference ranges instead of SNP-based locations, matrix size can still be problematic on machines with limited resources. Currently, the largest maize PHG data set has a size of 6264 samples and 57598 reference ranges.

Note: Since this is web-based, data will be retrieved as URL requests, so optimal high-speed internet is recommended.

Filtering data (overview)

For certains use-cases, a user is not interested in the entire PHG data set, but merely a "snapshot" of the data at given location (filtered reference ranges) and/or given taxa (filtered samples). To filter data, we can use the methods filterRefRanges() and filterSamples() in conjunction with our read methods.

Filtering reference ranges

Filtering reference ranges can be specified in several ways. The main use case is to pass range-based locations of interest as a GRanges object with the gr parameter:

## Example query
queryRanges <- GRanges(
    seqnames = c("4"),
    ranges = IRanges(
        start = c(43300000),
        end = c(43450000)
    )
)

## Use GRanges query for filter
myCon %>%
    PHGMethod(myMethod) %>%
    filterRefRanges(gr = queryRanges) %>%
    readRefRanges()

We can also specify a whole chromosome or set of chromosomes if we wish:

## Select a whole chromosome(s)
myCon %>%
    PHGMethod(myMethod) %>%
    filterRefRanges(chromosome = c(1, 5, 10)) %>%
    readRefRanges()

Or we can specify a given series of locations with vectors if you don't want to deal with GRanges:

myCon %>%
    PHGMethod(myMethod) %>%
    filterRefRanges(
        chromosome = c(1, 5, 10), 
        start = c(100, 140, 150), 
        end = c(10000, 30000, 35000)
    ) %>%
    readRefRanges()

And we can also specify a combination of both GRanges objects and whole chromosomes if we wish:

myCon %>%
    PHGMethod(myMethod) %>%
    filterRefRanges(
        gr = queryRanges,
        chromosome = c(1, 5, 10)
    ) %>%
    readRefRanges()

Filtering samples

We can filter on select samples by passing a vector of taxa as character objects using the filterSamples() method:

querySamples <- c(
    "Z001E0066-628NHAAXX_1", 
    "Z003E0197-627C3AAXX_3",
    "Z003E0160-705VVAAXX_3"
)

myCon %>%
    PHGMethod(myMethod) %>%
    filterSamples(querySamples) %>%
    readSamples()

Filtering on both reference ranges and samples

Finally, we can add everything together and query whole tables. This will filter on both rows (taxa) and columns (reference ranges):

myCon %>%
    PHGMethod(myMethod) %>%
    filterRefRanges(
        gr = queryRanges,
        chromosome = c(1, 5, 10)
    ) %>%
    filterSamples(querySamples) %>%
    readTable()

How to ask for help and report bugs

Asking questions

Please use Biostars to ask for help. Instructions for using Biostars are here

PHG Workshop (June 4-8, 2018): Cornell University - Ithaca

Agenda

Presentations

PHG Workshop (August 17-18, 2018): IRRI - Philippines

Agenda

Presentations

Updated