Wiki
Clone wikirPHG / Home
Getting Started with rPHG
Author: Brandon Monier
This page will soon be moving!
This wiki will be moving to a new location on February 10, 2023. Be on the lookout for https://maize-genetics.github.io/rPHG/
Table of Contents
- Getting Started with rPHG
- This page will soon be moving!
- Table of Contents
- Introduction
- Installation and Preliminary Steps
- Databases and Configuration Files
- Database Access and Graph Building
- Accessing data
- Summary Functions
- BrAPI access
- How to ask for help and report bugs
NOTE: If you are using Java 16+ or Java 8 (1.8), you may run into errors when trying pull reference ranges and table data from the BrAPI client! This software has been built and tested on Java 11.
Introduction
Overview
Thanks for checking out rPHG! In this document, we will go over the functionalities used to work with the practical haplotype graph (PHG) API via R.
The PHG is a trellis graph based representation of genic and intergenic regions (called reference ranges or reference intervals) which represent diversity across and between taxa. It can be used to: create custom genomes for alignment, call rare alleles, impute genotypes, and efficiently store genomic data from many lines (i.e. reference, assemblies, and other lines). Skim sequences generated for a given taxon are aligned to consensus sequences in the PHG to identify the haplotype node at a given anchor. All the anchors for a given taxon are processed through a Hidden Markov Model (HMM) to identify the most likely path through the graph. Path information is used to identify the variants (SNPs). Low cost sequencing technologies, coupled with the PHG, facilitate the genotyping of large number of samples to increase the size of training populations for genomic selection models. This can in turn increase predictive accuracy and selection intensity in a breeding program.
Detailed documentation and source code can be found on our website:
https://bitbucket.org/bucklerlab/practicalhaplotypegraph/wiki/Home
Motivation
The main goal of developing this package is to construct an R-based front-end to build and interact with the PHG API that implements commonly used Biocondcutor classes, data structures, and accessor methods for downstream analysis and integration with other packages.
Installation and Preliminary Steps
Which version of R should I be using?
It is highly recommended that you have R (v3.6.0 or higher) installed to avoid any potential installation problems with package dependencies.
Prerequisites - installing rJava
Since the PHG is written in Java, Java JDK will need to be installed on
your machine. Additionally, for R to communicate with Java, the R
package rJava
will need to be installed. Detailed information can be
found using the following links, depending on your OS:
Problems with rJava if you have upgraded Java
macOS - you have upgraded Java and rJava no longer works...
If you previously had rJava
working through RStudio, then you upgraded your Java and it now longer works, try the following:
At the command line type:
R CMD javareconf
Then check for a left over symbolic link via:
ls -ltr /usr/local/lib/libjvm.dylib
If the link exists, remove it, then create it fresh via these commands:
rm /usr/local/lib/libjvm.dylib
sudo ln -s $(/usr/libexec/java_home)/lib/server/libjvm.dylib /usr/local/lib
You should now be able to enter RStudio and setup rJava.
macOS - clang: error: unsupported option '-fopenmp'
Change clang
paths to gcc
(via Homebrew). Add the following to your .bash_profile
file:
export LD=/usr/local/bin/gcc-9 export CC=/usr/local/bin/gcc-9 export CXX=/usr/local/bin/g++-9 export CPP=/usr/local/bin/cpp-9 alias c++=/usr/local/bin/c++-9 alias g++=/usr/local/bin/g++-9 alias gcc=/usr/local/bin/gcc-9 alias cpp=/usr/local/bin/cpp-9 alias ld=/usr/local/bin/gcc-9 alias cc=/usr/local/bin/gcc-9
gcc
version that you have is 9
.
Installing rPHG source from BitBucket
After you have rJava
up and running on your machine, install rPHG
by
obtaining the source code from our BitBucket repository:
if (!require("devtools")) install.packages("devtools") devtools::install_bitbucket( repo = "bucklerlab/rphg", ref = "master" )
Preliminary steps
Setting Memory
Since building the PHG may use an exceptional amount of computational
resources, memory allocation to rPHG
can be modified. To change the
amount of memory, use the base options()
function and modify the
following parameter:
options(java.parameters = c("-Xmx<memory>", "-Xms<memory>"))
NOTE: Setting Java memory options for rPHG
and any rJava
-related
packages needs to be set before loading the rPHG
package!
Replace <memory>
with a specified unit of memory. For example, if I
want to allocate a maximum of 6 GB of memory for my operations, I would
use the input "-Xmx6g"
, where g
stands for gigabyte (GB). More
information about memory allocation can be found
here.
Loading rPHG
After source code has been compiled and optional Java memory options have been set, the package can be loaded into your environment using:
library(rPHG)
Or, if you want to use a function without violating your environment you
can use rPHG::<function>
, where <function>
is an rPHG
function.
The importance of logging your progress
Before we begin analyzing data, optional parameters can be set up to
make rPHG
more efficient in terms of debugging purposes. To prevent
your R console from being overloaded with PHG logging information,
it is highly recommended that you start a logging file. This
file will house all of PHG’s loggingcoutput which is beneficial for
debugging and tracking the progress of your analytical workflow. To
start a logging file, use the following command:
rPHG::startLogger(fullPath = NULL, fileName = NULL)
If the rPHG::startLogger()
parameters are set to NULL
, the logging
file will be created in your current working directory. If you are
unsure of what your working directory is in R, use the base getwd()
command.
Additionally, since this is a general walkthrough, certain intricacies
of each function may glossed over. If you would like to study a function
in full, refer to the R documentation by using ?<function>
in the
console, where <function>
is an rPHG
-based function.
Databases and Configuration Files
Overview
The idea behind the PHG is that in a given breeding program, all parental genotypes can be sequenced at high coverage, and loaded as parental haplotypes in a relational database. Progeny can then be sequenced at low coverage and used to infer which parental haplotypes/genotypes from the database are most likely present in a given progeny.
Database types
Currently, the PHG can use SQLite or PostgreSQL to store data for the pan-genomic graph. For more information about how data is stored within the database schema, please refer to our wiki.
Configuration files
Access to the PHG database, regardless of database type, requires a configuration file. This file contains various metadata needed to access the PHG db and calculate optimal graph paths:
- host:port
- username
- password
- database name
- database type (sqlite or postgres)
An example database configuration can be found below:
SQLite example
host=localHost
user=sqlite
password=sqlite
DB=/tempFileDir/outputDir/phgTestDB_mapq48.db
DBtype=sqlite
PostgreSQL example
host=172.17.0.2:5432
user=postgres
password=mySecretPassword
DB=phgdb
DBtype=postgres
Database Access and Graph Building
Method calling
Once you have a PHG database and configuration file, you can proceed to the following steps. First, you can access all available PHG methods from the database using a path parameter to the database configuration file:
# Example path location (not run) configPath <- "/home/bm646/Temporary/phg_tests/configSQLite.txt" phgMethods <- rPHG::showPHGMethods(configFile = configPath) phgMethods
## # A tibble: 7 x 5
## method_id method_type type_name method_name description
## <int> <int> <chr> <chr> <chr>
## 1 1 1 ANCHOR_HAPLOTYP~ B73Ref_method Test version for ~
## 2 2 7 REF_RANGE_GROUP refRegionGroup Group consists of~
## 3 3 7 REF_RANGE_GROUP refInterRegionG~ Group consists of~
## 4 4 2 ASSEMBLY_HAPLOT~ mummer4 Assembly aligned ~
## 5 5 1 ANCHOR_HAPLOTYP~ GATK_PIPELINE GATK_PIPELINE cre~
## 6 6 1 ANCHOR_HAPLOTYP~ CONSENSUS CONSENSUS for cre~
## 7 7 1 ANCHOR_HAPLOTYP~ consensusTest consensusTest;det~
The above object will produce a tibble
-based data frame that will
contain return method data from the PHG database. Method IDs and
descriptions can be viewed for additional information detailing the
methods available to use when building the PHG graph object.
Graph building
Next, we can build the optimal graph object. A single function, called
graphBuilder
, will be used with subsequent parameters:
phgObj <- graphBuilder( configFile = config_path, methods = "GATK_PIPELINE" )
Where:
configFile
: the path to our PHG database configuration file;methods
: one of the prior methods in theshowMethods()
function call, specifically, themethod_name
column (see prior output).
When the graph has finished building, we can then inspect the object:
phgObj
## class: PHGDataSet
## dim: 20 6
## metadata(1): jObj
## assays(1): hapID
## rownames(20): R1 R11 ... R10 R20
## rowData names(1): refRange_id
## colnames(6): LineA LineA1 ... Ref RefA1
## colData names(0):
When built, the object that is generated is of a PHGDataSet
class. In
the next section, we will discuss how to extract specific data from this
object.
Accessing data
Overview
When called, the PHGDataSet
shows the similar output to a
SummarizedExperiment
or RangedSummarizedExperiment
class. More
information about this commonly used Bioconductor class can be found
here.
This is due to the PHGDataSet
class inheriting all methods and slot
information from a SummarizedExperiment
class. Therefore, we can use
general accessor methods from either the SummarizedExperiment
or
S4Vectors
packages. Some examples are as follows:
assay()
colData()
rowData()
rowRanges()
metadata()
A more detailed example of how these methods can be used in a
PHGDataSet
are shown below:
The three main data types that can be extracted is:
rowRanges()
;rowData()
: reference range datacolData()
;colnames()
: taxa (i.e. genotype) dataassay()
: haplotype or path ID data in matrix formmetadata()
: access to the PHG API Java object (for advanced use only)
Getting reference range data
Reference range data can be accessed via the rowRanges()
method. To
use this, simply call this function on the prior PHG object
(e.g. phgObj
from our prior examples):
# Make a `GRanges` object rr <- SummarizedExperiment::rowRanges(phgObj) rr
## GRanges object with 20 ranges and 1 metadata column:
## seqnames ranges strand | refRange_id
## <Rle> <IRanges> <Rle> | <factor>
## R1 1 1-3500 * | R1
## R11 1 3501-7500 * | R11
## R2 1 7501-11000 * | R2
## R12 1 11001-15000 * | R12
## R3 1 15001-18500 * | R3
## ... ... ... ... . ...
## R18 1 56001-60000 * | R18
## R9 1 60001-63500 * | R9
## R19 1 63501-67500 * | R19
## R10 1 67501-71000 * | R10
## R20 1 71001-75000 * | R20
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
What has been extracted is an object of GRanges
class. This is another
commonly used class in the “Bioconductor universe” to represent genomic
intervals. More information about this package can be found
here.
Within this objecct, we get specified reference range information:
seqnames
: the chromosome from which this reference range is found onranges
: anIRanges
object that contains reference range coordinatesstrand
: which strand for each reference range (+
,-
,*
)refRange_id
: given reference range ID within the PHG database
From this object, we can then use additional methods to extract more specific details such as start/stop coordinates and width of each reference range:
GenomicRanges::ranges(rr)
IRanges object with 20 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
R1 1 3500 3500
R11 3501 7500 4000
R2 7501 11000 3500
R12 11001 15000 4000
R3 15001 18500 3500
... ... ... ...
R18 56001 60000 4000
R9 60001 63500 3500
R19 63501 67500 4000
R10 67501 71000 3500
R20 71001 75000 4000
Due to the wonderful world of “inheritance”, we can also call this
function directly on SummarizedExperiment
objects and therefore our
main PHGDataSet
object:
SummarizedExperiment::ranges(phgObj)
IRanges object with 20 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
R1 1 3500 3500
R11 3501 7500 4000
R2 7501 11000 3500
R12 11001 15000 4000
R3 15001 18500 3500
... ... ... ...
R18 56001 60000 4000
R9 60001 63500 3500
R19 63501 67500 4000
R10 67501 71000 3500
R20 71001 75000 4000
Getting haplotype data
Additionally, haplotype identifiers can also be acessed from this data
object. This is obtained using the assays()
method, calling
specifically, the haplotype ID matrix ($hapID
):
SummarizedExperiment::assays(phgObj)$hapID
## LineA LineA1 LineB LineB1 Ref RefA1
## R1 41 101 61 121 21 81
## R11 51 111 71 131 31 91
## R2 42 102 62 122 22 82
## R12 52 112 72 132 32 92
## R3 43 103 63 123 23 83
## R13 53 113 73 133 33 93
## R4 44 104 64 124 24 84
## R14 54 114 74 134 34 94
## R5 45 105 65 125 25 85
## R15 55 115 75 135 35 95
## R6 46 106 66 126 26 86
## R16 56 116 76 136 36 96
## R7 47 107 67 127 27 87
## R17 57 117 77 137 37 97
## R8 48 108 68 128 28 88
## R18 58 118 78 138 38 98
## R9 49 109 69 129 29 89
## R19 59 119 79 139 39 99
## R10 50 110 70 130 30 90
## R20 60 120 80 140 40 100
The above matrix represents a haplotype ID for each taxa (column names) within the PHG database at each given reference range (row names).
Access Java data
(NOTE: for advanced use only!) To access Java object data for the
PHG API, the metadata()
method from S4Vectors
can be used. Doing so
will return a Java reference object that can be used for defining you
own potential functions and methods that do not exist within the rPHG
package:
S4Vectors::metadata(phgObj)$jObj
## [1] "Java-Object{net.maizegenetics.pangenome.api.HaplotypeGraph@f381794}"
Summary Functions
Overview
Additionally, rPHG
currently has several summary- and
visualization-based functions. These include ways to determine levels of
similarity between reference ranges and taxa.
Retrieve read mapping information from a PHG database
# Example path location (not run) configPath <- "/home/bm646/Temporary/phg_tests/configSQLite.txt" # Retrieve data object rmDF <- rPHG::readMappingsForLineName( configFile = configPath, lineName = "RefA1_gbs", readMappingMethodName = "HAP_COUNT_METHOD", haplotypeMethodName = "CONSENSUS" ) # Inspect `DataFrame` object rmDF
## DataFrame with 30 rows and 7 columns
## ref_range_id chrom start hapid taxa taxa_count range_count
## <integer> <factor> <integer> <integer> <factor> <integer> <integer>
## 1 8 1 45501 81 RefA1,Ref 0 0
## 2 8 1 45501 82 LineB1,LineB 0 0
## 3 8 1 45501 83 LineA,LineA1 0 0
## 4 3 1 13001 84 LineB1,LineB 0 0
## 5 3 1 13001 85 RefA1,Ref 0 0
## ... ... ... ... ... ... ... ...
## 26 5 1 26001 106 LineB1,LineB 0 3
## 27 5 1 26001 107 RefA1,Ref 3 3
## 28 4 1 19501 108 RefA1,Ref 1 1
## 29 4 1 19501 109 LineA,LineA1 0 1
## 30 4 1 19501 110 LineB1,LineB 0 1
This object will return two significant columns (range_count
and taxa_count
). range_count
is the number of reads mapping to a range while taxa_count
is the number of those reads mapping to a specific taxon in the range. taxa_count
<= range_count
.
Return a matrix for all the paths for pathMethod
# Example path location (not run) configPath <- "/home/bm646/Temporary/phg_tests/configSQLite.txt" # Get path matrix pathMet <- rPHG::pathsForMethod( configFile = configPath, pathMethod = "PATH_METHOD" ) # Inspect head of matrix head(pathMet)
## 1 2 3 4 5 6 7 8 9 10
## LineA1_gbs 103 95 86 109 105 96 88 83 91 99
## LineA1_wgs 103 95 86 109 105 96 88 83 91 99
## LineA_gbs 103 95 86 109 105 96 88 83 91 99
## LineA_wgs 103 95 86 109 105 96 88 83 91 99
## LineB1_gbs 104 94 84 110 106 97 87 82 90 101
## LineB1_wgs 104 94 84 110 106 97 87 82 90 101
Determine numbers of haplotypes per reference range
To see the total number of haplotypes at each given reference range, the
function, numHaploPerRange()
can be used:
rPHG::numHaploPerRange(phgObject = phgObj)
## DataFrame with 20 rows and 5 columns
## names start end width numHaplotypes
## <character> <integer> <integer> <integer> <integer>
## 1 R1 1 3500 3500 6
## 2 R11 3501 7500 4000 6
## 3 R2 7501 11000 3500 6
## 4 R12 11001 15000 4000 6
## 5 R3 15001 18500 3500 6
## ... ... ... ... ... ...
## 16 R18 56001 60000 4000 6
## 17 R9 60001 63500 3500 6
## 18 R19 63501 67500 4000 6
## 19 R10 67501 71000 3500 6
## 20 R20 71001 75000 4000 6
The data that is returned will contain each reference range coordinate,
IDs, and the number of haplotypes per range (numHaplotypes
). This data
can also be passed to a plotting function, plotNumHaplo()
which will
display the prior data as a linear genomic plot. If you are familiar
with pipe (%>%
) operators in R via the magrittr
package, this
pipeline from PHGDataSet
to visualization can be used as well:
library(magrittr) # (1) Non-pipe example haploPlot <- plotNumHaplo(numHaploPerRange(phgObject = phgObj)) # (2) Pipe example. Need to load `magrittr` package first! haploPlot <- phgObj %>% rPHG::numHaploPerRange() %>% rPHG::plotNumHaplo() # Return visualization haploPlot
Each reference range is displayed as an alternating blue-colored rectangle based on its position within the genome while the y-axis denotes the total number of unique haplotype IDs per reference range.
Visualize mutual information between pairs of ranges
plotMutualInfo()
plots the “amount of mutual information” obtained
about one random variable through observing the other random variable.
The higher the numeric value, the more related taxa for a given
reference range.
For this function, a PHGDataSet
and reference range IDs in the form of
a vector (e.g. c("R1", "R2", "R3")
) are needed. This function can also
be piped using the %>%
operator from magrittr
.
library(magrittr) phgObj %>% rPHG::plotMutualInfo( refRanges = assays(phgObj)$hapID %>% rownames() )
BrAPI access
Overview
As of October 2021, maize-related PHG data can be accessed via the web using the Breeding Application Interface (BrAPI). In this section, we will go over how you can retrieve such data.
The BrapiCon
object
To interface with the given BrAPI endpoints, a BrAPI connection object (BrapiCon
) must be made.
This object will the central component to how we retrieve data:
myCon <- BrapiCon("cbsudc01.biohpc.cornell.edu"); myCon
A BrAPI connection object
Server...........: cbsudc01.biohpc.cornell.edu
Port.............: 80
Server status....: 200
BrAPI version....: v2
This object can take several parameters. The primary argument is the host (i.e. the web address) to
where the PHG data is located. Currently it is located at cbsudc01.biohpc.cornell.edu
. If you
have custom port access and/or specific protocols (e.g. http
or https
), this will need to be
specified with the port
and protocol
arguments, respectively. In the next few sections, we will
see how this object is used with other rPHG
methods using the %>%
operator from magrittr
.
Retrieve available PHG methods
To see what available methods are present in the PHG web service, we can use the method
availablePHGMethods()
. Simply "pipe" (e.g. %>%
) the BrapiCon
object into this method:
myCon %>% availablePHGMethods()
This will return a simple dataframe of method names and the dimensionality of each data set:
# A tibble: 11 x 4
variantTableDbId numVariants numSamples additionalInfo
<chr> <int> <int> <lgl>
1 v5_gffGenes 57790 1 NA
2 Ames_NonMergedReadMapping_AllLines_Haploid 57725 5532 NA
3 mummer4_released_assemblies 57671 4 NA
4 mummer4_stitched_assemblies 57775 12 NA
5 NonMergedReadMapping_AllNamParents_Haploid 57598 6264 NA
6 mummer4_NAM_released_assemblies 57790 26 NA
7 ALL_MAIZE_1_COLLAPSED 57790 53 NA
8 mummer4_gap_filled_assemblies 57775 11 NA
9 MergedReadMapping_AllNamParents_Haploid 57598 4984 NA
10 Ames_MergedReadMapping_AllLines_Haploid 57725 3327 NA
11 mummer4_PATH 57790 53 NA
Reading in data to R memory
Now that we have an idea what PHG methods are available, we can specify a method argument to another
rPHG
method: PHGMethod()
. This object is an intermediate "pointer" between the core BrAPI interface
and a given PHG method. This will dictate how our read*
methods will work. A PHG data set is made
up of three primary components:
- samples (e.g. taxa)
- reference ranges (e.g. positions)
- feature data (e.g. a haplotype ID or index)
Reading samples
To read samples, we can use the readSamples()
method. This will point PHG method URL to the samples
endpoint and retrieve and parse data into R as a data.frame
object:
myMethod <- "NonMergedReadMapping_AllNamParents_Haploid" myCon %>% PHGMethod(myMethod) %>% readSamples()
# A tibble: 6,264 x 4
sampleName sampleDbId sampleDescription additionalInfo
<chr> <chr> <chr> <lgl>
1 Z001E0001-628NHAAXX_1 5039 genotype for adding the path NA
2 Z001E0001-D10RTACXX_5 5040 genotype for adding the path NA
3 Z001E0002-628NHAAXX_1 5041 genotype for adding the path NA
4 Z001E0002-D10RTACXX_5 5042 genotype for adding the path NA
5 Z001E0003-628NHAAXX_1 5043 genotype for adding the path NA
6 Z001E0003-D10RTACXX_5 5044 genotype for adding the path NA
7 Z001E0004-628NHAAXX_2 5045 genotype for adding the path NA
8 Z001E0004-D10RTACXX_5 5046 genotype for adding the path NA
9 Z001E0005-628NHAAXX_2 5047 genotype for adding the path NA
10 Z001E0005-D10RTACXX_5 5048 genotype for adding the path NA
# ... with 6,254 more rows
Reading reference ranges
To read reference range data, we can use the readRefRanges()
method. This will point PHG method URL to the variants
endpoint and retrieve and parse data into R as a GRanges
object:
myCon %>% PHGMethod(myMethod) %>% readRefRanges()
GRanges object with 57598 ranges and 1 metadata column:
seqnames ranges strand | variantDbId
<Rle> <IRanges> <Rle> | <numeric>
[1] 1 1-34616 * | 1
[2] 1 34617-40204 * | 2
[3] 1 40205-41213 * | 3
[4] 1 41214-46762 * | 4
[5] 1 46763-108553 * | 5
... ... ... ... . ...
[57594] 10 152012585-152012892 * | 57786
[57595] 10 152012893-152015823 * | 57787
[57596] 10 152015824-152200924 * | 57788
[57597] 10 152200925-152201511 * | 57789
[57598] 10 152201512-152435371 * | 57790
-------
seqinfo: 10 sequences from an unspecified genome; no seqlengths
From this GRanges
object, we return the usual sequence name (e.g. chromosome) info along
with the range start and stop locations that are 1-based. Additionally, one metadata
column will be returned that indicates the reference range database ID (e.g. variantDbId
)
that can used to link this data up with feature information described below.
Read feature information
In the last two sections, we have shown how you can retrieve row and column data respectively. To read in
the actual feature data (e.g. the cell information of the table), we can use the readTable()
method. This
function has one argument which specifies if you would want to return the actual haplotype ID information
stored in the PHG database or the indexes of the given haplotype data for a given reference range. This can
be specified with the index
parameter:
## Read in haplotype indexes ---- myCon %>% PHGMethod(myMethod) %>% readTable(index = TRUE) -> phgTableIndex phgTableIndex[1:10, 1:4]
23141 23142 23143 23144
Z001E0007-628NHAAXX_2 26 26 26 26
Z001E0007-D10RTACXX_5 26 26 26 26
Z001E0010-D10RTACXX_5 1 1 1 1
Z001E0011-D10RTACXX_5 26 26 26 26
Z001E0012-D10RTACXX_5 26 26 26 26
Z001E0014-D10RTACXX_5 1 1 1 1
Z001E0016-D10RTACXX_5 26 26 26 26
Z001E0018-D10RTACXX_5 26 26 26 26
Z001E0019-D10RTACXX_5 26 26 26 26
Z001E0020-628NJAAXX_5 26 26 26 26
## Read in haplotype IDs ---- myCon %>% PHGMethod(myMethod) %>% readTable() -> phgTableIds phgTableIds[1:10, 1:4]
23141 23142 23143 23144
Z001E0007-628NHAAXX_2 1726472 1726473 1726474 1726475
Z001E0007-D10RTACXX_5 1726472 1726473 1726474 1726475
Z001E0010-D10RTACXX_5 65540 65541 65542 65543
Z001E0011-D10RTACXX_5 1726472 1726473 1726474 1726475
Z001E0012-D10RTACXX_5 1726472 1726473 1726474 1726475
Z001E0014-D10RTACXX_5 65540 65541 65542 65543
Z001E0016-D10RTACXX_5 1726472 1726473 1726474 1726475
Z001E0018-D10RTACXX_5 1726472 1726473 1726474 1726475
Z001E0019-D10RTACXX_5 1726472 1726473 1726474 1726475
Z001E0020-628NJAAXX_5 1726472 1726473 1726474 1726475
From the prior tables, a matrix
object will be returned. Row names are taxa and column names
are reference range database IDs. This data can be linked up with prior sample and reference
range information.
Note: While the PHG is focused on reference ranges instead of SNP-based locations, matrix size can still be problematic on machines with limited resources. Currently, the largest maize PHG data set has a size of 6264 samples and 57598 reference ranges.
Note: Since this is web-based, data will be retrieved as URL requests, so optimal high-speed internet is recommended.
Filtering data (overview)
For certains use-cases, a user is not interested in the entire PHG data set, but merely a
"snapshot" of the data at given location (filtered reference ranges) and/or given taxa
(filtered samples). To filter data, we can use the methods filterRefRanges()
and filterSamples()
in conjunction with our read methods.
Filtering reference ranges
Filtering reference ranges can be specified in several ways. The main use case is to pass
range-based locations of interest as a GRanges
object with the gr
parameter:
## Example query queryRanges <- GRanges( seqnames = c("4"), ranges = IRanges( start = c(43300000), end = c(43450000) ) ) ## Use GRanges query for filter myCon %>% PHGMethod(myMethod) %>% filterRefRanges(gr = queryRanges) %>% readRefRanges()
We can also specify a whole chromosome or set of chromosomes if we wish:
## Select a whole chromosome(s) myCon %>% PHGMethod(myMethod) %>% filterRefRanges(chromosome = c(1, 5, 10)) %>% readRefRanges()
Or we can specify a given series of locations with vectors if you don't want to
deal with GRanges
:
myCon %>% PHGMethod(myMethod) %>% filterRefRanges( chromosome = c(1, 5, 10), start = c(100, 140, 150), end = c(10000, 30000, 35000) ) %>% readRefRanges()
And we can also specify a combination of both GRanges
objects and whole chromosomes if
we wish:
myCon %>% PHGMethod(myMethod) %>% filterRefRanges( gr = queryRanges, chromosome = c(1, 5, 10) ) %>% readRefRanges()
Filtering samples
We can filter on select samples by passing a vector of taxa as character
objects using
the filterSamples()
method:
querySamples <- c( "Z001E0066-628NHAAXX_1", "Z003E0197-627C3AAXX_3", "Z003E0160-705VVAAXX_3" ) myCon %>% PHGMethod(myMethod) %>% filterSamples(querySamples) %>% readSamples()
Filtering on both reference ranges and samples
Finally, we can add everything together and query whole tables. This will filter on both rows (taxa) and columns (reference ranges):
myCon %>% PHGMethod(myMethod) %>% filterRefRanges( gr = queryRanges, chromosome = c(1, 5, 10) ) %>% filterSamples(querySamples) %>% readTable()
How to ask for help and report bugs
Asking questions
Please use Biostars to ask for help. Instructions for using Biostars are here
PHG Workshop (June 4-8, 2018): Cornell University - Ithaca
PHG Workshop (August 17-18, 2018): IRRI - Philippines
Updated