What is RGmatch for?
RGmatch is a flexible and easy-to-use tool to match genomic regions to the closest gene (also transcript or exon), which provides the area of the gene where the region overlaps. The algorithm can be applied to any organism as long as the genome annotation is available.
What input files do I need?
In order to compute the associations, RGmatch needs the following information:
- A GTF annotation file providing the chromosome positions of all the features to be considered (genes, transcripts and exons). This GTF file should include annotations at exon level, that is, the 3rd column of the GTF must contain "exon" tag.
- A BED file containing the regions of interest to be associated to features.
Both the GTF and the BED files could be passed compressed in Gzip format.
Chromosome naming convention should be the same in the GTF and the BED file !
You can either download the GTF file from any public database or use your own built GTF file as long as it has the following format:
1 havana gene 3073253 3074322 . + . gene_id "ENSMUSG00000102693"; gene_version "1"; gene_name "RP23-271O17.1"; gene_source "havana"; gene_biotype "TEC"; 1 havana transcript 3073253 3074322 . + . gene_id "ENSMUSG00000102693"; gene_version "1"; transcript_id "ENSMUST00000193812"; transcript_version "1"; gene_name "RP23-271O17.1"; gene_source "havana"; gene_biotype "TEC"; transcript_name "RP23-271O17.1-001"; transcript_source "havana"; transcript_biotype "TEC"; tag "basic"; 1 havana exon 3073253 3074322 . + . gene_id "ENSMUSG00000102693"; gene_version "1"; transcript_id "ENSMUST00000193812"; transcript_version "1"; exon_number "1"; gene_name "RP23-271O17.1"; gene_source "havana"; gene_biotype "TEC"; transcript_name "RP23-271O17.1-001"; transcript_source "havana"; transcript_biotype "TEC"; exon_id "ENSMUSE00001343744"; exon_version "1"; tag "basic";
The current example corresponds to the Mus musculus annotation from Ensembl (mmusculus.gtf).
The BED file used as input for RGmatch should have at least three columns: chromosome, start position, end position (read more about BED format at BED specification). Example:
1 10 15 1 103 129 ...
These are the arguments for the RGmatch algorithm:
Usage: python rgmatch.py [options] <mandatory> Options: -r, --report: Report at the 'exon', 'transcript' or 'gene' level. Default: 'exon' -q, --distance: Maximum distance in kb to report associations. Default: 10 (10kb) -t, --tss: TSS region distance. Default: 200 bps -s, --tts: TTS region distance. Default: 0 bps -p, --promoter: Promoter region distance. Default: 1300 bps -v, --perc_area: Percentage of the area of the gene overlapped to be considered to discriminate at transcript and gene level. Default: 90 (90%) -w, --perc_region: Percentage of the region overlapped by the gene to be considered to discriminate at transcript and gene level. Default: 50 (50%) -R, --rules: File containing the priorities in case of ties. Default: TSS,1st_EXON,PROMOTER,TTS,INTRON,GENE_BODY,UPSTREAM,DOWNSTREAM -G, --gene: GTF tag used to get gene ids/names. Default: gene_id -T, --transcript: GTF tag used to get transcript ids/names. Default: transcript_id -h, --help: show this help message and exit Mandatory: -g, --gtf: GTF annotation file -b, --bed: Region bed file -o, --output: Output file 05/11/2015. Pedro Furió Tarí.
To run RGmatch with default parameters, the following code should be used:
python rgmatch.py -g Mus_musculus.GRCm38.79.gtf -b myregions.bed -o myassociations.txt
In this case, the region to gene associations will be computed as follows:
- All possible associations and areas will be reported (aggregation at exon level). This means that if a region overlaps several areas of a given gene, all of them will be returned.
- The maximum distance of feature associations will be of 10 kb upstream or downstream.
- The TSS area will start 200 nucleotides upstream the TSS and will end at the TSS (see Figure below).
- The Promoter area will have a length of 1300 nucleotides. Thus, it will start at 1500 nucleotides upstream the TSS and end at 200 nucleotides upstream the TSS (see Figure below).
When the aggregation level "transcript" or "gene" is chosen, the algorithm selects a unique annotation area according to the values assigned to the %Region, and %Area parameters and the decision rules shown in the diagram below. If there is a transcript area for which %Region >= (50% by default), this area will be the annotation for that region-transcript association. Otherwise, the algorithm takes the area with %Area >= (90% by default. When this condition is met by several areas, the one with max(%Region) will be selected. In case of ties, the selected area is determined according to the priorities assigned by the user. By default: TSS, 1st_EXON, PROMOTER, TTS, INTRON, GENE_BODY, UPSTREAM, DOWNSTREAM.
If we want the results to be aggregated at transcript or gene level using the predefined rules, we need to add the "-r or --report" argument as follows:
python rgmatch.py -g Mus_musculus.GRCm38.79.gtf -b myregions.bed -r transcript -o myassociations_transcript.txt python rgmatch.py -g Mus_musculus.GRCm38.79.gtf -b myregions.bed -r gene -o myassociations_gene.txt
It is possible to change the percentages of the area of the gene overlapping the region when computing the associations at transcript and gene level. These parameters can be modified by adding the arguments "-w or --perc_region" and "-v or --perc_area", respectively, followed by the new percentages.
The area priorities to be used in case of ties can also be modified. Defining new priorities is as simple as typing -R followed by the new desired rules separated by commas and with no spaces between them. In case of missing areas or misspelling, RGmatch will not work.
python rgmatch.py -g Mus_musculus.GRCm38.79.gtf -b myregions.bed -r gene -R TSS,PROMOTER,1st_EXON,GENE_BODY,TTS,INTRON,UPSTREAM,DOWNSTREAM -o myassociations_gene_newPriorities.txt
The output file generated by RGmatch is a matrix containing in each row the different associations the algorithm was able to find. The columns contain the following information:
- Region: Identifier of the region that is being associated. This identifier is created from chromosome position like this: chrom_start_end.
- Midpoint: Midpoint of the region being reported.
- Gene: Gene name or id associated to the region.
- Transcript: Transcript name or id associated to the region. When choosing gene aggregation level, this column can contain more than one transcript separated by "_", meaning that there are ties between those transcripts.
- Exon: When choosing exon aggregation level, this column reports the exon number the region has been associated with. Otherwise, -1.
- Area: It will be one of the following: TSS, 1st_EXON, PROMOTER, INTRON, GENE_BODY, UPSTREAM, DOWNSTREAM.
- Distance: Distance from the midpoint of the region to the TSS (for UPSTREAM, PROMOTER or TSS area), to the TTS (for DOWNSTREAM or TTS area), or 0 if the region overlaps the gene (GENE_BODY, 1st_EXON or INTRON areas).
- PercRegion: Percentage of the region overlapping the reported area.
- PercArea: Percentage of the area overlapped by the reported region.
Pedro Furió / Sonia Tarazona
Genomics of Gene Expression Lab
Centro de Investigación Príncipe Felipe
Eduardo Primo Yúfera, 3 46012 Valencia (Spain)