sparkseq / InputFiles

The GTF format is, unfortunately, not a very well specified file format. Several specification documents exist, from different groups, and they contradict each other on some points. The short guides below describe the feature-array input file formats supported by SparkSeq and how to derive them from Ensembl's GTF files:

1. BED-like format with exon coordinates (merged transcripts)

To convert a GTF file into a list of merged exons for Homo sapiens, follow these steps:

  • Run these commands from your Linux shell inside the bin directory of your bedtools2 build directory:
#pick up columns
grep -i "exon" ../../Homo_sapiens.GRCh37.74.gtf | grep -v stop_codon|grep -v start_codon|grep -v CDS |cut -f1,4,5,6,9  | cut -f1,3 -d";" | sed 's/"//g' | sed 's/gene_id//g'| sed 's/; exon_number /\t/g' > ../../Homo_sapiens.GRCh37.74_exons.bed

#cleanup
cat ../../Homo_sapiens.GRCh37.74_exons.bed | grep -i "^[[:digit:]]\|^MT\|^X\|^Y" | sed 's/^/chr/g' >../../Homo_sapiens.GRCh37.74_exons_chr.bed

#sort
sort -k1,1 -k2,2n -k3,3n ../../Homo_sapiens.GRCh37.74_exons_chr.bed |awk '{print($1,"\011",$2,"\011",$3,"\011",$5)}' | sed 's/ //g'> ../../Homo_sapiens.GRCh37.74_exons_chr_sort.bed

#merge
./mergeBed -nms -i ../../Homo_sapiens.GRCh37.74_exons_chr_sort.bed | cut -f1 -d',' > ../../Homo_sapiens.GRCh37.74_exons_chr_merged.bed

#add artificial key for merged exons
i=0; while read -r line; do printf '%s\t%s\n' "$line" "$i"; i=$((i+1)); done < ../../Homo_sapiens.GRCh37.74_exons_chr_merged.bed > ../../Homo_sapiens.GRCh37.74_exons_chr_merged_id.bed

#cleanup
cat ../../Homo_sapiens.GRCh37.74_exons_chr_merged_id.bed | awk '{print($1,"\t",$2,"\t",$3,"\t",".","\t",$4,"\t",$5)}' | sed 's/ //g' > ../../Homo_sapiens.GRCh37.74_exons_chr_merged_id_st.bed
  • The resulting file, Homo_sapiens.GRCh37.74_exons_chr_merged_id_st.bed, is the one to use with SparkSeq.
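
To see what the "pick up columns" step above extracts, the sketch below runs the same cut/sed chain on a single synthetic exon record. The record and its identifiers (ENSG_TEST, ENST_TEST, TEST) are made up for illustration, and the attribute order is assumed to match the Ensembl release 74 layout that the commands rely on (gene_id first, exon_number third). The `\t` in the sed replacement assumes GNU sed, as do the commands above.

```shell
# One synthetic exon record in GTF layout (tab-separated; all values are made up).
gtf_line=$(printf '1\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "ENSG_TEST"; transcript_id "ENST_TEST"; exon_number "1"; gene_name "TEST";')

# Same column-picking chain as in the "pick up columns" step:
# keep chromosome, start, end, score and the attribute column,
# then keep only the gene_id and exon_number attributes.
picked=$(printf '%s\n' "$gtf_line" \
  | cut -f1,4,5,6,9 \
  | cut -f1,3 -d";" \
  | sed 's/"//g' | sed 's/gene_id//g' | sed 's/; exon_number /\t/g')

# Show the result; a stray space remains before the gene id,
# which the later awk/sed steps remove.
printf '%s\n' "$picked"
```

If the attribute order in your GTF release differs, the field numbers passed to the second cut would need adjusting.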

Already pre-processed BED-like files for human:

Homo_sapiens.GRCh37.74_exons_chr_merged_id_st.bed

2. BED-like format with exon coordinates (unique transcripts)

  • Run these commands from your Linux shell inside the bin directory of your bedtools2 build directory:
#pick up columns
cat ../../Homo_sapiens.GRCh37.74.gtf |grep -i "exon" | grep -v stop_codon|grep -v start_codon|grep -v CDS |cut -f1,4,5,6,9  | cut -f1,7 -d";" | sed 's/gene_id//g'| sed 's/; exon_id /\t/g' | sed 's/"//g' | sed 's/ //g' | sed 's/^/chr/g' > ../../Homo_sapiens.GRCh37.74_exons.bed

#sort and remove duplicate rows
sort -k1,1 -k2,2n -k3,3n ../../Homo_sapiens.GRCh37.74_exons.bed |awk '{print($1,"\011",$2,"\011",$3,"\011",$4,"\011",$5,"\011",$6)}' | sed 's/ //g' | uniq > ../../Homo_sapiens.GRCh37.74_exons_chr_sort.bed

#group by
./groupBy -i ../../Homo_sapiens.GRCh37.74_exons_chr_sort.bed -g 1,2,3,4,5 -c 6 -o first >  ../../Homo_sapiens.GRCh37.74_exons_chr_sort_uniq.bed
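
As a rough illustration of what the "group by" step does, the sketch below mimics it with plain awk on a few synthetic rows. It is only an emulation under stated assumptions: the rows and identifiers (GENE_A, EXON_1, ...) are made up, and unlike groupBy, which groups consecutive rows of sorted input, the awk version groups by key regardless of order.

```shell
# Synthetic, already sorted input: chr, start, end, score, gene_id, exon_id (values are made up).
sample=$(printf 'chr1\t100\t200\t.\tGENE_A\tEXON_1\nchr1\t100\t200\t.\tGENE_A\tEXON_2\nchr1\t300\t400\t.\tGENE_A\tEXON_3\n')

# Keep the first exon_id seen for each (chr, start, end, score, gene_id) group.
# This only mimics "groupBy -g 1,2,3,4,5 -c 6 -o first"; it does not call bedtools.
grouped=$(printf '%s\n' "$sample" | awk -F'\t' -v OFS='\t' '
  { key = $1 OFS $2 OFS $3 OFS $4 OFS $5 }
  !(key in seen) { seen[key] = 1; print key, $6 }')

printf '%s\n' "$grouped"
```

The two rows sharing the same coordinates and gene collapse into one, so each exon interval appears once per gene in the result.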

Already pre-processed BED-like files for human can be downloaded from here:

Homo_sapiens.GRCh37.74_exons_chr_sort_uniq.bed

3. BED-like format with gene coordinates (merged transcripts)

  • Run these commands from your Linux shell inside the bin directory of your bedtools2 build directory:
#pick up columns
cat ../../Homo_sapiens.GRCh37.74.gtf |grep -i "exon" | grep -v stop_codon|grep -v start_codon|grep -v CDS |cut -f1,4,5,6,9  | cut -f1,7 -d";" | sed 's/gene_id//g'| sed 's/; exon_id /\t/g' | sed 's/"//g' | sed 's/ //g' | sed 's/^/chr/g' > ../../Homo_sapiens.GRCh37.74_exons.bed

#sort and remove duplicate rows
sort -k1,1 -k2,2n -k3,3n ../../Homo_sapiens.GRCh37.74_exons.bed |awk '{print($1,"\011",$2,"\011",$3,"\011",$4,"\011",$5)}' | sed 's/ //g' | uniq > ../../Homo_sapiens.GRCh37.74_genes_chr_sort.bed

#merge
./groupBy -i ../../Homo_sapiens.GRCh37.74_genes_chr_sort.bed -g 1,4,5 -c 2,3 -o min,max > ../../Homo_sapiens.GRCh37.74_genes_chr_merged.bed

#reorder columns
cat ../../Homo_sapiens.GRCh37.74_genes_chr_merged.bed | awk '{print($1,"\t",$4,"\t",$5,"\t",".","\t",$3,"\t",$3)}' | sed 's/ //g' > ../../Homo_sapiens.GRCh37.74_genes_chr_merged_swap.bed
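
The last two commands collapse all exons of a gene into one interval (smallest start, largest end) and then reorder the columns. The sketch below reproduces that idea with plain awk on synthetic rows, so it can be tried without bedtools. The rows and gene names (GENE_A, GENE_B) are made up, and the awk logic is an emulation, not the bedtools implementation.

```shell
# Synthetic, sorted exon rows: chr, start, end, score, gene_id (values are made up).
exons=$(printf 'chr1\t100\t200\t.\tGENE_A\nchr1\t300\t450\t.\tGENE_A\nchr2\t50\t80\t.\tGENE_B\n')

# For each (chr, score, gene_id) group, track the smallest start and the largest end,
# then print: chr, start, end, ".", gene_id, gene_id (the layout of the *_swap.bed file).
# This mimics "groupBy -g 1,4,5 -c 2,3 -o min,max" plus the column reordering.
genes=$(printf '%s\n' "$exons" | awk -F'\t' -v OFS='\t' '
  {
    key = $1 OFS $4 OFS $5
    if (!(key in lo)) { lo[key] = $2; hi[key] = $3; order[++n] = key }
    if ($2 + 0 < lo[key] + 0) lo[key] = $2
    if ($3 + 0 > hi[key] + 0) hi[key] = $3
  }
  END {
    for (i = 1; i <= n; i++) {
      split(order[i], part, OFS)
      print part[1], lo[order[i]], hi[order[i]], ".", part[3], part[3]
    }
  }')

printf '%s\n' "$genes"
```

Note that the gene id is written twice, once in the gene column and once in the last column, mirroring the `$3` repeated in the awk command above.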

Already pre-processed BED-like files for human can be downloaded from here:

Homo_sapiens.GRCh37.74_genes_chr_merged_swap.bed
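
All three pipelines above end in a six-column, tab-separated file with a "chr" prefix on the first column. A quick sanity check along these lines may help catch a broken intermediate step. The helper name check_bed6 is hypothetical, and the rules it enforces (six fields, "chr" prefix, numeric coordinates with start not greater than end) are assumptions inferred from the commands on this page, not a documented SparkSeq requirement.

```shell
# check_bed6 is a hypothetical helper, not part of SparkSeq or bedtools.
# It returns non-zero if any row does not have six tab-separated fields,
# a "chr" prefix, or numeric coordinates with start <= end.
check_bed6() {
  awk -F'\t' '
    NF != 6 || $1 !~ /^chr/ || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 > $3 + 0 {
      printf "line %d looks malformed: %s\n", NR, $0 > "/dev/stderr"
      bad = 1
    }
    END { exit bad + 0 }' "$1"
}

# Usage on a small synthetic file (values are made up).
tmp=$(mktemp)
printf 'chr1\t100\t200\t.\tGENE_A\t0\nchr1\t300\t400\t.\tGENE_A\t1\n' > "$tmp"
check_bed6 "$tmp" && echo "file looks well-formed"
rm -f "$tmp"
```

On a real run you would pass one of the generated files instead, for example ../../Homo_sapiens.GRCh37.74_exons_chr_merged_id_st.bed.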
