Difficulty formatting GFF input

Peter McErlean

I've had issues with the whole .gff format but was able to get my analysis to work like this:

Open the .bed file in Excel and rearrange the columns as follows:

1: Chromosome (chr1 format)
2: Peak ID
3: Leave empty?
4: Peak start
5: Peak stop
6: Leave empty?
7: Strand (just put a full stop [ . ] in each row)
8: Leave empty?
9: Peak ID

Save as .txt.

Import the file into Galaxy. It will give the file a ‘.tabular’ extension. Export it and save it as a new file, keeping the .tabular extension.

I did this and ROSE was able to recognise the .tabular file as the input (for some reason?). Got my analysis in about 2 hours.

Good Luck!

2015-05-08T14:26:48+00:00

Sizun Jiang

A one-liner for converting MACS bed/peak files into ROSE-recognizable GFF files:

awk '{OFS="\t"; print $1, $4, ".", $2, $3, ".",".",".", $4}' yourpeakfile > yourgfffile
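
For anyone who prefers not to use awk, the same column mapping can be sketched in Python. This assumes the input is a MACS BED/peak file whose first four columns are chromosome, start, end and peak name, which is what the awk command above implies. The function names `bed_to_rose_gff_line` and `convert_bed_to_gff` are made up for this illustration.

```python
def bed_to_rose_gff_line(bed_line):
    """Turn one BED line into a 9-column, tab-separated row.

    Assumed BED layout: chrom, start, end, name, then anything else.
    Output layout mirrors the awk command: $1, $4, ".", $2, $3, ".", ".", ".", $4
    """
    fields = bed_line.split()  # split on any whitespace, like awk's default
    chrom, start, end, name = fields[0], fields[1], fields[2], fields[3]
    return "\t".join([chrom, name, ".", start, end, ".", ".", ".", name])


def convert_bed_to_gff(bed_path, gff_path):
    """Convert a whole BED file; blank lines are skipped (awk would not skip them)."""
    with open(bed_path) as bed, open(gff_path, "w") as gff:
        for line in bed:
            if not line.strip():
                continue
            gff.write(bed_to_rose_gff_line(line) + "\n")


if __name__ == "__main__":
    example = "chr1\t4775237\t4776463\tMACS_peak_1\t100.5"
    print(bed_to_rose_gff_line(example))
```

Apart from skipping blank lines, this is meant to produce the same rows as the awk one-liner. To convert a file, call `convert_bed_to_gff("yourpeakfile", "yourgfffile")`.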

2016-03-14T02:51:59+00:00

Jun

@Sizun Jiang: I used the command you suggested, and my GFF file was like:

chr1 MACS_peak_1 . 4775237 4776463 . . . MACS_peak_1
chr1 MACS_peak_2 . 4797355 4798114 . . . MACS_peak_2
chr1 MACS_peak_3 . 4847358 4848597 . . .

The example GFF file was like:

chr1 MM1S_MED1_DMSO_2_250 24904025 24905384 . MM1S_MED1_DMSO_2_250
chr6 MM1S_MED1_DMSO_2_15508 34963157 34965806 . MM1S_MED1_DMSO_2_15508
chr6 MM1S_MED1_DMSO_2_15426 31346865 31348446 . MM1S_MED1_DMSO_2_15426
chr5 MM1S_MED1_DMSO_2_14793 148193195 148195420 . MM1S_MED1_DMSO_2_14793
chr22 MM1S_MED1_DMSO_2_11539 28279173 28280453 . MM1S_MED1_DMSO_2_11539

So I tweaked the command into:

awk '{OFS="\t"; print $1, $4, $2, $3, ".", $4}' H3K27ac_MBDCKONCprec_peaks.bed > H3K27ac_MBDCKONCprec_peaks.gff

Then the GFF file looks, superficially at least, more like the example GFF file:

chr1 MACS_peak_1 4775237 4776463 . MACS_peak_1
chr1 MACS_peak_2 4797355 4798114 . MACS_peak_2
chr1 MACS_peak_3 4847358 4848597 . MACS_peak_3
chr1 MACS_peak_4 5072961 5073762 . MACS_peak_4

But then when I ran

python ROSE_main.py -g mm9 -i H3K27ac_MBDCKONCprec_peaksnew.gff -r H3K27ac_MBDCKONCprec.sorted.bam -c TFChIP_tCD25input.sort.bam -o precursortry -s 12500 -t 2500 &

the script just ignores every line in the GFF file:

USING mm9 AS THE GENOME
MAKING START DICT
LOADING IN GFF REGIONS
SKIPPING THIS LINE ['chr1', 'MACS_peak_1', '4775237', '4776463', '.', 'MACS_peak_1']
SKIPPING THIS LINE ['chr1', 'MACS_peak_2', '4797355', '4798114', '.', 'MACS_peak_2']
SKIPPING THIS LINE ['chr1', 'MACS_peak_3', '4847358', '4848597', '.', 'MACS_peak_3']

Is this because something is wrong with my GFF file? If so, how am I supposed to create the right format? Thanks!
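
One thing that may be worth checking is how many tab-separated fields each line has. The skipped lines in the log have six fields, while the layout in the first post and the original awk command both give nine. The example file may only look like six columns when pasted, because empty fields (columns 3, 6 and 8 in that layout) vanish when whitespace is collapsed. A minimal Python sketch for counting fields, assuming the file is tab-delimited; the function name `field_counts` is made up for this illustration and is not part of ROSE:

```python
from collections import Counter


def field_counts(lines):
    """Count how many lines have each number of tab-separated fields.

    Splitting on "\t" (not on any whitespace) keeps empty fields,
    so a row with blank columns 3, 6 and 8 still counts as nine fields.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line:  # ignore blank lines
            counts[len(line.split("\t"))] += 1
    return counts


if __name__ == "__main__":
    six = "chr1\tMACS_peak_1\t4775237\t4776463\t.\tMACS_peak_1"
    nine = "chr1\tMACS_peak_1\t.\t4775237\t4776463\t.\t.\t.\tMACS_peak_1"
    for width, n in sorted(field_counts([six, nine]).items()):
        print(width, "fields:", n, "line(s)")
```

To run it on a real file, pass an open file handle, for example `field_counts(open("yourgfffile"))`. If every line reports six fields, the nine-column output of the original awk command is probably the one to try.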

2016-05-07T08:51:55+00:00
