Difficulty formatting GFF input
I am having a hard time converting my BED file into a GFF file that can be parsed. I tried the bedtogff module in GenePattern, but the output is still not quite right. See attachment. What is the smoothest way to convert BED into GFF? Can you share a script? I confirmed that the example data runs fine.
Comments (3)
-
A one-liner for converting MACS BED/peak files into GFF files that ROSE recognizes:
awk '{OFS="\t"; print $1, $4, ".", $2, $3, ".",".",".", $4}' yourpeakfile > yourgfffile
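For anyone who prefers Python, here is a rough equivalent of the one-liner above. It is only a sketch: the function names are made up for illustration, it assumes the peak ID is in the fourth BED column (as in the awk command), and, unlike the awk command, it skips blank lines and `#`/`track`/`browser` header lines. Like the one-liner, it copies the start and end coordinates unchanged, so it does not convert the BED 0-based start to the 1-based start that GFF normally uses.

```python
import sys


def bed_to_gff_rows(bed_lines):
    """Yield 9-column rows: chrom, ID, ".", start, end, ".", ".", ".", ID."""
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and common BED header lines (an addition to the awk version).
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()  # split on any whitespace, as awk does by default
        if len(fields) < 4:
            raise ValueError("need at least 4 BED columns, got: %r" % line)
        chrom, start, end, name = fields[:4]
        yield [chrom, name, ".", start, end, ".", ".", ".", name]


def convert(bed_path, gff_path):
    """Write a tab-separated 9-column file from a BED/peak file."""
    with open(bed_path) as bed, open(gff_path, "w") as gff:
        for row in bed_to_gff_rows(bed):
            gff.write("\t".join(row) + "\n")


if __name__ == "__main__" and len(sys.argv) == 3:
    convert(sys.argv[1], sys.argv[2])
```

If the script is saved as, say, `bed_to_gff.py` (a hypothetical name), it would be run as `python bed_to_gff.py yourpeakfile yourgfffile`.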
-
@Sizun Jiang: I used the command you suggested, and my GFF file looked like this:
chr1 MACS_peak_1 . 4775237 4776463 . . . MACS_peak_1
chr1 MACS_peak_2 . 4797355 4798114 . . . MACS_peak_2
chr1 MACS_peak_3 . 4847358 4848597 . . .
The example GFF file looked like this:
chr1 MM1S_MED1_DMSO_2_250 24904025 24905384 . MM1S_MED1_DMSO_2_250
chr6 MM1S_MED1_DMSO_2_15508 34963157 34965806 . MM1S_MED1_DMSO_2_15508
chr6 MM1S_MED1_DMSO_2_15426 31346865 31348446 . MM1S_MED1_DMSO_2_15426
chr5 MM1S_MED1_DMSO_2_14793 148193195 148195420 . MM1S_MED1_DMSO_2_14793
chr22 MM1S_MED1_DMSO_2_11539 28279173 28280453 . MM1S_MED1_DMSO_2_11539
So I tweaked the command to:
awk '{OFS="\t"; print $1, $4, $2, $3, ".", $4}' H3K27ac_MBDCKONCprec_peaks.bed > H3K27ac_MBDCKONCprec_peaks.gff
Then the GFF file superficially looks more like the example GFF file:
chr1 MACS_peak_1 4775237 4776463 . MACS_peak_1
chr1 MACS_peak_2 4797355 4798114 . MACS_peak_2
chr1 MACS_peak_3 4847358 4848597 . MACS_peak_3
chr1 MACS_peak_4 5072961 5073762 . MACS_peak_4
But then when I ran
python ROSE_main.py -g mm9 -i H3K27ac_MBDCKONCprec_peaksnew.gff -r H3K27ac_MBDCKONCprec.sorted.bam -c TFChIP_tCD25input.sort.bam -o precursortry -s 12500 -t 2500 &
the script just ignored every line in the GFF file:
USING mm9 AS THE GENOME
MAKING START DICT
LOADING IN GFF REGIONS
SKIPPING THIS LINE ['chr1', 'MACS_peak_1', '4775237', '4776463', '.', 'MACS_peak_1']
SKIPPING THIS LINE ['chr1', 'MACS_peak_2', '4797355', '4798114', '.', 'MACS_peak_2']
SKIPPING THIS LINE ['chr1', 'MACS_peak_3', '4847358', '4848597', '.', 'MACS_peak_3']
Is this because something is wrong with my GFF file? If so, how am I supposed to create the right format? Thanks!
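One thing that may be worth checking: each skipped line in the log has six fields, whereas the original one-liner writes nine tab-separated columns with start and end in columns 4 and 5. It is possible that the example file also has nine columns and that its empty fields simply do not show when the file is viewed. Whether the column count is what triggers SKIPPING THIS LINE is an assumption and is not confirmed in this thread. The sketch below (the function name is hypothetical) only checks a line against that nine-column layout; it does not reproduce any ROSE code.

```python
def check_gff_line(line):
    """Return a list of problems for one tab-separated line (empty list if none).

    Assumed layout (not taken from ROSE itself): 9 columns,
    with integer start and end in columns 4 and 5.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return ["expected 9 tab-separated columns, found %d" % len(fields)]
    problems = []
    for label, value in (("start", fields[3]), ("end", fields[4])):
        if not value.isdigit():
            problems.append("%s is not an integer: %r" % (label, value))
    return problems


# The six-field line from the log above fails the check.
skipped = "\t".join(["chr1", "MACS_peak_1", "4775237", "4776463", ".", "MACS_peak_1"])
print(check_gff_line(skipped))
```

Running the same check over every line of a GFF file before starting ROSE would at least show whether the file matches the nine-column layout.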
-
I've had issues with the whole .gff format but was able to get my analysis to work like this:
Open .bed file in Excel and trim columns accordingly.
1: Chromosome (chr1 format)
2: Peak ID
3: Leave empty?
4: Peak start
5: Peak stop
6: Leave empty?
7: Strand (just put full stop [ . ] in each row)
8: Leave empty?
9: Peak ID
Save as .txt.
Import file into Galaxy. It will give the file a ‘.tabular’ extension. Export and save as new file with the .tabular extension.
I did this and ROSE was able to recognise the .tabular file as the input (for some reason?). Got my analysis in about 2 hours.
Good Luck!