Difficulty formatting GFF input

Issue #12 new
Former user created an issue

I am having a hard time converting my BED file into an appropriate GFF that can be parsed. I tried the bedtogff module in GenePattern, but the result is still not quite right; see the attachment. What is the smoothest way to convert BED into GFF? Can you share a script? I confirmed that the example data runs fine.

Comments (3)

  1. Peter McErlean

    I've had issues with the whole .gff format but was able to get my analysis to work like this:

    Open .bed file in Excel and trim columns accordingly.

    1: Chromosome (chr1 format)
    2: Peak ID
    3: Leave empty?
    4: Peak start
    5: Peak stop
    6: Leave empty?
    7: Strand (just put a full stop [ . ] in each row)
    8: Leave empty?
    9: Peak ID

    Save as .txt.

    Import file into Galaxy. It will give the file a ‘.tabular’ extension. Export and save as new file with the .tabular extension.

    I did this and ROSE was able to recognise the .tabular file as the input (for some reason?). Got my analysis in about 2hrs.

    Good Luck!

  2. Sizun Jiang

    A one-liner for converting MACS BED/peak files into GFF files that ROSE recognizes:

    awk '{OFS="\t"; print $1, $4, ".", $2, $3, ".",".",".", $4}' yourpeakfile > yourgfffile
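For anyone who prefers a script to a one-liner, the same column mapping can be sketched in Python. This is a minimal sketch, not part of ROSE: the names `bed_row_to_gff`, `convert`, and `write_gff` are hypothetical, the `peak_N` fallback ID for BED rows without a name column is an assumption, and coordinates are copied unchanged, exactly as the awk command does. It also writes `.` in the unused columns, as the awk command does, rather than leaving them empty.

```python
def bed_row_to_gff(fields, index):
    """Map one BED row (a list of strings) to a nine-column GFF-style row."""
    if len(fields) < 3:
        raise ValueError(f"BED row {index} has fewer than 3 columns: {fields!r}")
    chrom, start, end = fields[0], fields[1], fields[2]
    # Use the BED name column when present; otherwise fall back to a
    # placeholder ID (an assumption made for this sketch).
    name = fields[3] if len(fields) > 3 else f"peak_{index}"
    # Same order as the awk one-liner: chrom, ID, ., start, end, ., ., ., ID
    return [chrom, name, ".", start, end, ".", ".", ".", name]


def convert(bed_lines):
    """Yield GFF-style rows, skipping blank lines and track/browser/# headers."""
    index = 0
    for line in bed_lines:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        index += 1
        # Split on any whitespace, mirroring awk's default field splitting.
        yield bed_row_to_gff(line.split(), index)


def write_gff(bed_path, gff_path):
    """Convert the BED file at bed_path and write tab-separated rows to gff_path."""
    with open(bed_path) as bed, open(gff_path, "w") as gff:
        for row in convert(bed):
            gff.write("\t".join(row) + "\n")
```

Calling `write_gff("yourpeakfile", "yourgfffile")` should produce the same rows as the awk command for a BED file whose fourth column holds the peak name.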

  3. Jun

    @Sizun Jiang: I used the command you suggested, and my GFF file looked like this:

    chr1 MACS_peak_1 . 4775237 4776463 . . . MACS_peak_1
    chr1 MACS_peak_2 . 4797355 4798114 . . . MACS_peak_2
    chr1 MACS_peak_3 . 4847358 4848597 . . . MACS_peak_3

    [image: H3K27ac_MBDCKONCprec_peaks_Plot_points.png]

    The example GFF file looked like this:

    chr1 MM1S_MED1_DMSO_2_250 24904025 24905384 . MM1S_MED1_DMSO_2_250
    chr6 MM1S_MED1_DMSO_2_15508 34963157 34965806 . MM1S_MED1_DMSO_2_15508
    chr6 MM1S_MED1_DMSO_2_15426 31346865 31348446 . MM1S_MED1_DMSO_2_15426
    chr5 MM1S_MED1_DMSO_2_14793 148193195 148195420 . MM1S_MED1_DMSO_2_14793
    chr22 MM1S_MED1_DMSO_2_11539 28279173 28280453 . MM1S_MED1_DMSO_2_11539

    So I tweaked the command into:

    awk '{OFS="\t"; print $1, $4, $2, $3, ".", $4}' H3K27ac_MBDCKONCprec_peaks.bed > H3K27ac_MBDCKONCprec_peaks.gff

    The GFF file then looks superficially more like the example GFF file:

    chr1 MACS_peak_1 4775237 4776463 . MACS_peak_1
    chr1 MACS_peak_2 4797355 4798114 . MACS_peak_2
    chr1 MACS_peak_3 4847358 4848597 . MACS_peak_3
    chr1 MACS_peak_4 5072961 5073762 . MACS_peak_4

    But then when I ran

    python ROSE_main.py -g mm9 -i H3K27ac_MBDCKONCprec_peaksnew.gff -r H3K27ac_MBDCKONCprec.sorted.bam -c TFChIP_tCD25input.sort.bam -o precursortry -s 12500 -t 2500 &

    the script just skipped every line in the GFF file:

    USING mm9 AS THE GENOME
    MAKING START DICT
    LOADING IN GFF REGIONS
    SKIPPING THIS LINE ['chr1', 'MACS_peak_1', '4775237', '4776463', '.', 'MACS_peak_1']
    SKIPPING THIS LINE ['chr1', 'MACS_peak_2', '4797355', '4798114', '.', 'MACS_peak_2']
    SKIPPING THIS LINE ['chr1', 'MACS_peak_3', '4847358', '4848597', '.', 'MACS_peak_3']

    Is this because something is wrong with my GFF file? If so, how am I supposed to create the right format? Thanks!
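One way to narrow this down is to check the file's shape before running ROSE. The sketch below assumes, based on the column layout in the first comment and the original nine-field awk command, that each line should have nine tab-separated fields with the start and end coordinates in columns 4 and 5; that assumption is not confirmed anywhere in this thread. The function names `check_gff_line` and `check_gff` are hypothetical. A possible reason the example file looks like it has only six columns is that empty fields collapse when a tab-separated file is pasted as text, but that is also a guess.

```python
def check_gff_line(line):
    """Return a list of problems found in one GFF line (empty list if none)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        # Stop here: column positions are meaningless with the wrong field count.
        return [f"expected 9 tab-separated fields, found {len(fields)}"]
    problems = []
    # Assumed layout: start in column 4, end in column 5 (indexes 3 and 4).
    for label, pos in (("start", 3), ("end", 4)):
        if not fields[pos].isdigit():
            problems.append(
                f"{label} (column {pos + 1}) is not an integer: {fields[pos]!r}"
            )
    return problems


def check_gff(lines):
    """Yield (line_number, problems) for every non-blank line that has problems."""
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        problems = check_gff_line(line)
        if problems:
            yield number, problems
```

Under these assumptions, a line from the tweaked six-column file is reported as having six fields instead of nine, while the output of the original nine-field command passes the check.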
