Input files and how to get them

Issue #15 resolved
Rola Dali created an issue

Hello,

I am trying to run CHiCAGO to incorporate it in our capture HiC analysis pipelines. I am having some trouble with my input files and would like to understand how to get the files in the correct format.

Here is my current understanding: I will need 6 files, as follows:

1- restriction map file (.rmap)

2- Baitmap file (.baitmap)

3- nperbin file (.npb)

4- nbaitsperbin file (.nbpb)

5- proxOE file (.poe)

6- input file (.chinput)

What I have:

-bam aligned by HiCUP

-HiCUP digest file

-bait coordinates


My understanding is that I can get files 3,4,5 by running makeDesignFiles.py but I will need .rmap and .baitmap.

I can also get file 6 by using bam2chicago.sh. but I also need .rmap and .baitmap.

So i am missing files 1 and 2 (.rmap, .baitmap) and am wondering how to get those. Are there scripts to do so?

I thought I can make the .rmap file by taking the first 3 columns of the HiCUP digest file and assigning an arbitrary "numeric ID" and I can make the .baitmap from the 3 first columns of the chip bait coordinates and an arbitrary "numeric ID". but I am confused by the following comment in the README.md file " [bait coordinates] (should be a subset of the fragments listed in rmapfile), their numeric IDs (should match those listed in rmapfile for the corresponding fragments)".

Is there a better way to produce the .rmap and .baitmap files and how do I coordinate the ID of the rmap file with those in the baitmap file?

Thank you, Rola

Comments (8)

  1. Mikhail Spivakov

    Sorry about the confusion.

    You are right about making the .rmap file from the HiCUP digest file. All you need to do to make a baitmap file is take the subset of .rmap rows that corresponds to baited fragments and save them as .baitmap.

    Hope it helps!

  2. Dimitris Hat

    can someone explain to me what this means : "All you need to do to make a baitmap file is take the subset of .rmap rows that corresponds to baited fragments and save them as .baitmap." ?

  3. Mehmet Tekman

    I have to admit that the lack of answers for this problem that likely 99% of people who used HiCUP and then Chicago have faced is really startling.

    Desired Solution

    So the solution should be to download the create_baitmap_rmap.pl file (which for some reason isn’t included in the latest conda), and then run:

    perl create_baitmap_rmap.pl Digest_Mouse_GRCm38_NlaIII.txt my_target_sites.bed
    

    where, Digest_Mouse_GRCm38_NlaIII.txt is a file that contains ~10M lines like:

    Genome:Mouse_GRCm38     Restriction_Enzyme1:NlaIII [CATG^]      Restriction_Enzyme2:None        Hicup Digester vers>
    Chromosome      Fragment_Start_Position Fragment_End_Position   Fragment_Number RE1_Fragment_Number     5'_Restrict>
    1       1       3000185 1       1       None    Re1
    1       3000186 3000316 2       2       Re1     Re1
    1       3000317 3000850 3       3       Re1     Re1
    

    and my_target_sites.bed contains ~100 lines like:

    1       4496944 4497063 Sox17-1
    1       4497564 4497683 Sox17-2
    1       71651908        71652027        Fn1-1
    1       71652056        71652175        Fn1-2
    

    (where the last column are my unique feature names that I invented)

    So running the above perl command should generate the rmap and bait files, but in practice, what happens is that the rmap file is generated, and bait file is left blank, with numerous errors printed to the console like “Use of uninitialized value in string ne at create_baitmap_rmap.pl line 103”

    Real Solution

    Okay, so the above didn’t work. What do we do about it?

    Well, we take a look at the create_baitmap_rmap.pl file and notice these key lines:

      while (<DIGEST2>) {
        my $chromosome_name            = ( split /\t/ )[0];
        my $first_base                 = ( split /\t/ )[1];
        my $last_base                  = ( split /\t/ )[2];
        #my $fragment_number            = ( split /\t/ )[3];
        my $ten_kb_region              = ceil( $first_base / 10000 );
    
        print RMAP "$chromosome_name\t$first_base\t$last_base\t$index\n";
    
        if($digest_fragments{"$chromosome_name\t$ten_kb_region"}->{"$first_base\t$last_base"} ne ''){       
            my $feature = $digest_fragments{"$chromosome_name\t$ten_kb_region"}->{"$first_base\t$last_base"};
            print BAITMAP "$chromosome_name\t$first_base\t$last_base\t$index\t$feature\n";
        }
        $index++;
      }
    

    which tells us that the RMAP file is literally the first 3 columns of the digest file (skipping the first two lines), with a rownumber (“$index”) as the fourth column.

    It also tells us in a roundabout way that the BAITMAP is a subset of the RMAP file, where the only information from the target bed file that is appended is the “$feature” (i.e. the 4th column of the bed, e.g. Sox17-1, Sox17-2).

    So we can actually generate the RMAP and BAITMAP files pretty easily with existing bioinformatic tools once we know this.

    • Generate the RMAP file:

      • Skip first two lines of digest, print columns 1-3, and give a 1-based index for the fourth column.
    cat Digest_Mouse_GRCm38_NlaIII.txt \
      | awk 'NR>2 { print $1"\t"$2"\t"$3"\t"(FNR-2)}' \
      > Digest_Mouse_GRCm38_NlaIII.txt.rmap
    

    • Generate the BAITMAP file (using bedtools intersect)

      • We do a “left-outer-join” of our RMAP file with our target bed file, which should paste any intersecting positions of our BED file right next to the positions of the RMAP file, or paste a “-1” if there is no intersection, uniting both files into an 8 column file (RMAP:columns 1-4, BED: columns 5-8).
      • We then filter out any lines with non zero values in column 5, and then print all the information from the RMAP (columns: 1-4) plus the feature name of the bed file (column:8), and this will give us the desired BAITMAP file.
    bedtools intersect -loj \
      -a Digest_Mouse_GRCm38_NlaIII.txt.rmap \
      -b my_target_sites.bed \
      | awk '$5 >= 0 {print $1"\t"$2"\t"$3"\t"$4"\t"$8}' \
      > Digest_Mouse_GRCm38_NlaIII.txt.bait
    

    And tadaa, you now have your two files that you can feed into bam2chicago to convert your files from HiCUP to Chicago.

    Bam2Chicago

    Ideally you just run:

    bam2chicago.sh \
     sample1.hicup.bam \
     Digest_Mouse_GRCm38_NlaIII.txt.bait \
     Digest_Mouse_GRCm38_NlaIII.txt.rmap \
     sample1
    

    But this also doesn’t really work, complaining of “Error: Type checker found wrong number of fields while tokenizing data line.”

    So you actually have to download the newer bam2chicago_V02.sh from this repo, and then run

    chmod +x ./bam2chicago_V02.sh
    ./bam2chicago_V02.sh \
      -b sample1.hicup.bam \
      -t Digest_Mouse_GRCm38_NlaIII.txt.bait \
      -r Digest_Mouse_GRCm38_NlaIII.txt.rmap \
      -o sample1
    

    and then you can generate the other files needed:

    makeDesignFiles.py \
     --rmapfile=Digest_Mouse_GRCm38_NlaIII.txt.rmap \
     --baitmapfile=Digest_Mouse_GRCm38_NlaIII.txt.bait
    

    And now you can hopefully proceed with the rest of the downstream analysis.

    I really wish I didn’t have to write this post, but hopefully someone out there finds it enlightening.

  4. Mikhail Spivakov

    Hi Mehmet

    Thanks so much for adding this info! As academic developers of this open source software, we do occasionally fall short on supporting our valued user base and rely on community contributors like you to make sure we stay up to scratch!

    All the best,

    Mikhail

  5. Mehmet Tekman

    Hi Mikhail, forgive my tone in the above post -- it has been an extremely long few weeks for me.

    I am greatly appreciative of Chicago, and of the Capture-C pipelines I’ve tested so far, HiCUP+Chicago is easily the best combination in terms of reliability and transparency of analysis

    The last thing open source developers need is discouragement from their users, and I apologize again for my tone. Please do keep up the good work!

    Warm regards,
    Mehmet

  6. Log in to comment