Input files and how to get them

Mehmet Tekman

I have to admit that the lack of answers for this problem that likely 99% of people who used HiCUP and then Chicago have faced is really startling.

Desired Solution

So the solution should be to download the create_baitmap_rmap.pl file (which for some reason isn’t included in the latest conda), and then run:

perl create_baitmap_rmap.pl Digest_Mouse_GRCm38_NlaIII.txt my_target_sites.bed

where, Digest_Mouse_GRCm38_NlaIII.txt is a file that contains ~10M lines like:

Genome:Mouse_GRCm38     Restriction_Enzyme1:NlaIII [CATG^]      Restriction_Enzyme2:None        Hicup Digester vers>
Chromosome      Fragment_Start_Position Fragment_End_Position   Fragment_Number RE1_Fragment_Number     5'_Restrict>
1       1       3000185 1       1       None    Re1
1       3000186 3000316 2       2       Re1     Re1
1       3000317 3000850 3       3       Re1     Re1

and my_target_sites.bed contains ~100 lines like:

1       4496944 4497063 Sox17-1
1       4497564 4497683 Sox17-2
1       71651908        71652027        Fn1-1
1       71652056        71652175        Fn1-2

(where the last column are my unique feature names that I invented)

So running the above perl command should generate the rmap and bait files, but in practice, what happens is that the rmap file is generated, and bait file is left blank, with numerous errors printed to the console like “Use of uninitialized value in string ne at create_baitmap_rmap.pl line 103”

‌

Real Solution

Okay, so the above didn’t work. What do we do about it?

Well, we take a look at the create_baitmap_rmap.pl file and notice these key lines:

  while (<DIGEST2>) {
    my $chromosome_name            = ( split /\t/ )[0];
    my $first_base                 = ( split /\t/ )[1];
    my $last_base                  = ( split /\t/ )[2];
    #my $fragment_number            = ( split /\t/ )[3];
    my $ten_kb_region              = ceil( $first_base / 10000 );

    print RMAP "$chromosome_name\t$first_base\t$last_base\t$index\n";

    if($digest_fragments{"$chromosome_name\t$ten_kb_region"}->{"$first_base\t$last_base"} ne ''){       
        my $feature = $digest_fragments{"$chromosome_name\t$ten_kb_region"}->{"$first_base\t$last_base"};
        print BAITMAP "$chromosome_name\t$first_base\t$last_base\t$index\t$feature\n";
    }
    $index++;
  }

which tells us that the RMAP file is literally the first 3 columns of the digest file (skipping the first two lines), with a rownumber (“$index”) as the fourth column.

It also tells us in a roundabout way that the BAITMAP is a subset of the RMAP file, where the only information from the target bed file that is appended is the “$feature” (i.e. the 4th column of the bed, e.g. Sox17-1, Sox17-2).

So we can actually generate the RMAP and BAITMAP files pretty easily with existing bioinformatic tools once we know this.

Generate the RMAP file:
- Skip first two lines of digest, print columns 1-3, and give a 1-based index for the fourth column.

cat Digest_Mouse_GRCm38_NlaIII.txt \
  | awk 'NR>2 { print $1"\t"$2"\t"$3"\t"(FNR-2)}' \
  > Digest_Mouse_GRCm38_NlaIII.txt.rmap

‌

Generate the BAITMAP file (using bedtools intersect)
- We do a “left-outer-join” of our RMAP file with our target bed file, which should paste any intersecting positions of our BED file right next to the positions of the RMAP file, or paste a “-1” if there is no intersection, uniting both files into an 8 column file (RMAP:columns 1-4, BED: columns 5-8).
- We then filter out any lines with non zero values in column 5, and then print all the information from the RMAP (columns: 1-4) plus the feature name of the bed file (column:8), and this will give us the desired BAITMAP file.

bedtools intersect -loj \
  -a Digest_Mouse_GRCm38_NlaIII.txt.rmap \
  -b my_target_sites.bed \
  | awk '$5 >= 0 {print $1"\t"$2"\t"$3"\t"$4"\t"$8}' \
  > Digest_Mouse_GRCm38_NlaIII.txt.bait

And tadaa, you now have your two files that you can feed into bam2chicago to convert your files from HiCUP to Chicago.

Bam2Chicago

Ideally you just run:

bam2chicago.sh \
 sample1.hicup.bam \
 Digest_Mouse_GRCm38_NlaIII.txt.bait \
 Digest_Mouse_GRCm38_NlaIII.txt.rmap \
 sample1

But this also doesn’t really work, complaining of “Error: Type checker found wrong number of fields while tokenizing data line.”

So you actually have to download the newer bam2chicago_V02.sh from this repo, and then run

chmod +x ./bam2chicago_V02.sh
./bam2chicago_V02.sh \
  -b sample1.hicup.bam \
  -t Digest_Mouse_GRCm38_NlaIII.txt.bait \
  -r Digest_Mouse_GRCm38_NlaIII.txt.rmap \
  -o sample1

and then you can generate the other files needed:

makeDesignFiles.py \
 --rmapfile=Digest_Mouse_GRCm38_NlaIII.txt.rmap \
 --baitmapfile=Digest_Mouse_GRCm38_NlaIII.txt.bait

And now you can hopefully proceed with the rest of the downstream analysis.

‌

I really wish I didn’t have to write this post, but hopefully someone out there finds it enlightening.

‌

2024-01-31T10:33:52+00:00

Comments (8)

Rola Dali reporter
- edited description
- 2017-09-11T16:38:00+00:00
Mikhail Spivakov
Sorry about the confusion.

You are right about making the .rmap file from the HiCUP digest file. All you need to do to make a baitmap file is take the subset of .rmap rows that corresponds to baited fragments and save them as .baitmap.

Hope it helps!
- 2017-09-11T16:39:42+00:00
Mikhail Spivakov
- changed status to resolved
- 2018-05-10T15:01:02+00:00
Dimitris Hat
can someone explain to me what this means : "All you need to do to make a baitmap file is take the subset of .rmap rows that corresponds to baited fragments and save them as .baitmap." ?
- 2019-02-06T12:39:09+00:00
Mehmet Tekman
I have to admit that the lack of answers for this problem that likely 99% of people who used HiCUP and then Chicago have faced is really startling.

Desired Solution

So the solution should be to download the create_baitmap_rmap.pl file (which for some reason isn’t included in the latest conda), and then run:
```
perl create_baitmap_rmap.pl Digest_Mouse_GRCm38_NlaIII.txt my_target_sites.bed
```
where, Digest_Mouse_GRCm38_NlaIII.txt is a file that contains ~10M lines like:
```
Genome:Mouse_GRCm38     Restriction_Enzyme1:NlaIII [CATG^]      Restriction_Enzyme2:None        Hicup Digester vers>
Chromosome      Fragment_Start_Position Fragment_End_Position   Fragment_Number RE1_Fragment_Number     5'_Restrict>
1       1       3000185 1       1       None    Re1
1       3000186 3000316 2       2       Re1     Re1
1       3000317 3000850 3       3       Re1     Re1
```
and my_target_sites.bed contains ~100 lines like:
```
1       4496944 4497063 Sox17-1
1       4497564 4497683 Sox17-2
1       71651908        71652027        Fn1-1
1       71652056        71652175        Fn1-2
```
(where the last column are my unique feature names that I invented)

So running the above perl command should generate the rmap and bait files, but in practice, what happens is that the rmap file is generated, and bait file is left blank, with numerous errors printed to the console like “Use of uninitialized value in string ne at create_baitmap_rmap.pl line 103”

‌

Real Solution

Okay, so the above didn’t work. What do we do about it?

Well, we take a look at the create_baitmap_rmap.pl file and notice these key lines:
```
  while (<DIGEST2>) {
    my $chromosome_name            = ( split /\t/ )[0];
    my $first_base                 = ( split /\t/ )[1];
    my $last_base                  = ( split /\t/ )[2];
    #my $fragment_number            = ( split /\t/ )[3];
    my $ten_kb_region              = ceil( $first_base / 10000 );

    print RMAP "$chromosome_name\t$first_base\t$last_base\t$index\n";

    if($digest_fragments{"$chromosome_name\t$ten_kb_region"}->{"$first_base\t$last_base"} ne ''){       
        my $feature = $digest_fragments{"$chromosome_name\t$ten_kb_region"}->{"$first_base\t$last_base"};
        print BAITMAP "$chromosome_name\t$first_base\t$last_base\t$index\t$feature\n";
    }
    $index++;
  }
```
which tells us that the RMAP file is literally the first 3 columns of the digest file (skipping the first two lines), with a rownumber (“$index”) as the fourth column.

It also tells us in a roundabout way that the BAITMAP is a subset of the RMAP file, where the only information from the target bed file that is appended is the “$feature” (i.e. the 4th column of the bed, e.g. Sox17-1, Sox17-2).

So we can actually generate the RMAP and BAITMAP files pretty easily with existing bioinformatic tools once we know this.
- Generate the RMAP file:
  - Skip first two lines of digest, print columns 1-3, and give a 1-based index for the fourth column.
```
cat Digest_Mouse_GRCm38_NlaIII.txt \
  | awk 'NR>2 { print $1"\t"$2"\t"$3"\t"(FNR-2)}' \
  > Digest_Mouse_GRCm38_NlaIII.txt.rmap
```
‌
- Generate the BAITMAP file (using bedtools intersect)
  - We do a “left-outer-join” of our RMAP file with our target bed file, which should paste any intersecting positions of our BED file right next to the positions of the RMAP file, or paste a “-1” if there is no intersection, uniting both files into an 8 column file (RMAP:columns 1-4, BED: columns 5-8).
  - We then filter out any lines with non zero values in column 5, and then print all the information from the RMAP (columns: 1-4) plus the feature name of the bed file (column:8), and this will give us the desired BAITMAP file.
```
bedtools intersect -loj \
  -a Digest_Mouse_GRCm38_NlaIII.txt.rmap \
  -b my_target_sites.bed \
  | awk '$5 >= 0 {print $1"\t"$2"\t"$3"\t"$4"\t"$8}' \
  > Digest_Mouse_GRCm38_NlaIII.txt.bait
```
And tadaa, you now have your two files that you can feed into bam2chicago to convert your files from HiCUP to Chicago.

Bam2Chicago

Ideally you just run:
```
bam2chicago.sh \
 sample1.hicup.bam \
 Digest_Mouse_GRCm38_NlaIII.txt.bait \
 Digest_Mouse_GRCm38_NlaIII.txt.rmap \
 sample1
```
But this also doesn’t really work, complaining of “Error: Type checker found wrong number of fields while tokenizing data line.”

So you actually have to download the newer bam2chicago_V02.sh from this repo, and then run
```
chmod +x ./bam2chicago_V02.sh
./bam2chicago_V02.sh \
  -b sample1.hicup.bam \
  -t Digest_Mouse_GRCm38_NlaIII.txt.bait \
  -r Digest_Mouse_GRCm38_NlaIII.txt.rmap \
  -o sample1
```
and then you can generate the other files needed:
```
makeDesignFiles.py \
 --rmapfile=Digest_Mouse_GRCm38_NlaIII.txt.rmap \
 --baitmapfile=Digest_Mouse_GRCm38_NlaIII.txt.bait
```
And now you can hopefully proceed with the rest of the downstream analysis.

‌

I really wish I didn’t have to write this post, but hopefully someone out there finds it enlightening.

‌
- 2024-01-31T10:33:52+00:00
Mikhail Spivakov
Hi Mehmet

Thanks so much for adding this info! As academic developers of this open source software, we do occasionally fall short on supporting our valued user base and rely on community contributors like you to make sure we stay up to scratch!

All the best,

Mikhail
- 2024-01-31T21:48:10+00:00
Mehmet Tekman
Hi Mikhail, forgive my tone in the above post -- it has been an extremely long few weeks for me.

I am greatly appreciative of Chicago, and of the Capture-C pipelines I’ve tested so far, HiCUP+Chicago is easily the best combination in terms of reliability and transparency of analysis

The last thing open source developers need is discouragement from their users, and I apologize again for my tone. Please do keep up the good work!

Warm regards,
Mehmet
- 2024-01-31T22:29:57+00:00
Mikhail Spivakov
Thanks for your kind words and no worries!
- 2024-01-31T22:58:01+00:00
Log in to comment