Weird observedMutations results

Issue #166 resolved
Shaowen Jiang created an issue

Dear Shazam developers, I got unexpectedly high number of mut_freq and mut_count in my dataset. But if I took the sequence out to do Igblast, it shows 100% identity to the germline. Here’s the largest mut_freq sequence in my data as an example.

tmp$sequence_alignment
[1] "CAGCTGGTGGAGTCTGGGGGAGGCTTAGTGAAGCCTGGAGGGTCCCTGAAACTCTCCTGTGCAGCCTCTGGATTCACTTTCAGTGACTATGGAATGCACTGGGTTCGTCAGGCTCCAGAGAAGGGGCTGGAGTGGGTTGCATACATTAGTAGTGGCAGTAGTACCATCTACTATGCAGACACAGTGAAGGGCCGATTCACCATCTCCAGAGACAATGCCAAGAACACCCTGTTCCTGCAAATGACCAGTCTGAGGTCTGAGGACACGGCCATGTATTACTGTGCAAGGCCTCATTACTACGGTAGTAGAGGGTACTTCGATGTCTGGGGCACAGGGACCACGGTAACCGTCTCCTCAG"
tmp$germline_alignment
[1] "CAGCTGGTGGAGTCTGGGGGAGGCTTAGTGAAGCCTGGAGGGTCCCTGAAACTCTCCTGTGCAGCCTCTGGATTCACTTTCAGTGACTATGGAATGCACTGGGTTCGTCAGGCTCCAGAGAAGGGGCTGGAGTGGGTTGCATACATTAGTAGTGGCAGTAGTACCATCTACTATGCAGACACAGTGAAGGGCCGATTCACCATCTCCAGAGACAATGCCAAGAACACCCTGTTCCTGCAAATGACCAGTCTGAGGTCTGAGGACACGGCCATGTATTACTGTGCAAGGNNNNATTACTACGGTAGTAGNNGGTACTTCGATGTCTGGGGCACAGGGACCACGGTCACCGTCTCCTCAG"
tmp$germline_alignment_d_mask
[1] "CAGCTGGTGGAGTCTGGGGGA...GGCTTAGTGAAGCCTGGAGGGTCCCTGAAACTCTCCTGTGCAGCCTCTGGATTCACTTTC............AGTGACTATGGAATGCACTGGGTTCGTCAGGCTCCAGAGAAGGGGCTGGAGTGGGTTGCATACATTAGTAGTGGC......AGTAGTACCATCTACTATGCAGACACAGTGAAG...GGCCGATTCACCATCTCCAGAGACAATGCCAAGAACACCCTGTTCCTGCAAATGACCAGTCTGAGGTCTGAGGACNNNNNNNNNNNNNNNNNNNNNNGGTACTTCGATGTCTGGGGCACAGGGACCACGGTCACCGTCTCCTCAG"
tmp$mu_freq
[1] 0.607717
tmp$mu_count
[1] 189

And here’s the igblast result from this exact sequence:

And my code to generate my result: I use AssignGenes.py to align and CreateGermlines.py for the germline information

data_h_freq_comb = observedMutations(h_bcr, sequenceColumn = "sequence_alignment",
                            germlineColumn = "germline_alignment_d_mask",
                            regionDefinition = NULL,
                            frequency = T,
                            combine = T)

Any suggestions and help would be appreciated!!! Thanks!

Comments (4)

  1. ssnn

    Hi Shaowen. I noticed that the sequences in sequence_alignment and germline_alignment don’t have IMGT gaps. Immcantation uses IMGT aligned sequences in the *_aligment fields. So I created a fasta file with your tmp$sequence_alignment sequence and repeated your analysis: AssingGenes, CreateGermlines, observedMutations. The mutation frequency I get is 0.1083591.

    The commands I used to run AssignGenes and createGermlines:

    docker run -v $(pwd):/data:z --workdir /data immcantation/suite:4.4.0 AssignGenes.py \
       igblast -s example_sequence.fasta -b /usr/local/share/igblast/ --outdir output
    
    docker run -v $(pwd):/data:z --workdir /data immcantation/suite:4.4.0 \
    MakeDb.py igblast -s example_sequence.fasta -i output/example_sequence_igblast.fmt7  \
    -r /usr/local/share/germlines/imgt/human/vdj \
    --log test.txt --outdir output --format airr
    
    docker run -v $(pwd):/data:z --workdir /data immcantation/suite:4.4.0 \
    CreateGermlines.py -d output/example_sequence_igblast_db-pass.tsv \
    -r /usr/local/share/germlines/imgt/human/vdj \
    -g dmask \
    --format airr \
    --log create-germlines.log
    

    In R:

    library(dplyr)
    db <- airr::read_rearrangement("output/example_sequence_igblast_db-pass_germ-pass.tsv")
    db$sequence                                                                                                                                                        
    [1] "CAGCTGGTGGAGTCTGGGGGAGGCTTAGTGAAGCCTGGAGGGTCCCTGAAACTCTCCTGTGCAGCCTCTGGATTCACTTTCAGTGACTATGGAATGCACTGGGTTCGTCAGGCTCCAGAGAAGGGGCTGGAGTGGGTTGCATACATTAGTAGTGGCAGTAGTACCATCTACTATGCAGACACAGTGAAGGGCCGATTCACCATCTCCAGAGACAATGCCAAGAACACCCTGTTCCTGCAAATGACCAGTCTGAGGTCTGAGGACACGGCCATGTATTACTGTGCAAGGCCTCATTACTACGGTAGTAGAGGGTACTTCGATGTCTGGGGCACAGGGACCACGGTAACCGTCTCCTCAG"
    db$sequence_alignment
    [1] "......CAGCTGGTGGAGTCTGGGGGA...GGCTTAGTGAAGCCTGGAGGGTCCCTGAAACTCTCCTGTGCAGCCTCTGGATTCACTTTC............AGTGACTATGGAATGCACTGGGTTCGTCAGGCTCCAGAGAAGGGGCTGGAGTGGGTTGCATACATTAGTAGTGGC......AGTAGTACCATCTACTATGCAGACACAGTGAAG...GGCCGATTCACCATCTCCAGAGACAATGCCAAGAACACCCTGTTCCTGCAAATGACCAGTCTGAGGTCTGAGGACACGGCCATGTATTACTGTGCAAGGCCTCATTACTACGGTAGTAGAGGGTACTTCGATGTCTGGGGCACAGGGACCACGGTAACCGTCTCCTCAG"
    db$germline_alignment
    [1] "GAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTC............AGTAGCTATAGCATGAACTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTTTCATACATTAGTAGTAGT......AGTAGTACCATATACTACGCAGACTCTGTGAAG...GGCCGATTCACCATCTCCAGAGACAATGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGNNNNNATTACTATGATAGTAGNNNNNNNNNNNNNGTCTGGGGCAAAGGGACCACGGTCACCGTCTCCTCAG"
    db$germline_alignment_d_mask
    [1] "GAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCTTGGTACAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTC............AGTAGCTATAGCATGAACTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTTTCATACATTAGTAGTAGT......AGTAGTACCATATACTACGCAGACTCTGTGAAG...GGCCGATTCACCATCTCCAGAGACAATGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGTCTGGGGCAAAGGGACCACGGTCACCGTCTCCTCAG"
    db %>% select(contains("_call"))
    # A tibble: 1 × 4
      v_call      d_call      j_call   c_call
      <chr>       <chr>       <chr>    <chr> 
    1 IGHV3-48*01 IGHD3-22*01 IGHJ6*04 NA   
    db <- shazam::observedMutations(db, sequenceColumn = "sequence_alignment",
    +                             germlineColumn = "germline_alignment_d_mask",
    +                             regionDefinition = NULL,
    +                             frequency = T,
    +                             combine = T)
    db %>% select(contains("mu_"))
    # A tibble: 1 × 1
      mu_freq
        <dbl>
    1   0.108 
    

  2. Shaowen Jiang reporter

    Hi, Sorry for the late response, I forgot to mention that this sequence is from mouse not human. But thanks fro the workflow, I will try on that.

    I have another issue that is not quite related to this question, but if you don’t mind, I will just post here.

    I noticed that in the IMGT database from the docker container, the TRBC1 constant sequence is missing. Is that a reason to skip TRBC1?

    less /usr/local/share/germlines/imgt/mouse/imgt_mouse_TRBC.fasta
    

    And inside has only 2 sequences:

    >M26057+M26058+M26059+M26060|TRBC2*01|Mus_musculus_B10.A|F|EX1+EX2+EX3+EX4|M26057:63..437;M26058:48..65;M26059:29..135;M26060:78..95|519 nt|1|+1| | | |519+0=519| | |
    naggatctgagaaatgtgactccacccaaggtctccttgtttgagccatcaaaagcagag
    attgcaaacaaacaaaaggctaccctcgtgtgcttggccaggggcttcttccctgaccac
    gtggagctgagctggtgggtgaatggcaaggaggtccacagtggggtcagcacggaccct
    caggcctacaaggagagcaattatagctactgcctgagcagccgcctgagggtctctgct
    accttctggcacaatcctcgaaaccacttccgctgccaagtgcagttccatgggctttca
    gaggaggacaagtggccagagggctcacccaaacctgtcacacagaacatcagtgcagag
    gcctggggccgagcagactgtggaatcacttcagcatcctatcatcagggggttctgtct
    gcaaccatcctctatgagatcctactggggaaggccaccctatatgctgtgctggtcagt
    ggcctagtgctgatggccatggtcaagaaaaaaaattcc
    >AE000665|TRBC2*03|Mus_musculus_129|F|EX1+EX2+EX3+EX4|166812..167186+167692..167709+167854..167960+168243..168260|519 nt|1|+1| | | |519+0=519| | |
    naggatctgagaaatgtgactccacccaaggtctccttgtttgagccatcaaaagcagag
    attgcaaacaaacaaaaggctaccctcgtgtgcttggccaggggcttcttccctgaccac
    gtggagctgagctggtgggtgaatggcaaggaggtccacagtggggtcagcacggaccct
    caggcctacaaggagagcaattatagctactgcctgagcagccgcctgagggtctctgct
    accttctggcacaatcctcgaaaccacttccgctgccaagtgcagttccatgggctttca
    gaggaggacaagtggccagagggctcacccaaacctgtcacacagaacatcagtgcagag
    gcctggggccgagcagactgtggaatcacttcagcatcctatcatcagggggttctgtct
    gcaaccatcctctatgagatcctactggggaaggccaccctatatgctgtgctggtcagt
    ggcctggtgctgatggccatggtcaagaaaaaaaattcc
    

    Thanks!

    Shaowen

  3. Log in to comment