Function plotNovel() found nv NA/NaN in Introductory Webinar and Jupyter Notebook

Issue #173 resolved
Fubaoqian Huang created an issue

Hello! I am trying to learn the Immcantation Tutorials, and I found the Introductory Webinar and Jupyter Notebook and the example input file (input.fasta) . However, I got KeyError: 'V Frame shift' , so I installed the  development version of changeo according to the Issue. And I installed other R packages by install.packages(c("alakazam", "shazam", "tigger", "scoper")).

Everything was OK until I test the Identify potentially novel V gene alleles part in the Introductory Webinar and Jupyter Notebook. I found the nv contains no Novel allele. Thus when I run plotNovel() I got another error. Here are the codes:

suppressPackageStartupMessages(library(alakazam))

suppressPackageStartupMessages(library(tigger))

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(airr))

dir.create("results/tigger")

db <- read_airr("results/changeo/data_ph_parse-select.tsv")

|===================================================================| 100% 92 MB

colnames(db)

[1] "sequence_id" "sequence" "rev_comp"
[4] "productive" "v_call" "d_call"
[7] "j_call" "sequence_alignment" "germline_alignment"
[10] "junction" "junction_aa" "v_cigar"
[13] "d_cigar" "j_cigar" "stop_codon"
[16] "vj_in_frame" "locus" "junction_length"
[19] "np1_length" "np2_length" "v_sequence_start"
[22] "v_sequence_end" "v_germline_start" "v_germline_end"
[25] "d_sequence_start" "d_sequence_end" "d_germline_start"
[28] "d_germline_end" "j_sequence_start" "j_sequence_end"
[31] "j_germline_start" "j_germline_end" "isotype"

ighv <- readIgFasta("/hw5/data/imgt/human/vdj/imgt_human_IGHV.fasta")

ighv[1]

IGHV1-18*01

"CAGGTTCAGCTGGTGCAGTCTGGAGCT...GAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTT............ACCAGCTATGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTAC......AATGGTAACACAAACTATGCACAGAAGCTCCAG...GGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA"

nv <- findNovelAlleles(db, germline_db=ighv, nproc=8)

selectNovel(nv)

[1] germline_call note
[3] polymorphism_call nt_substitutions
[5] novel_imgt novel_imgt_count
[7] novel_imgt_unique_j novel_imgt_unique_cdr3
[9] perfect_match_count perfect_match_freq
[11] germline_call_count germline_call_freq
[13] mut_min mut_max
[15] mut_pass_count germline_imgt
[17] germline_imgt_count pos_min
[19] pos_max y_intercept
[21] y_intercept_pass snp_pass
[23] unmutated_count unmutated_freq
[25] unmutated_snp_j_gene_length_count snp_min_seqs_j_max_pass
[27] alpha min_seqs
[29] j_max min_frac
<0 rows> (or 0-length row.names)

pdf("results/tigger/plot1.pdf")
plotNovel(db, selectNovel(nv)[1,])
Error in novel_row$pos_min:novel_row$pos_max : NA/NaN argument
dev.off()

head(nv)
germline_call
1 IGHV1-2*02
2 IGHV1-2*04
3 IGHV1-3*01
4 IGHV1-3*01
5 IGHV1-8*01
6 IGHV1-8*03

note
1 No positions pass y-intercept test.
2 No positions pass y-intercept test.
3 No positions pass y-intercept test.
4 Position(s) passed y-intercept (9,60,86,117,144,173,190,191,196,232,255,259,263,270,303) but no unmutated versions of novel allele found.
5 No positions pass y-intercept test.
6 No positions pass y-intercept test.
polymorphism_call nt_substitutions novel_imgt novel_imgt_count
1 NA NA NA NA
2 NA NA NA NA
3 NA NA NA NA
4 NA NA NA NA
5 NA NA NA NA
6 NA NA NA NA
novel_imgt_unique_j novel_imgt_unique_cdr3 perfect_match_count
1 NA NA NA
2 NA NA NA
3 NA NA NA
4 NA NA NA
5 NA NA NA
6 NA NA NA

perfect_match_freq germline_call_count germline_call_freq mut_min mut_max
1 NA 2485 0.038 1 10
2 NA 758 0.012 1 10
3 NA 703 0.011 1 10
4 NA 703 0.011 29 38
5 NA 1083 0.017 1 10
6 NA 664 0.010 1 10
mut_pass_count
1 336
2 77
3 67
4 123
5 79
6 61
germline_imgt
1 CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTC............ACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACCCTAAC......AGTGGTGGCACAAACTATGCACAGAAGTTTCAG...GGCAGGGTCACCATGACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA
2 CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTC............ACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACCCTAAC......AGTGGTGGCACAAACTATGCACAGAAGTTTCAG...GGCTGGGTCACCATGACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA
3 CAGGTCCAGCTTGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTTTCCTGCAAGGCTTCTGGATACACCTTC............ACTAGCTATGCTATGCATTGGGTGCGCCAGGCCCCCGGACAAAGGCTTGAGTGGATGGGATGGATCAACGCTGGC......AATGGTAACACAAAATATTCACAGAAGTTCCAG...GGCAGAGTCACCATTACCAGGGACACATCCGCGAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAAGACACGGCTGTGTATTACTGTGCGAGAGA
4 CAGGTCCAGCTTGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTTTCCTGCAAGGCTTCTGGATACACCTTC............ACTAGCTATGCTATGCATTGGGTGCGCCAGGCCCCCGGACAAAGGCTTGAGTGGATGGGATGGATCAACGCTGGC......AATGGTAACACAAAATATTCACAGAAGTTCCAG...GGCAGAGTCACCATTACCAGGGACACATCCGCGAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAAGACACGGCTGTGTATTACTGTGCGAGAGA
5 CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTC............ACCAGTTATGATATCAACTGGGTGCGACAGGCCACTGGACAAGGGCTTGAGTGGATGGGATGGATGAACCCTAAC......AGTGGTAACACAGGCTATGCACAGAAGTTCCAG...GGCAGAGTCACCATGACCAGGAACACCTCCATAAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCGAGAGG
6 CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTC............ACCAGCTATGATATCAACTGGGTGCGACAGGCCACTGGACAAGGGCTTGAGTGGATGGGATGGATGAACCCTAAC......AGTGGTAACACAGGCTATGCACAGAAGTTCCAG...GGCAGAGTCACCATTACCAGGAACACCTCCATAAGCACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCGAGAGG
germline_imgt_count pos_min pos_max y_intercept y_intercept_pass snp_pass
1 1161 1 312 0.125 0 NA
2 512 1 312 0.125 0 NA
3 328 1 312 0.125 0 NA
4 328 1 312 0.125 15 76
5 791 1 312 0.125 0 NA
6 432 1 312 0.125 0 NA
unmutated_count unmutated_freq unmutated_snp_j_gene_length_count
1 NA NA NA
2 NA NA NA
3 NA NA NA
4 0 0 0
5 NA NA NA
6 NA NA NA
snp_min_seqs_j_max_pass alpha min_seqs j_max min_frac
1 NA 0.05 50 0.15 0.75
2 NA 0.05 50 0.15 0.75
3 NA 0.05 50 0.15 0.75
4 NA 0.05 50 0.15 0.75
5 NA 0.05 50 0.15 0.75
6 NA 0.05 50 0.15 0.75

Any ideas or help would be greatly appreciated!

The attachments was about IGHV, which might be helpful to solve the problem.

Comments (19)

  1. Fubaoqian Huang reporter

    Here is the head of the fasta file .

    $head input.fasta

    seq1|ISOTYPE=IGHM
    nnnnnnnnnnnnnnnnnnnnnnnagtgtcaggtgcagctggtggagctggggagcgtggt
    ccagcctgggaggccctgagtctctcctgtgcagcctctggattcaccttcagtctctat
    gctatgcactgggtccgccaggctccaggcaaggggctggagtgggtggcagttatatca
    tatcatgtaagcagtaaatactacgcagactccgtgaagggccgattcaccatctccaga
    gacaattccaagaacacgctgtatctgcaaatgaacagcctgagagctgaggacacggct
    gtgtattactgtgcgagagggccctatagtactggttattactacgagcttgactactgg
    ggccagggaacgctcggtcacccgtctcctcacgggtagtgcatccgccccaaccc
    seq2|ISOTYPE=IGHM
    nnnnnnnnnnnnnnnnnntgtcccaggtgcagctgcaggagtcgggcccaggactggtga

  2. ssnn

    I have not been able to reproduce the error with ExampleDb, after removing all sequences with the allele IGHV1-8*02, and producing 0 novels alleles.

    Does the same error happen with other rows in nv, or just with the first one?

    Why does your data use the germline name IGHV1-2_02_ in nv[1,]? Is this just a typo introduced during copy/paste to create the issues, or is this present in your data?

    Can you share your db so I can reproduce the error? I can set up a secure file transfer request if the file is too big, or if you’d rather not share it here. If so, please send me a message to immcantation@googlegroups.com, with the email you want to use.

  3. Fubaoqian Huang reporter

    Thank you for your reply.

    The same error happen with other rows in nv, maybe due to producing no novels alleles because I found the selectNovel(nv) step was inconsistent with the Jupyter Notebook.

    I’m sorry that it was  just a mistake introduced during copy/paste about the germline name, indeed. I have corrected that.

    And I have upload the db file in the attachments (db.zip). Actually I produced the file according to the Jupyter Notebook, And I create the igblast/imgt reference according to the following codes (downloaded from this website ):
    #Import IgBLAST and IMGT germline databases

    fetch_igblastdb.sh -o /data/igblast
    fetch_imgtdb.sh -o /data/imgt
    imgt2igblast.sh -i /data/imgt -o /data/igblast

    Here is the igblast version information:
    igblastn: 1.16.0
    Package: igblast 1.17.0, build Oct 6 2020 11:26:37

  4. ssnn

    Thanks for the files. I don’t get the error. I am using the most recent version of tigger, 1.0.0.999, installed from the repo. Please upgrade and let me know if that works for you.

  5. Fubaoqian Huang reporter

    Hi, I used the tigger (1.0.0) to run the codes in the html you provided above, and I dont’t get any errors either. Maybe I found the key point that I used plotNovel(db, selectNovel(nv)[1,]) for visualization while the selectNovel(nv) is empty. When I used plotNovel(db, nv[i,]), no error occured. Does it mean that No Novel allele found ?

    However, I wonder why selectNovel(nv) wasn’t empty in the Jupyter Notebook on the official website and therefore, using plotNovel(db, selectNovel(nv)[1,]) is OK. Is it due to the database/reference, packages version or parameter issues ?

    Anyway, thank you so much again.

  6. ssnn

    selectNovel(nv) is empty when tigger doesn’t find new polymorphisms that pass all the thresholds. If there are novel alleles, the second panel in the figure from plotNovel, shows the nucleotide usage at each of the polymorphic positions as a function of sequence-wide mutation count. If there are no novel alleles (`selectNovel(nv)` is empty), in recent versions of tigger, you can still use plotNovel, to visualize your data, and then the second panel shows the mutation count. The Jupyter Notebook is a tutorial, and uses a container, data and reference germlines for demonstration purposes only. It has been prepared to be able to find a novel allele. In particular, allele IGHV3-20*04 has been removed from the reference germline database, and is ‘discovered’ as IGHV3-20*01_C307T.

  7. Log in to comment