getTrees error: Tibble columns must have compatible sizes

Issue #4 resolved
Bo Sun created an issue

Hi Ken and team Immcantation,

Thanks for the brilliant tool! I am running into a recurrent error with the getTrees function after formatting my clones.

There seems to be a discrepancy in the tibble columns vs clones. It seems to come from the germline column and the column containing the sequence_ids. I have no duplicated sequence_ids, so I am a little uncertain where the discrepancy is coming from. Any help would be great.

Comments (18)

  1. Kenneth Hoehn

    There were some sequences that were longer than the reconstructed germlines. Couldn’t figure out a general solution without access to the data input though. Were you having a similar issue?

  2. Kelly Dew-Budd

    I am having the same issue. Is there a way for me to send you the data that produces the error? I have 5 sequences, part of 2 families. One produces the error; the other doesn’t.

  3. Kellie MacPhee

    Hi! I am also seeing this issue. It seems to happen when a clone has aligned sequences of differing lengths (i.e. germline and sequence alignment columns have the same length for each row, but different rows may have different lengths). During formatClones, sequences are padded with Ns but the germline sequences are not also padded with Ns in the same way, so the germline sequences end up being shorter. I can’t share my data for privacy reasons, but here is an example to illustrate what I mean.

    library(dowser)
    
    data(ExampleAirr)
    ExampleAirr = ExampleAirr[ExampleAirr$clone_id %in% c("3170"),]
    
    # modify data so that rows have different sequence lengths
    # but, in each row the germline and sequence alignments are equal length
    ExampleAirr$sequence_alignment[1] = paste0(ExampleAirr$sequence_alignment[1], 'CTA')
    ExampleAirr$germline_alignment[1] = paste0(ExampleAirr$germline_alignment[1], 'CTA')
    nchar(ExampleAirr$sequence_alignment)
    nchar(ExampleAirr$germline_alignment)
    
    clones = formatClones(ExampleAirr)
    trees = getTrees(clones, build="pml")
    

  4. Kenneth Hoehn

    Hi Kellie. Thanks for the reproducible example! I’m trying to understand what would cause this and the best place to implement a solution. You created the germline_alignment_d_mask column with createGermlines, correct?

  5. Kellie MacPhee

    I actually already have the germline sequences from IMGT, so I’m not using createGermlines in dowser. I'm trying to use the full heavy chain sequence alignments from IMGT to construct trees, and it seems like there’s more variation in the length of those versus just using the v segment alignment or some other subsequence. It also seems like the amino acid sequence lengths are more consistent than the nucleotide sequence lengths, but I need to use nucleotide sequences.

  6. Kenneth Hoehn

    Huh. Does IMGT create a clonal germline for each sequence individually or a consensus germline for each clonal family? For trees in dowser, all sequences in each clone need to have the same germline sequence. This is what createGermlines does. You’ve clustered the BCR sequences into clonal families, right?

  7. Kellie MacPhee

    It’s for each sequence individually. We run IMGT first, then use Change-O to cluster into clonal families, so I don’t think IMGT has any information about the clonal families. I was wondering if that might be part of the issue. I guess that for other clonotypes with multiple germline sequences that I feed into dowser (which do generate trees successfully), dowser is silently taking one of the germline sequences as the point of reference? Maybe the answer here is just to run a check on the input data, with a descriptive error raised when people are feeding in multiple germline sequences and sequences of differing lengths.

  8. Kellie MacPhee

    I think I might have misunderstood the purpose of createGermlines before. Does it take existing germline sequences and create a consensus from them? Or does it just require the sequences from one clonal family, plus maybe the V and J calls, and reconstruct germlines from there? It’s a little unclear to me from the createGermlines documentation whether all of the column names that are input variables have to be present in the input AIRR data, or if some of those are just names for the output.

  9. Kenneth Hoehn

    Right - it’s important for the phylogenetic analyses that the sequences within each clone have the same germline. If you use the createGermlines function in dowser or the CreateGermlines.py script in ChangeO, this should fix the issue. Looking at the code, Dowser just grabs the first germline within a clone, since it assumes this step has been done. I think you’re correct that this should have a check on the input in case that step is skipped.

  10. Kellie MacPhee

    Makes sense! Thanks Kenneth. I’ll add a createGermlines call to my code and hopefully that will resolve everything.

  11. Kenneth Hoehn

    createGermlines will create a new consensus germline based on the majority V call, J call, and sequence length within a clone. The germline junction region is usually masked with Ns because it is difficult to infer accurately.

  12. Kenneth Hoehn

    Added some more informative error messages to the most recent development version, which you can install with:

    devtools::install_bitbucket("kleinstein/dowser")
    

    You can see if this is what’s causing your issue. Thanks for the feedback - hopefully this will resolve the issue for future users!

  13. Kelly Dew-Budd

    Thanks for troubleshooting this. I was using the CreateGermlines.py script prior to Dowser and getting this error. Adding the --cloned argument fixed the issue.

  14. Kenneth Hoehn

    Great - glad you were able to find the issue. The createGermlines function in Dowser is functionally equivalent to the CreateGermlines.py script with the `--cloned` option. Either should work.
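
For anyone who lands here with the same error, the input check discussed above can be sketched in base R before calling formatClones. This is only a rough illustration, not dowser's own check: `checkCloneGermlines` is a hypothetical helper, and the column names `clone_id`, `sequence_alignment`, and `germline_alignment_d_mask` are assumed to match your data. It reports, per clone, how many distinct germline sequences and how many distinct aligned sequence lengths are present.

```r
# Hypothetical helper (not part of dowser): flag clones whose rows disagree
# on the germline or whose aligned sequences differ in length.
# Column names are assumptions; change them to match your data.
checkCloneGermlines <- function(data,
                                clone = "clone_id",
                                seq = "sequence_alignment",
                                germ = "germline_alignment_d_mask") {
  # one data frame per clone
  groups <- split(data, data[[clone]])
  report <- lapply(names(groups), function(id) {
    d <- groups[[id]]
    data.frame(
      clone_id = id,
      n_germlines = length(unique(d[[germ]])),          # should be 1
      n_seq_lengths = length(unique(nchar(d[[seq]]))),  # should be 1
      stringsAsFactors = FALSE
    )
  })
  report <- do.call(rbind, report)
  # a clone passes only if it has one germline and one sequence length
  report$ok <- report$n_germlines == 1 & report$n_seq_lengths == 1
  report
}

# Toy data: clone "a" is consistent, clone "b" mixes germlines and lengths.
toy <- data.frame(
  clone_id = c("a", "a", "b", "b"),
  sequence_alignment = c("ACGT", "ACGA", "ACGT", "ACGTCTA"),
  germline_alignment_d_mask = c("ACGT", "ACGT", "ACGT", "ACGTCTA"),
  stringsAsFactors = FALSE
)
report <- checkCloneGermlines(toy)
print(report)
```

Clones where `ok` is FALSE are the ones likely to need createGermlines (or CreateGermlines.py with `--cloned`) run on them before building trees.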
