getTrees error: Tibble columns must have compatible sizes
Hi Ken and team Immcantation,
Thanks for the brilliant tool! I am running into a recurrent error with the getTrees function after formatting my clones.
There seems to be a discrepancy in the tibble columns vs. clones. It seems to come from the Germline column and the column containing the sequence_ids. I have no duplicated sequence_ids, so I am a little uncertain where the discrepancy is coming from. Any help would be great.
Comments (18)
-
-
Were you able to solve this issue?
-
There were some sequences that were longer than the reconstructed germlines. Couldn’t figure out a general solution without access to the data input though. Were you having a similar issue?
-
I am having the same issue. Is there a way for me to send you the data that produces the error? I have 5 sequences, part of 2 families. One produces the error; the other doesn’t.
-
Yes that would be helpful, as well as the commands you’re running. You can email it to immcantation@googlegroups.com
-
Hi! I am also seeing this issue. It seems to happen when a clone has aligned sequences of differing lengths (i.e. germline and sequence alignment columns have the same length for each row, but different rows may have different lengths). During formatClones, sequences are padded with Ns but the germline sequences are not also padded with Ns in the same way, so the germline sequences end up being shorter. I can’t share my data for privacy reasons, but here is an example to illustrate what I mean.
library(dowser)
data(ExampleAirr)
ExampleAirr = ExampleAirr[ExampleAirr$clone_id %in% c("3170"),]
# modify data so that rows have different sequence lengths
# but, in each row the germline and sequence alignments are equal length
ExampleAirr$sequence_alignment[1] = paste0(ExampleAirr$sequence_alignment[1], 'CTA')
ExampleAirr$germline_alignment[1] = paste0(ExampleAirr$germline_alignment[1], 'CTA')
nchar(ExampleAirr$sequence_alignment)
nchar(ExampleAirr$germline_alignment)
clones = formatClones(ExampleAirr)
trees = getTrees(clones, build="pml")
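To find which clones are affected before calling formatClones, something like the following base R sketch may help. It counts the distinct alignment lengths per clone. The toy table and the helper name `alignment_length_report` are made up for illustration and are not part of dowser; only the AIRR column names come from the example above.

```r
# Toy AIRR-style table (made-up sequences, for illustration only).
airr <- data.frame(
  sequence_id        = c("s1", "s2", "s3", "s4"),
  clone_id           = c("A", "A", "B", "B"),
  sequence_alignment = c("ATGCCTCTA", "ATGCCT", "GGGTTT", "GGGTTA"),
  germline_alignment = c("ATGCCACTA", "ATGCCA", "GGGTTT", "GGGTTT"),
  stringsAsFactors   = FALSE
)

# Hypothetical helper (not a dowser function): for each clone, count how many
# distinct lengths occur among the sequence and germline alignments.
alignment_length_report <- function(data) {
  clones <- split(data, data$clone_id)
  n_distinct_lengths <- function(column) {
    vapply(clones, function(x) length(unique(nchar(x[[column]]))), integer(1))
  }
  report <- data.frame(
    clone_id           = names(clones),
    n_sequence_lengths = n_distinct_lengths("sequence_alignment"),
    n_germline_lengths = n_distinct_lengths("germline_alignment"),
    stringsAsFactors   = FALSE
  )
  rownames(report) <- NULL
  report
}

report <- alignment_length_report(airr)

# Clones with more than one distinct length are candidates for this error.
suspect <- report[report$n_sequence_lengths > 1 | report$n_germline_lengths > 1, ]
print(suspect)
```

In this toy data only clone "A" is flagged, because its two rows have different lengths even though each row's sequence and germline alignments match.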
-
Hi Kellie. Thanks for the reproducible example! I’m trying to understand what would cause this and the best place to implement a solution. You created the germline_alignment_d_mask column with createGermlines, correct?
-
I actually already have the germline sequences from IMGT, so I’m not using createGermlines in dowser. I'm trying to use the full heavy chain sequence alignments from IMGT to construct trees, and it seems like there’s more variation in the length of those versus just using the v segment alignment or some other subsequence. It also seems like the amino acid sequence lengths are more consistent than the nucleotide sequence lengths, but I need to use nucleotide sequences.
-
Huh. Does IMGT create a clonal germline for each sequence individually or a consensus germline for each clonal family? For trees in dowser, all sequences in each clone need to have the same germline sequence. This is what createGermlines does. You’ve clustered the BCR sequences into clonal families, right?
-
It’s for each sequence individually. We run IMGT first, then use Change-O to cluster into clonal families, so I don’t think IMGT has any information about the clonal families. I was wondering if that might be part of the issue. I guess for other clonotypes with multiple germline sequences that I feed into dowser (which do generate trees successfully), dowser is silently taking one of the germline sequences as the point of reference? Maybe the answer here is just to run a check on the input data, with a descriptive error raised when people are feeding in multiple germline sequences and sequences of differing lengths.
-
I think I might have misunderstood the purpose of createGermlines before. Does it take existing germline sequences and create a consensus from them? Or does it just require the sequences from one clonal family, plus maybe the V and J calls, and reconstruct germlines from there? It’s a little unclear to me in the createGermlines documentation whether all of the column names that are input variables have to be present in the input AIRR data, or if some of those are just names for the output.
-
Right - it’s important for the phylogenetic analyses that the sequences within each clone have the same germline. If you use the createGermlines function in dowser or the CreateGermlines.py script in ChangeO, this should fix the issue. Looking at the code, Dowser just grabs the first germline within a clone, since it assumes this step has been done. I think you’re correct that this should have a check on the input in case that step is skipped.
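The kind of input check discussed here could look roughly like the sketch below. It is not the check that was added to dowser. The function name `check_clone_germlines` and the toy data are hypothetical, and the default germline column is assumed to be germline_alignment_d_mask, as mentioned earlier in the thread.

```r
# Hypothetical pre-flight check (not part of dowser): stop with a descriptive
# error if any clone contains more than one distinct germline sequence.
check_clone_germlines <- function(data, clone = "clone_id",
                                  germline = "germline_alignment_d_mask") {
  n_germlines <- tapply(data[[germline]], data[[clone]],
                        function(g) length(unique(g)))
  bad <- names(n_germlines)[n_germlines > 1]
  if (length(bad) > 0) {
    stop("Clones with more than one germline sequence: ",
         paste(bad, collapse = ", "),
         ". Consider reconstructing clonal germlines first.",
         call. = FALSE)
  }
  invisible(TRUE)
}

# Toy data: clone "B" has two different germlines, clone "A" has one.
toy <- data.frame(
  clone_id                  = c("A", "A", "B", "B"),
  germline_alignment_d_mask = c("ATGNNN", "ATGNNN", "GGGNNN", "GGGNNNCTA"),
  stringsAsFactors          = FALSE
)

# Capture the error message instead of halting the script.
result <- tryCatch(check_clone_germlines(toy),
                   error = function(e) conditionMessage(e))
print(result)
```

Running a check like this before formatClones would surface the problem with a clone ID attached, rather than as a tibble size error later in getTrees.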
-
Makes sense! Thanks Kenneth. I’ll add a createGermlines call to my code and hopefully that will resolve everything.
-
createGermlines will create a new consensus germline based on the majority V call, J call, and sequence length within a clone. The germline junction region is usually masked with Ns because it is difficult to infer accurately.
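As a rough illustration of the majority rule described above (this is not the createGermlines implementation, and it does not build a germline), the toy sketch below picks the most common V call, J call, and alignment length within a single clone. The helper `most_common` and the example values are assumptions made for illustration.

```r
# Hypothetical helper: most frequent value in a vector, returned as a string.
# Ties go to the first value in table() order.
most_common <- function(x) names(which.max(table(x)))

# Toy clone with three sequences (made-up values).
clone <- data.frame(
  v_call             = c("IGHV3-23*01", "IGHV3-23*01", "IGHV3-23*04"),
  j_call             = c("IGHJ4*02", "IGHJ4*02", "IGHJ4*02"),
  sequence_alignment = c("ATGCCTAAA", "ATGCCTAAA", "ATGCCT"),
  stringsAsFactors   = FALSE
)

# Majority V call, J call, and alignment length for the clone.
majority <- list(
  v_call = most_common(clone$v_call),
  j_call = most_common(clone$j_call),
  length = as.integer(most_common(nchar(clone$sequence_alignment)))
)
str(majority)
```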
-
Added some more informative error messages to the most recent development version, which you can install with:
devtools::install_bitbucket("kleinstein/dowser")
You can see if this is what’s causing your issue. Thanks for the feedback - hopefully this will resolve the issue for future users!
-
Thanks for troubleshooting this. I was using the CreateGermlines.py script prior to Dowser and getting this error. Adding the `--cloned` argument fixed the issue.
-
Great - glad you were able to find the issue. The createGermlines function in Dowser is functionally equivalent to the CreateGermlines.py script with the `--cloned` option. Either should work.
-
- changed status to resolved
Marking as resolved, new changes being submitted to CRAN soon. Feel free to re-open if still an issue.
-
Hi Bo. Glad you’ve found Dowser useful! Do you have an example dataset that could be used to reproduce this error? You can send to immcantation@googlegroups.com if that’s preferable to posting it here.