weird warning message when running `reassignAlleles()`

Issue #14 resolved
Julian Zhou created an issue

I have not seen this before for any other subject, only this one. The message is not helpful either.

> vCallGeno = reassignAlleles(clip_db=db, genotype_db=genoSeqs, v_call="V_CALL", 
+                                 method="hamming", keep_gene=TRUE)

Warning message:
In V_CALL_GENOTYPED[ind] = sapply(best_alleles, paste, collapse = ",") :
  number of items to replace is not a multiple of replacement length

Comments (3)

  1. Julian Zhou reporter

    I'll try to explain what I think happened:

    From reassignAlleles():

    for (het_gene in hetero_genes) {
                ind = which(v_genes %in% het_gene)
                if (length(ind) > 0) {
                    het_alleles = names(geno_genes[which(geno_genes == 
                      het_gene)])
                    het_seqs = genotype_db[het_alleles]
                    if (method == "hamming") {
                      dists = lapply(het_seqs, function(x) sapply(getMutatedPositions(v_sequences[ind], 
                        x, match_instead = FALSE), length))
                      dist_mat = matrix(unlist(dists), ncol = length(het_seqs))
                    }
                    else {
                      stop("Only Hamming distance is currently supported as a method.")
                    }
                    best_match = apply(dist_mat, 1, function(x) which(x == 
                      min(x)))
                    best_alleles = sapply(best_match, function(x) het_alleles[x])
                    V_CALL_GENOTYPED[ind] = sapply(best_alleles, 
                      paste, collapse = ",")
                }
            }
    

    dist_mat appears to always be a matrix, even when its nrow is 1. So dist_mat is not a problem.
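
    A minimal sketch of that claim, using a toy stand-in for dists (the values are made up, not taken from the affected subject): with a single matching sequence, each list entry has length 1, and matrix() still returns a 1-row matrix.

    ```r
    # Toy stand-in for `dists`: one distance per allele for a single sequence.
    # The values are illustrative assumptions.
    dists <- list(a1 = 2, a2 = 0, a3 = 0)

    # Same construction as in reassignAlleles()
    dist_mat <- matrix(unlist(dists), ncol = length(dists))

    is.matrix(dist_mat)  # TRUE
    dim(dist_mat)        # 1 3
    ```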

    The last 3 steps involving best_match, best_alleles, and V_CALL_GENOTYPED[ind] work as intended in 2 scenarios.

    Scenario 1

    • dist_mat has nrow >= 1

    • the minimum of dist_mat[i, ] is attained in a single column for every i, so which(x == min(x)) returns one index per row and best_match is a vector

    • best_alleles is a vector, with each entry being a single allele

    • each slot in V_CALL_GENOTYPED[ind] gets assigned a single entry from best_alleles

    Scenario 2

    • dist_mat has nrow > 1

    • the minimum of dist_mat[i, ] is tied across multiple columns for some i, so which(x == min(x)) returns several indices for those rows, thereby rendering best_match as a list

    • best_alleles is a list, with some entries containing a vector of multiple alleles

    • sapply(best_alleles, paste, collapse = ",") works as a de facto lapply and concatenates the multiple alleles in the best_alleles entries
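
    Scenarios 1 and 2 can be reproduced with small hand-made matrices. This is only a sketch: the allele names and distances are hypothetical, and the three steps are copied from the snippet above.

    ```r
    het_alleles <- c("IGHV1-2*02", "IGHV1-2*04")  # hypothetical allele names

    # Scenario 1: every row has a unique minimum, so apply() returns a vector.
    dist_mat1 <- matrix(c(0, 3,
                          2, 1), ncol = 2, byrow = TRUE)
    best_match1 <- apply(dist_mat1, 1, function(x) which(x == min(x)))
    best_alleles1 <- sapply(best_match1, function(x) het_alleles[x])
    calls1 <- sapply(best_alleles1, paste, collapse = ",")

    # Scenario 2: row 1 is tied, row 2 is not, so apply() returns a list.
    dist_mat2 <- matrix(c(1, 1,
                          2, 0), ncol = 2, byrow = TRUE)
    best_match2 <- apply(dist_mat2, 1, function(x) which(x == min(x)))
    best_alleles2 <- sapply(best_match2, function(x) het_alleles[x])
    calls2 <- sapply(best_alleles2, paste, collapse = ",")

    is.list(best_match1)  # FALSE
    is.list(best_match2)  # TRUE
    length(calls1)        # 2, one call per row
    length(calls2)        # 2, one call per row; the tied row is comma-joined
    ```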

    However, this does not account for a third scenario.

    Scenario 3

    • dist_mat has nrow = 1

    • which(x == min(x)) returns multiple indices for i = 1 because the minimum of dist_mat[1, ] is tied. In this case, apply() simplifies best_match into a single-column, multi-row matrix rather than a list.

    • best_alleles becomes a single vector of multiple alleles

    • sapply(best_alleles, paste, collapse = ",") does NOT concatenate the multiple alleles in best_alleles as intended; it pastes each allele separately.

    • This creates a situation where ind provides a single slot, whereas sapply(best_alleles, paste, collapse = ",") provides multiple values.
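
    Scenario 3 can be reproduced in isolation with the sketch below. The allele names and distances are hypothetical, and V_CALL_GENOTYPED is reduced to a single slot; the three steps themselves are copied from the snippet above.

    ```r
    het_alleles <- c("IGHV1-2*02", "IGHV1-2*04")  # hypothetical allele names
    dist_mat <- matrix(c(1, 1), ncol = 2)         # one sequence, tied distances
    ind <- 1L                                     # a single slot to fill
    V_CALL_GENOTYPED <- NA_character_

    best_match <- apply(dist_mat, 1, function(x) which(x == min(x)))
    dim(best_match)      # 2 1: a matrix, not a list

    best_alleles <- sapply(best_match, function(x) het_alleles[x])
    replacement <- sapply(best_alleles, paste, collapse = ",")
    length(replacement)  # 2 values for 1 slot

    # Capture the warning text instead of letting it print
    msg <- tryCatch({
      V_CALL_GENOTYPED[ind] <- replacement
      NULL
    }, warning = function(w) conditionMessage(w))
    msg
    ```

    In an English locale, msg should hold the same "number of items to replace is not a multiple of replacement length" text as in the report.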

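    The root of the problem is that R does not always keep a data structure's type stable: apply() and sapply() simplify their results to a vector, matrix, or list depending on the shape of the output, as in the scenario above where a matrix has a row dimension of 1.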

    A more comprehensive fix would be to switch to more type-stable data structures provided by the likes of H. Wickham's tibble package; alas, I'll leave that to future heroes and heroines.

    I provide a quick albeit less elegant fix by explicitly specifying a list data structure throughout the affected steps, and keeping that data structure as a list by using lapply instead of sapply or apply.
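
    A minimal sketch of what such a list-based version could look like. This is not the code actually committed to the package; assign_best_alleles() is a hypothetical helper, and the allele names are made up for illustration.

    ```r
    # Hypothetical helper: returns one comma-joined call per row of dist_mat
    assign_best_alleles <- function(dist_mat, het_alleles) {
      # Split the matrix into a list of rows so that a 1-row matrix
      # is handled the same way as a multi-row one
      rows <- lapply(seq_len(nrow(dist_mat)), function(i) dist_mat[i, ])
      # lapply() never simplifies, so ties always stay grouped per row
      best_match <- lapply(rows, function(x) which(x == min(x)))
      best_alleles <- lapply(best_match, function(x) het_alleles[x])
      # Each list element collapses to exactly one string
      unlist(lapply(best_alleles, paste, collapse = ","))
    }

    het_alleles <- c("IGHV1-2*02", "IGHV1-2*04")  # hypothetical allele names

    # Scenario 3 input: one row, tied minimum
    one_row <- assign_best_alleles(matrix(c(1, 1), ncol = 2), het_alleles)

    # Scenario 2 input: two rows, only the first one tied
    two_rows <- assign_best_alleles(
      matrix(c(1, 1,
               2, 0), ncol = 2, byrow = TRUE),
      het_alleles
    )
    ```

    Because the result always has one element per row, its length matches length(ind) in all three scenarios, so the assignment to V_CALL_GENOTYPED[ind] no longer warns.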
