Document use of collapseDuplicates outside of makeChangeoClone

Issue #76 new
Susanna Marquez created an issue

We should document how to use collapseDuplicates outside makeChangeoClone. It would also be useful to add a grouping option to collapseDuplicates, for example to group by V call, J call and junction length.

Comments (5)

  1. Jason Vander Heiden

    My opinion is that documenting is a good idea, but that we should move away from using `alakazam::collapseDuplicates` for generic duplicate removal, because its purpose is tree preprocessing. I don’t think we should add grouping.

    We should really have a tool in changeo to handle general duplicate removal tasks with a different set of assumptions.

  2. Jason Vander Heiden

    I suppose if you wanted to keep this use case in R, you could split `collapseDuplicates` into two functions: one that requires grouping columns and returns a full data.frame of the input, and another that does the current single-clone lineage preprocessing (the first wrapping the second). Then you can add features to the grouped version that would break the tree preprocessing function, and you avoid having multiple return behaviors in a single function.
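
    As a rough illustration of that split, here is a minimal base R sketch. Both function names (`collapseSingle`, `collapseGrouped`) are hypothetical, and `collapseSingle` is only a stand-in that drops exact duplicate sequences; it does not reproduce what `collapseDuplicates` actually does.

    ```r
    # Hypothetical stand-in for the single-clone step: drops rows whose sequence
    # is an exact duplicate. The real collapseDuplicates does more than this.
    collapseSingle <- function(data, seq = "SEQUENCE_IMGT") {
        data[!duplicated(data[[seq]]), , drop = FALSE]
    }

    # Hypothetical grouped wrapper: grouping columns are required, and the
    # result is always a full data.frame with one collapsed block per group.
    collapseGrouped <- function(data, groups, seq = "SEQUENCE_IMGT") {
        stopifnot(length(groups) > 0, all(groups %in% names(data)))
        # Split on the combination of all grouping columns, dropping empty ones
        parts <- split(data, data[groups], drop = TRUE)
        collapsed <- lapply(parts, collapseSingle, seq = seq)
        result <- do.call(rbind, collapsed)
        rownames(result) <- NULL
        result
    }

    # Toy input; the column names follow the ones used in this thread
    db <- data.frame(
        SEQUENCE_ID = c("A", "B", "C", "D"),
        SEQUENCE_IMGT = c("ATGC", "ATGC", "ATGC", "TTGC"),
        V_CALL = c("IGHV1", "IGHV1", "IGHV2", "IGHV2"),
        stringsAsFactors = FALSE
    )
    collapsed_db <- collapseGrouped(db, groups = "V_CALL")
    ```

    In this toy input, A and B collapse because they share a sequence within the same V_CALL group, while C is kept even though its sequence matches A, because it sits in a different group.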

  3. Susanna Marquez reporter

    The quick, non-parallel version I was thinking of:

    ```r
    library(dplyr)

    collapseDuplicates <- function(data, id="SEQUENCE_ID", seq="SEQUENCE_IMGT",
                                   text_fields=NULL, num_fields=NULL, seq_fields=NULL,
                                   add_count=FALSE, ignore=c("N", "-", ".", "?"),
                                   sep=",", dry=FALSE, verbose=FALSE, group=NULL) {
        if (!is.null(group)) {
          # Collapse within each group by calling this function again without grouping
          data %>%
            group_by(.dots=group) %>%
            do(collapseDuplicates(., id=id, seq=seq,
                                  text_fields=text_fields, num_fields=num_fields,
                                  seq_fields=seq_fields,
                                  add_count=add_count, ignore=ignore,
                                  sep=sep, dry=dry, verbose=verbose, group=NULL))
        } else {
           # here the rest of the code
        }
    }
    ```
    
  4. Jason Vander Heiden

    The problem is that the return behavior will differ with and without group, and so will the assumptions about identical length and clonal relatedness. For example, you may prefer to keep dissimilar annotations rather than merge them, you may want to resolve non-identical length sequences differently, etc. collapseDuplicates is also fragile: any change to how ambiguous characters are handled or how distances are calculated can introduce 0 length branches into the tree and break the lineage methods. You can’t really change how any of that works, which makes it difficult to work with in the general case.
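
    To make the fragility point concrete, here is a small base R sketch of how the set of ignored characters decides whether two sequences count as identical. `seqMatchesIgnoring` is a hypothetical helper written for this illustration, not alakazam's implementation; the default `ignore` set is taken from the signature posted above.

    ```r
    # Hypothetical helper: TRUE when two equal-length sequences match at every
    # position where neither sequence has an ignored character.
    seqMatchesIgnoring <- function(a, b, ignore = c("N", "-", ".", "?")) {
        if (nchar(a) != nchar(b)) {
            return(FALSE)  # assumption: unequal lengths are never duplicates
        }
        x <- strsplit(a, "")[[1]]
        y <- strsplit(b, "")[[1]]
        # Only compare positions that are informative in both sequences
        informative <- !(x %in% ignore) & !(y %in% ignore)
        all(x[informative] == y[informative])
    }

    # With the default ignore set, the trailing N matches anything
    same_default <- seqMatchesIgnoring("ATGN", "ATGC")
    # With nothing ignored, the same pair is treated as two distinct sequences
    same_strict <- seqMatchesIgnoring("ATGN", "ATGC", ignore = character(0))
    ```

    If the collapsing step kept this pair as distinct while a later distance calculation still ignored N, the two sequences would presumably end up at distance zero from each other, which is the kind of 0 length branch described above.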

    For a quick and dirty implementation, I would make a separate function, let's say collapseDb for the sake of argument, where groups is a vector so you can pass it multiple grouping columns (if cloning hasn't been done, you want uniqueness retained within sample/isotype, etc). I would make it required and have it default to either c("V_CALL", "J_CALL", "C_CALL", "JUNCTION_LENGTH") or "CLONE". I would also copy in the masking arguments from makeChangeoClone and add those preprocessing steps to each grouped data.frame. I'd also remove seq_fields, dry and verbose and make add_count=TRUE by default. I think that will most closely mirror changeo.collapseSeq and shazam::distToNearest.

    But, I still think this isn’t the right tool for the job and time is better spent implementing a real solution in changeo than continuing to add duct tape to this one…

  5. Jason Vander Heiden

    Oh, and I'd also add the result of stringi::stri_length on the SEQUENCE_IMGT column to your grouping variables internally. Where exactly that happens will depend on how the masking arguments get implemented.
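
    A minimal sketch of that idea in base R, assuming a made-up internal column name (`SEQ_LENGTH`) and using `nchar()` in place of `stringi::stri_length` only to keep the example free of dependencies:

    ```r
    # Toy input: three sequences with the same V call, one of them longer
    db <- data.frame(
        SEQUENCE_ID = c("A", "B", "C"),
        SEQUENCE_IMGT = c("ATGC", "ATGC", "ATGCAA"),
        V_CALL = c("IGHV1", "IGHV1", "IGHV1"),
        stringsAsFactors = FALSE
    )
    groups <- "V_CALL"

    # Internal length column; SEQ_LENGTH is a hypothetical name for this sketch
    db$SEQ_LENGTH <- nchar(db$SEQUENCE_IMGT)

    # Group on the user-supplied columns plus the internal length column, so
    # sequences of different length never land in the same group
    parts <- split(db, db[c(groups, "SEQ_LENGTH")], drop = TRUE)
    group_sizes <- vapply(parts, nrow, integer(1))
    ```

    Here the two 4-character sequences share a group and the 6-character one gets its own, so whatever collapsing step runs per group only ever sees sequences of identical length.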
