Add readAIRR() and update readChangeoDb(). They should accept a vector of files and check uniqueness

Issue #92 new
ssnn created an issue

Update alakazam::readChangeoDb's file argument to accept a vector of file paths and behave accordingly when length(file) > 1.

For populating sample_id, require file to be a named vector and use the names as the sample_id values, with an argument controlling whether this is done, e.g.:

  • samples = NULL: leave `sample_id` alone
  • samples = "auto": populate `sample_id` from names(file). Use numbers if is.null(names(file)).
  • samples = a character vector with length(samples) == length(file): use these as the sample_id value for each file.
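The three cases above could be handled by a small helper; a minimal sketch, assuming a hypothetical function name and the behavior described in the bullets:

```r
# Hypothetical helper resolving the proposed `samples` argument.
# Returns NULL (leave sample_id alone) or one sample_id per file.
resolveSampleIds <- function(file, samples="auto") {
    if (is.null(samples)) {
        # samples = NULL: leave sample_id alone
        return(NULL)
    }
    if (identical(samples, "auto")) {
        # samples = "auto": use names(file), falling back to numbers
        if (is.null(names(file))) {
            return(as.character(seq_along(file)))
        }
        return(names(file))
    }
    # samples = character vector: must match file in length
    stopifnot(length(samples) == length(file))
    samples
}
```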

uniquify=TRUE could append either a number or the sample_id to the relevant columns. You'll probably also want to check for, and fix, uniqueness within a single file (a numeric suffix would suffice in that case).
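One way to sketch that two-level uniquify idea in base R (the function name is hypothetical; the within-file case leans on make.unique for the numeric suffix):

```r
# Sketch of uniquify: deduplicate ids within a file with a numeric
# suffix, then optionally append the sample_id for cross-file safety.
uniquifyIds <- function(ids, sample_id=NULL) {
    # make.unique appends "_1", "_2", ... to repeated values
    ids <- make.unique(ids, sep="_")
    if (!is.null(sample_id)) {
        ids <- paste(ids, sample_id, sep="_")
    }
    ids
}
```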

Make an alakazam::readAIRR that wraps airr::read_rearrangement and does all of the same things.

You should be able to abstract the AIRR/Changeo functions too, as the only differences seem to be the column names and readr::read_tsv vs airr::read_rearrangement.
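That abstraction could look something like the sketch below: a shared internal reader parameterized by the per-file read function, with readAIRR and readChangeoDb as thin wrappers. All names here are hypothetical, not the final API:

```r
# Hypothetical shared reader: `reader` is the per-file function
# (e.g. airr::read_rearrangement or readr::read_tsv).
readDbFiles <- function(file, reader, samples=NULL) {
    dfs <- lapply(file, reader)
    if (!is.null(samples)) {
        stopifnot(length(samples) == length(file))
        # Tag each data.frame with its sample_id before binding
        dfs <- mapply(function(d, s) { d$sample_id <- s; d },
                      dfs, samples, SIMPLIFY=FALSE)
    }
    do.call(rbind, dfs)
}

# Thin wrappers over the shared reader (sketches only)
readAIRRSketch <- function(file, ...) {
    readDbFiles(file, reader=airr::read_rearrangement, ...)
}
readChangeoDbSketch <- function(file, ...) {
    readDbFiles(file, reader=function(f) readr::read_tsv(f), ...)
}
```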

Comments (2)

  1. Jason Vander Heiden

    This is currently what I do, in case it helps:

    # Imports
    library(tidyverse)
    library(alakazam)
    library(airr)
    
    # Find AIRR files
    files <- dir(DATA_PATH, pattern="db-pass.tsv", full.names=TRUE, recursive=TRUE)
    names(files) <- str_extract(files, "SAM\\d+")
    
    # Load AIRR files
    db <- lapply(files, read_rearrangement) %>%
        bind_rows(.id="sample_id") %>%
        mutate(sequence_id=paste(sequence_id, sample_id, sep="_"),
               cell_id=paste(cell_id, sample_id, sep="_"),
               chain=getChain(locus))
    
    # Filter to productive and remove doublets
    db <- db %>%
        filter(productive, !is.na(chain)) %>%
        group_by(sample_id, cell_id, chain) %>%
        dplyr::slice(which.max(umi_count))
    
