Clonal diversity

Issue #100 resolved
Benjamaporn Sriwilai created an issue

Can alakazam be used to quantify clonal diversity after clonal assignment from other packages such as Scirpy ?

**Data: Single-cell BCR sequences (scBCR-seq)

Comments (13)

  1. Jason Vander Heiden

    Greetings @Benjamaporn Sriwilai , Yes, it can. There shouldn’t be any problem with that as long as the AIRR tsv contains a column for the clonal assignment IDs. If the clone IDs are in a non-standard column, you’ll just need to specify the column name via the clone argument to alphaDiversity (defaults to clone = "clone_id")

  2. Benjamaporn Sriwilai reporter

    Thanks! So, should each row of the input represent one cell ? Because, my clonal assignment IDs are defined by “both heavy and light chain”. I wonder its during observing the smallest number of sequences in alphaDiversity function, so in my situation, it observe the smallest number of “cell”, right ?

  3. Jason Vander Heiden

    Yes, you’ll want to subset to 1 row per cell with the default parameters. I do this by subsetting to locus="IGH” because of how clones are defined, but whatever is appropriate for your data is fine. The default behavior is to treat each row as an observation. However, if you have a summary table (count of cells per clone) you can also specify the copy argument to define the column with clonal abundance counts, instead of treating each row as a cell.

    I just looked at the docs and some of this is hidden in the ... which is passed to estimateAbundance and is therefore documented in that function. Which is a bit confusing, given that clone, copy, and group are pretty important arguments. I think we can fix that.

  4. Benjamaporn Sriwilai reporter

    Thank you so much, I will choose the most appropiate parameter to my data. Next, I am curious about alphaDiversity at q=1. In your coding, you use D <- sapply(q, function(x) sum(p^x)^(1 / (1 - x))) to calculate diversity index, right? However, we cannot use this equation to calculate it at q=1, so you use q=0.9999 to approximate the diversity index of q=1. After that, I try to check the result manually calculated by the mentioned equation at q=0.9999, but the results represent the very small value of D, and unequal to the result from alphaDiversity function (q=1). Why is it ? Feel free to let me know if I am missunderstanding.

  5. Jason Vander Heiden

    There’s a few other steps that happen in alphaDiversity. It’s first passed to estimateAbundance to infer unobserved clonal abundance, smooth the distribution, and uniformly downsample all of the input groups so they can be compared. You should see some decrease in D compared to a raw calculation on an uncorrected abundance distribution, but it shouldn’t be major. If you’re seeing a large increase it is likely the downsampling, particularly if you have one group with small number of observations. That’s the first thing I would check. If so, you can correct it my upping the min_n value, which will selectively filter out groups with too few observations. The diversity calculations tend to suffer from discreteness artifacts below ~1000 observations anyway. (I can’t recall why we set the default at min_n=30 though - that seems rather low.)

    And, yes, q=0.9999 is because of the division by zero problem at q=1.

  6. Benjamaporn Sriwilai reporter

    So, the D value at q = 0.9999 is then calculated following this equation, right ?

  7. Jason Vander Heiden

    It should be pretty close, but not identical because we just plug 0.9999 into the Hill calculation instead of using the Shannon entropy formula:

    > p <- c(1/15, 1/15, 1/5, 2/3)
    > q <- .9999
    > sapply(q, function(x) sum(p^x)^(1 / (1 - x)))
    [1] 2.594272
    > exp(-sum(p*log(p)))
    [1] 2.594181
    

  8. Benjamaporn Sriwilai reporter

    Dear Jason,

    I have a more question. If I estimate abundance by choosing your default (copy=null), and my data structure is “one row represents one cell”. During resampling by bootstrapping, n is then drawn from the estimate complete abundance distribution. So, in my situation I draw n from all cells (detected and unseen cell). Is it correct ? Finally, I can calculate relative abundance of them (clone_id).

  9. Log in to comment