Issue #2 resolved

Existing Network Generation Script?

Erik Kastman
created an issue

Hi Jonathan,

Thanks for putting this together - looks great. Out of curiosity, do you have any out-of-the-box scripts for converting the correlation matrices into edge and node tables that could be loaded into Cytoscape? I'm thinking of the graphs shown in the figures of the 2012 SparCC paper in PLoS Computational Biology (doi:10.1371/journal.pcbi.1002687).

I'm sure that pulling these out is trivial, but if something exists I'd rather ask and not reinvent the wheel. Thanks in advance,

Erik Kastman

Comments (11)

  1. Jonathan Friedman repo owner

    Hi Erik,

    To generate the figures for the PLoS Computational Biology paper I actually generated a networkx network object, which I plotted using custom scripts. So I never needed to export to anything else.

    That said, allowing an easy way of converting the correlation/p-value tables into a network is a good idea. My current thought is to add functionality for easily creating a networkx object, which can be exported in most major network formats. Also, I'll likely add such functionality to PySurvey, which is a more extensive and up-to-date package.

    Cheers, Jonathan
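
A minimal sketch of the conversion discussed above, for readers who want Cytoscape-ready tables now. It is not code from SparCC or PySurvey: the function names `corr_to_graph` and `graph_to_tables`, the cutoff values, and the toy correlation matrix are all assumptions made for illustration. It only assumes the correlations (and optionally the p-values) are available as square pandas DataFrames labeled by OTU.

```python
import networkx as nx
import pandas as pd


def corr_to_graph(corr, pvals=None, corr_cutoff=0.3, p_cutoff=0.05):
    """Build an undirected graph with one edge per strong (and significant) correlation.

    Hypothetical helper: the cutoffs are illustrative defaults, not recommended values.
    """
    graph = nx.Graph()
    labels = list(corr.index)
    graph.add_nodes_from(labels)
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:  # upper triangle only, skips self-correlations
            r = float(corr.loc[a, b])
            if abs(r) < corr_cutoff:
                continue
            if pvals is not None and pvals.loc[a, b] > p_cutoff:
                continue
            graph.add_edge(a, b, weight=r, sign="positive" if r > 0 else "negative")
    return graph


def graph_to_tables(graph):
    """Return (edge_table, node_table) as DataFrames that can be saved as CSV."""
    edges = pd.DataFrame(
        [(u, v, d["weight"], d["sign"]) for u, v, d in graph.edges(data=True)],
        columns=["source", "target", "weight", "sign"],
    )
    nodes = pd.DataFrame(
        [(n, graph.degree(n)) for n in graph.nodes()],
        columns=["id", "degree"],
    )
    return edges, nodes


# Toy correlation matrix (made-up values) for three OTUs.
otus = ["OTU_1", "OTU_2", "OTU_3"]
corr = pd.DataFrame(
    [[1.0, 0.8, -0.1], [0.8, 1.0, -0.5], [-0.1, -0.5, 1.0]],
    index=otus,
    columns=otus,
)
graph = corr_to_graph(corr)
edges, nodes = graph_to_tables(graph)
```

The two tables can then be written with `edges.to_csv(...)` and `nodes.to_csv(...)` and imported into Cytoscape, or the graph itself can be saved with `nx.write_graphml(graph, "network.graphml")`, a format Cytoscape also reads.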

  2. Erik Kastman reporter

    A built-in networkx export would definitely be useful. Seems like the MatrixDictionary is commented well enough that I could give it a try if I get some time, or some framework code in the docs would also be very helpful.

    I'll check out PySurvey, though having SparCC as a separate project made it easy to pick up quickly. Are the SparCC functions the correlations listed in Analysis Methods, or are they not yet exposed through the PySurvey API?

    Also, I should probably ask this at the PySurvey issues, but I have a question about the format. The docs say:

    "Most pysurvey functions operate on a matrix containing counts or fractions of a set of components over a set of samples. The standard convention is to have rows correspond to samples and columns to components (e.g. OTUs)."

    However, example/fake_data.txt distributed with Sparcc is transposed (rows are OTUs and columns are Samples), and that's the same format that you get from the output of biom convert. Are those just outliers, or am I missing something?

    Thanks again for your help,

    Erik

  3. Jonathan Friedman repo owner

    The correlation function just computes the "normal" correlations. The SparCC correlations are given by basis_corr. Also, you can import the SparCC module from pysurvey, which contains functions for computing correlations, making permutations, and computing p-values:

    from pysurvey import SparCC
    

    Most of the code for making these usable as commandline tools is already in place, but is not 100% ready yet.

    As for the data format, representing variables as columns and observations as rows is pretty standard in multivariate analysis, and most implementations (in C, R, MATLAB...) expect that format. In the context of 16S surveys, when the data is stored in text files, this convention is usually inverted. The reason is that 16S surveys have two pretty unusual features: i) there are many more variables (OTUs) than observations (samples); ii) there is metadata for variables which is included (lineage information for OTUs).

    I decided to have the data structure holding the data be in standard form (rows=samples, cols=OTUs), so that it could be passed on to external analysis functions as is. So, when the OTU table is imported it is transposed by default (and the first few col/row labels are printed out to warn the user about that). However, this is a constant source of confusion, so I guess it should be made clearer in the docs.

    Cheers,

    Jonathan
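
To make the two conventions above concrete, here is a small sketch using plain pandas rather than the SparCC or PySurvey readers. The table contents, the `OTU_id` header, and the `Lineage` column name are invented for the example; real files may label these differently.

```python
import io

import pandas as pd

# Hypothetical OTU table in the usual text-file layout:
# rows are OTUs, columns are samples, and the last column holds lineage metadata.
raw = io.StringIO(
    "OTU_id\tsample_1\tsample_2\tsample_3\tLineage\n"
    "OTU_1\t10\t0\t5\tk__Bacteria;p__Firmicutes\n"
    "OTU_2\t3\t7\t1\tk__Bacteria;p__Proteobacteria\n"
)
table = pd.read_csv(raw, sep="\t", index_col=0)

# Split off the per-OTU metadata so only numeric counts remain.
lineages = table.pop("Lineage")

# Transpose into the standard multivariate layout: rows=samples, cols=OTUs.
counts = table.T

# Print the first few labels of each axis as a sanity check on the orientation.
print("rows:", list(counts.index[:3]))
print("cols:", list(counts.columns[:3]))
```

Checking the printed labels once after loading is a cheap way to confirm which orientation a given function will see.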

  4. Erik Kastman reporter

    Hey Jonathan,

    Got it - makes sense. A few lines about the 16S survey features in the docs would be very helpful.

    Sorry, but I'm still confused about where the sample data in fake_data.txt (which is 16S format rows=OTUs,cols=samples) gets transposed, though. There is an option to transpose the matrix in SurveyMatrix.from_file, but it doesn't look like the runner SparCC.py is ever calling it. However, you're right that basis_corr is expecting col=OTUs,row=samples. Am I missing something walking through the method chain? It appears to be straightforward to transpose the matrix using the option in from_file, but I want to be sure that it's not happening under the hood somewhere.

    Would this be a moot point if I just switch to PySurvey and use the IO tools there? ;)

    Also, are there any cases where it's better to normalize using something besides Dirichlet? My data are a set of metagenomes that seems pretty standard, but I just want to be as correct as possible.

    Thanks again - this is a huge help.

    Erik

  5. Jonathan Friedman repo owner

    In SparCC, basis_corr transposes the data so that you still get correlations between OTUs by default. In PySurvey, the data is transposed when read and not in basis_corr, and you get the same effect. Bottom line is that with either of these packages the default settings will read files in the usual format (rows=OTUs) and compute correlations between OTUs.

    With regards to the normalization, the goal is to estimate the true fractions in the community from the counts data. The appropriate estimator depends on the way the sampling was done. For unbiased, independent draws from a large community, Dirichlet is the appropriate posterior distribution of fraction values. Such sampling is typically assumed, but if you have reason to believe the above assumptions are significantly violated, a different posterior distribution may be required.

    Hope this helps,

    Jonathan
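
As an illustration of the Dirichlet estimate described above, the sketch below uses NumPy directly and is not the SparCC or PySurvey implementation. It assumes a uniform prior, expressed as a pseudocount of 1 added to every count, so that the posterior over a sample's fractions is Dirichlet(counts + 1). The function names and the toy counts are hypothetical.

```python
import numpy as np


def dirichlet_fractions(counts, pseudocount=1.0, rng=None):
    """Draw one posterior sample of fractions per sample (row).

    Assumes unbiased, independent draws from a large community and a
    uniform prior (pseudocount of 1); both are assumptions of this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(counts, dtype=float)
    return np.vstack([rng.dirichlet(row + pseudocount) for row in counts])


def posterior_mean_fractions(counts, pseudocount=1.0):
    """Posterior mean of the same Dirichlet: (counts + pseudocount), row-normalized."""
    shifted = np.asarray(counts, dtype=float) + pseudocount
    return shifted / shifted.sum(axis=1, keepdims=True)


# Toy counts (made up): rows=samples, cols=OTUs.
counts = np.array([[10, 0, 5], [3, 7, 1]])
sampled = dirichlet_fractions(counts, rng=np.random.default_rng(0))
mean = posterior_mean_fractions(counts)
```

Each row of `sampled` sums to 1 and varies from draw to draw, while `mean` is deterministic. Note that an OTU with zero counts still receives a small nonzero fraction under this prior.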

  6. Erik Kastman reporter

    Sorry Jonathan, one more question and then I'll close this issue.

    Is your OTUNetwork class available anywhere? I see it referenced as HMPStructures.OTUNetwork or SurveyStructures.OTUNetwork at several places, but I don't see it in the codebase of SparCC, PySurvey, or any of the Gore Lab utilities. Again, from the API I can figure out what it's doing so if it's private within the lab I completely understand, but if you were willing to share it it would save a lot of effort.

    Cheers again,

    Erik

  7. Jonathan Friedman repo owner

    Hi Erik,

    No worries, you can get OTUNetwork here. I haven't put it up since it's quite messy and pretty specialized for my needs, but you're more than welcome to use any bits of it that you find useful.

    J

  8. Erik Kastman reporter

    This should get me everything I need - thanks!

    Looks like a lot, but not all, of the functionality has been duplicated in SparCC and PySurvey. If I were to start refactoring (say, for easier export of the networkx object), I should do it in the PySurvey repo, right?

  9. Jonathan Friedman repo owner

    Yeah, I think working with PySurvey would be better. It uses the pandas DataFrame as the main object, which is much more mature than the structures I've made in SparCC. I've also improved many of the functions' implementations in PySurvey (including SparCC).

    I agree that a nice feature of SparCC is that it's so easy to pick up as a standalone module. However, since PySurvey is available on pip, as are its required dependencies, I don't think installing PySurvey should be too challenging.

    Any refactoring you can do would be greatly appreciated, as I'm currently spending most of my time on unrelated projects.
