Commits

mimakaev  committed 0783632

Added new class to analyse Hi-C data at high resolution.

  • Parent commits c09bc32


Files changed (4)

File doc/source/highResBinnedData.rst

+High-resolution analysis of binned data
+=======================================
+  
+.. automodule:: hiclib.highResBinnedData 
+    :members:

File doc/source/index.rst

 7. Test the installation using the scripts from the /tests folders.
 
 .. note:: If you're upgrading a package (numpy/scipy/matplotlib/cython/etc...) 
-          using the pip Python package manager, never use "pip install PACKAGENAME",
-          only "pip install --upgrade PACKAGENAME"! 
+          using the pip Python package manager, never use plain "pip install PACKAGENAME";
+          use "pip install --upgrade PACKAGENAME" instead!
 
           If you have another version of the same package installed via the
           standard system package manager, e.g. apt-get or aptitude,
 For example, working with 3 datasets at 200-kb resolution will require: 
 (3 GB/200kb)^2 * (8 bytes per float64) * (3 datasets + 1 extra) = 8 GB RAM
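This back-of-the-envelope estimate can be checked directly (a quick sketch using the numbers quoted above):

```python
# RAM estimate for dense binnedData storage, following the formula above.
genome_bp = 3e9          # ~3 Gb human genome
resolution = 200e3       # 200-kb bins
n_bins = genome_bp / resolution              # 15000 bins
bytes_per_float64 = 8
n_matrices = 3 + 1                           # 3 datasets + 1 extra
ram_bytes = n_bins ** 2 * bytes_per_float64 * n_matrices
print("%.1f GB" % (ram_bytes / 1e9))         # ~7.2 GB, i.e. about 8 GB
```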
 
+highResBinnedData uses sparse logic to store the Hi-C matrix in memory (in a compressed format) or on the HDD.
+Even in-memory storage often fits a 500-million-read human Hi-C dataset at 10-kb resolution in under 8 GB (thanks to HDF5!)
+
 Timewise, fragment-based correction of a dataset is relatively quick, being about an hour for 500 million raw reads.
 Binned data analysis performance depends on resolution only, usually being of the order of seconds for 1M-resolution analysis, and scaling as (1/resolution^2),
-reaching few minutes for 200kb resolution, and tens minutes for 100kb. 
+reaching a few minutes for 200-kb resolution, and tens of minutes for 100 kb. High-resolution binned-data analysis takes a few hours to complete.
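The quadratic scaling can be made concrete with a quick calculation (the 10-second baseline at 1-Mb resolution below is an illustrative assumption, not a benchmark):

```python
# Relative timing from the (1/resolution^2) scaling described above.
base_seconds = 10.0  # assumed time at 1-Mb resolution (illustrative)
for res_kb in (1000, 200, 100):
    t = base_seconds * (1000.0 / res_kb) ** 2
    print("%4d kb: ~%.0f seconds" % (res_kb, t))
```

Halving the bin size quadruples the running time, which is why a 200-kb analysis lands in the minutes range and a 100-kb analysis in the tens of minutes.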
 
 Overall, the time-limiting step is mapping, and it is the only part of the code that is well-parallelized.
 Multi-threading is only utilized by bowtie when mapping reads, and it takes a few days to map a 500-million-read dataset on 
 
 An example script shows how fragment-level analysis can be used to merge multiple datasets together, filter them, and save heatmaps to an h5dict.
 
-Heatmaps and vectors of SS reads are then exported by fragment-level analysis to an h5dict, that can be passe to a binnedData class. 
+Heatmaps and vectors of SS reads are then exported by fragment-level analysis to an h5dict, which can be passed to a binnedData class.
 BinnedData class will load this h5dict, or any other dict-like object,
 perform multiple types of iterative correction and eigenvector expansion, and compare results with different genomic tracks.
 
+The high-resolution binned data class can be used to perform iterative correction at high resolutions (it was tested at 10-kb resolution).
+It is much less feature-rich and relies on the user to do all the analysis. It can operate on only one dataset at a time.
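The iterative correction performed by this class follows the general Sinkhorn-style scheme below (a self-contained numpy sketch of the technique, not hiclib's actual API):

```python
import numpy as np

def iterative_correction(m, tolerance=1e-6, max_iter=1000):
    """Symmetric iterative correction: repeatedly divide rows and columns
    by their normalized marginals until all nonzero marginals are equal."""
    m = np.asarray(m, dtype=float).copy()
    biases = np.ones(len(m))
    for _ in range(max_iter):
        s = m.sum(axis=1)             # marginals (per-row sums)
        mask = s > 0
        s = s / s[mask].mean()        # normalize nonzero marginals to mean 1
        s[~mask] = 1.0                # leave empty rows untouched
        m /= s[:, None]
        m /= s[None, :]
        biases *= s
        if np.abs(s[mask] - 1).max() < tolerance:
            break
    return m, biases

heatmap = np.array([[0., 2., 4.],
                    [2., 0., 8.],
                    [4., 8., 0.]])
corrected, biases = iterative_correction(heatmap)
```

After convergence every nonzero row of `corrected` sums to the same value, which is the stopping criterion the high-resolution class also uses.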
+
 
 Example scripts
 ---------------
    mapping
    fragment 
    binneddata
+   highResBinnedData
    tutorial
    troubleshooting
 
 
+
 Indices and tables
 ==================
 

File src/hiclib/fragmentHiC.py

                                   os.path.split(self.filename)[0])
 
         self.h5dict = h5dict(self.filename, mode=mode, in_memory=inMemory)
-        try:
-            self.DSNum = self.DS.sum()
-        except:
-            pass
+        if "DS" in self.h5dict.keys():
+            DS = self.DS
+            self.N = len(DS)
+            self.DSnum = DS.sum()
 
     def _setData(self, name, data):
         "an internal method to save numpy arrays to HDD quickly"
             Resolution of an all-by-all heatmap
         countDiagonalReads : "once" or "twice"
             How many times to count reads in the diagonal bin
-        
+
         """
 
         try:

File src/hiclib/highResBinnedData.py

+#(c) 2012 Massachusetts Institute of Technology. All Rights Reserved
+# Code written by: Maksim Imakaev (imakaev@mit.edu)
+
+"""
+This is a basic toolbox for high-resolution analysis of Hi-C data.
+It can perform basic filtering (removing poor-coverage regions) and iterative correction at any resolution.
+
+The main advantage of this class is that it supports both in-memory and HDD storage,
+and for HDD storage it supports both sparse and dense matrix logic.
+
+Class structure
+---------------
+
+Class defaultMatrix implements a wrapper around a 2D numpy.array matrix. This
+class defines methods that will be applied to a Hi-C map between two
+chromosomes. This class serves as a template and fallback for other Hi-C map
+storage classes, which are inherited from this class.
+
+Class h5dictMatrix implements a subclass of defaultMatrix, where the matrix is
+in fact stored on the HDD. It leaves all other methods intact, as all other
+methods use only self.setData() and self.getData(). Therefore, overriding the
+getter and setter is enough to move matrix storage from memory to an h5dict.
+
+Class h5dictSparseMatrix implements a subclass of defaultMatrix, where all the
+methods were overridden to operate in a sparse logic. It stores a Hi-C map as
+three arrays: X coordinate, Y coordinate and Value. Therefore, for sparse
+matrices all overridden methods will be much faster than the default, as they
+bypass calling self.getData and creating a huge matrix in memory. It is
+suggested to use h5dictMatrix for cis maps and h5dictSparseMatrix for trans
+(between-chromosome) maps.
+
+Class HiResHiC loads a by-chromosome Hi-C map, with either cis only,
+or cis and trans matrices. It can then perform iterative correction of the
+map, along with some other filtering. Unlike binnedData, it does not support
+multiple datasets at once, complicated filtering, or PCA. It will support a by-
+chromosome domain finder from cis reads at some point, once we converge on the
+ideal algorithm for this.
+
+On my machine with 32GB RAM this class was successfully used to perform IC of a human
+Hi-C dataset at 10-kb resolution. It took 20 minutes to load the data, 4 minutes to
+remove poor bins, and a couple of hours to perform IC. Note that the data was
+in fact stored in memory for this, and it never used more than 16GB of
+RAM; in fact, creating this dataset used more, but this will be optimized later.
+"""
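The sparse logic described in the docstring amounts to storing (x, y, value) triplets; marginals can then be computed with np.bincount without ever materializing the dense matrix. A standalone numpy sketch (toy values, not the h5dict-backed classes defined below):

```python
import numpy as np

dense = np.array([[0., 3., 0.],
                  [0., 0., 5.],
                  [2., 0., 0.]])

# Triplet (COO) representation: X coordinate, Y coordinate, Value
x, y = np.nonzero(dense)
v = dense[x, y]

# Per-column and per-row sums straight from the triplets
col_sums = np.bincount(y, weights=v, minlength=dense.shape[1])
row_sums = np.bincount(x, weights=v, minlength=dense.shape[0])
```

For a sparse Hi-C map this bypasses building the dense matrix entirely, which is why the sparse class is preferred for trans maps.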
+
+
+from mirnylib.genome import Genome
+import numpy as np
+import warnings
+from mirnylib.h5dict import h5dict
+from mirnylib.plotting import mat_img
+from mirnylib.numutils import removeDiagonals
+from mirnylib.systemutils import setExceptionHook
+
+setExceptionHook()
+
+
+class defaultMatrix(object):
+    """
+    This is a template object which stores matrix in memory.
+    Alternatively, matrix can be stored in an h5dict, either normally
+    or in a sparse mode.
+
+    All the methods should first be implemented here using getData() and setData().
+    Then they should be translated to sparse subclasses of this class.
+    """
+    def __init__(self, data=None, dictToSave=None, key=""):
+        """
+        Initializes the object that stores the Hi-C matrix.
+
+        Parameters
+        ----------
+
+        data : 2D matrix in any format
+            Hi-C matrix between two chromosomes
+        dictToSave : dict, h5dict or any other dict-like structure
+            Dict to store actual data. Should be h5dict for high resolution analysis.
+            Is not needed for defaultMatrix; will be used by its subclasses only.
+        key : str or anything
+            A key, unique for each pair of chromosomes, used to identify the dataset
+            in dictToSave. It is provided by HiResHiC.
+        """
+        self._h5dict = dictToSave
+        self._key = repr(key)
+
+        if data is not None:
+            self.setData(data)
+
+    def getData(self):
+        "Returns a matrix in a dense format (NxM array)"
+        return self.data.copy()
+
+    def setData(self, data):
+        """Accepts the matrix to store. Here, just puts it in RAM.
+        For further subclasses this will need to convert the
+        matrix to the sparse form.
+        """
+        data = np.asarray(data, dtype=np.float64)
+        assert len(data.shape) == 2
+        self.data = data
+
+    def getSumX(self):
+        "Returns the sum over the first (0) axis, i.e. the sum of each column"
+        return np.sum(self.getData(), axis=0)
+
+    def getSumY(self):
+        "Returns the sum over the second (1) axis, i.e. the sum of each row"
+        return np.sum(self.getData(), axis=1)
+
+    def getSums(self):
+        "returns getSumX and getSumY without calling getData() twice"
+        data = self.getData()
+        sumX = np.sum(data, axis=0)
+        sumY = np.sum(data, axis=1)
+        return (sumX, sumY)
+
+    def clearRows(self, rows):
+        """Sets to 0 certain rows and columns
+
+        Parameters
+        ----------
+        rows : tuple of two arrays (rows, columns)
+            Two arrays which hold indices of rows and columns to be set to zero
+        """
+        rowsX, rowsY = rows
+        data = self.getData()
+        data[rowsX, :] = 0
+        data[:, rowsY] = 0
+        self.setData(data)
+
+    def divideByVectorX(self, vectorX):
+        """
+        Divides each column by the corresponding value from vectorX
+        """
+        vectorX[vectorX == 0] = 1
+        data = self.getData()
+        data /= vectorX[None, :]
+        self.setData(data)
+
+    def divideByVectorY(self, vectorY):
+        """
+        Divides each row by the corresponding value from vectorY
+        """
+
+        vectorY[vectorY == 0] = 1
+        data = self.getData()
+        data /= vectorY[:, None]
+        self.setData(data)
+
+    def divideByVectors(self, vectors):
+        """
+        Divides each column by the corresponding value from vectors[0],
+        and each row by the corresponding value from vectors[1].
+
+        Does it without calling getData twice!
+        """
+
+        vecX = vectors[0]
+        vecY = vectors[1]
+        vecX[vecX == 0] = 1
+        vecY[vecY == 0] = 1
+        data = self.getData()
+        assert data.shape[1] == len(vecX)
+        assert data.shape[0] == len(vecY)
+        data /= vecX[None, :]
+        data /= vecY[:, None]
+        self.setData(data)
+
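The division methods above rely on numpy broadcasting: a length-ncols vector indexed as vecX[None, :] divides every column, and a length-nrows vector indexed as vecY[:, None] divides every row. A minimal standalone illustration:

```python
import numpy as np

data = np.arange(6, dtype=float).reshape(2, 3)   # [[0,1,2],[3,4,5]]
vecX = np.array([1., 2., 4.])    # one divisor per column (length ncols)
vecY = np.array([1., 10.])       # one divisor per row (length nrows)

# vecX[None, :] has shape (1, 3): broadcast down the rows, dividing columns.
# vecY[:, None] has shape (2, 1): broadcast across the columns, dividing rows.
out = data / vecX[None, :] / vecY[:, None]
```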
+    @property
+    def shape(self):
+        """Returns shape of the data.
+        Should be overridden in subclasses
+        so that the data is not loaded every time! """
+        return self.getData().shape
+
+
+class h5dictMatrix(defaultMatrix):
+    """
+    Changes the storage from memory to h5dict, keeping the matrix
+    in the dense (regular) format.
+    Just overrides getData, setData and shape to use h5dict.
+    """
+    def getData(self):
+        data = self._h5dict[self._key]
+        return data
+
+    def setData(self, data):
+        data = np.asarray(data, dtype=np.float64)
+        self.savedShape = data.shape
+        self._h5dict[self._key] = data
+
+    @property
+    def shape(self):
+        return self.savedShape
+
+
+class h5dictSparseMatrix(defaultMatrix):
+    """
+    Changes the storage from memory to h5dict,
+    and changes matrix to sparse.
+    All methods from defaultMatrix are overridden here!
+    """
+
+    def __init__(self, data=None, dictToSave=None, key=""):
+        self._h5dict = dictToSave
+        self._key = repr(key)
+
+        self._keyx = self._key + "x"
+        self._keyy = self._key + "y"
+        self._keyv = self._key + "v"
+        if data is not None:
+            self.setData(data)
+
+    @property
+    def _X(self):
+        return self._h5dict[self._keyx]
+
+    @_X.setter
+    def _X(self, data):
+        self._h5dict[self._keyx] = data
+
+    @property
+    def _Y(self):
+        return self._h5dict[self._keyy]
+
+    @_Y.setter
+    def _Y(self, data):
+        self._h5dict[self._keyy] = data
+
+    @property
+    def _V(self):
+        return self._h5dict[self._keyv]
+
+    @_V.setter
+    def _V(self, data):
+        self._h5dict[self._keyv] = data
+
+    @property
+    def shape(self):
+        return self.savedShape
+
+    def getData(self):
+        data = np.zeros(self.savedShape)
+        data[self._X, self._Y] = self._V
+        return data
+
+    def setData(self, data):
+        x, y = np.nonzero(data)
+        self.savedShape = data.shape
+        values = data[x, y]
+        values = np.asarray(values, np.float64)
+        self._X = x
+        self._Y = y
+        self._V = values
+
+    def getSumX(self):
+        return np.bincount(self._Y, weights=self._V, minlength=self.shape[1])
+
+    def getSumY(self):
+        return np.bincount(self._X, weights=self._V, minlength=self.shape[0])
+
+    def getSums(self):
+        X, Y, V = self._X, self._Y, self._V
+        s1 = np.bincount(Y, weights=V, minlength=self.shape[1])
+        s2 = np.bincount(X, weights=V, minlength=self.shape[0])
+        return (s1, s2)
+
+    def divideByVectorX(self, vecX):
+        vecX[vecX == 0] = 1
+        self._V = self._V / vecX[self._Y]
+
+    def divideByVectorY(self, vecY):
+        vecY[vecY == 0] = 1
+        self._V = self._V / vecY[self._X]
+
+    def divideByVectors(self, vecs):
+        vecX = vecs[0]
+        vecY = vecs[1]
+        V = self._V
+        V /= vecX[self._Y]
+        V /= vecY[self._X]
+        self._V = V
+
+    def clearRows(self, rows):
+        rowsX, rowsY = rows
+        indexX = np.ones(self.shape[0], bool)
+        indexY = np.ones(self.shape[1], bool)
+        indexX[rowsX] = False
+        indexY[rowsY] = False
+        X = self._X
+        Y = self._Y
+        mask = indexX[X]
+        mask *= indexY[Y]
+        self._X = X[mask]
+        self._Y = Y[mask]
+        self._V = self._V[mask]
+
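The masking trick used by clearRows can be illustrated on bare triplets (a standalone numpy sketch with toy values):

```python
import numpy as np

# Triplets of a 3x3 sparse matrix: entries (0,1)=3, (0,2)=4, (1,2)=5, (2,0)=2
x = np.array([0, 0, 1, 2])
y = np.array([1, 2, 2, 0])
v = np.array([3., 4., 5., 2.])

keep_row = np.ones(3, bool)
keep_col = np.ones(3, bool)
keep_row[[1]] = False           # zero out row 1
keep_col[[0]] = False           # zero out column 0

# An entry survives only if both its row and its column are kept
mask = keep_row[x] & keep_col[y]
x, y, v = x[mask], y[mask], v[mask]
```

Dropping triplets is equivalent to setting those rows and columns to zero in the dense matrix, but touches only the nonzero entries.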
+
+"""
+
+a = np.zeros((20, 100), float)
+a.flat[np.random.randint(0, 2000, 100)] = np.random.random(100)
+dd = h5dict("bla", in_memory=True)
+b = h5dictSparseMatrix(a, dd)
+b2 = defaultMatrix(a)
+
+
+sums = b.getSums()
+sums2 = b2.getSums()
+print [len(i) for i in sums],
+print [len(i) for i in sums2]
+print b.getData().sum(),
+print b2.getData().sum()
+b2.divideByVectors(sums)
+b.divideByVectors(sums)
+print ((b.getSumX() - b2.getSumX()) ** 2).sum()
+print b2.getData().sum()
+exit()
+
+"""
+
+
+class HiResHiC(object):
+    """
+    Class to store Hi-C data at high resolution,
+    perform basic filtering and iterative correction.
+
+    This class provides efficient storage for Hi-C data, allowing one to retrieve the
+    heatmap between any two chromosomes quickly.
+    """
+
+    def __init__(self, genome, resolution, storageFile="inMemory", mode="w"):
+        """
+        Initializes the high-resolution Hi-C data storage.
+
+        Parameters
+        ----------
+
+        genome : folder or Genome object
+            matching Genome object or folder to load it from
+        resolution : int
+            Resolution (number of bp per bin)
+        storageFile : str (optional)
+            File in which to store the h5dict; the file will be created.
+            By default, data is stored in memory.
+        mode : "w", "w-" "r+" or "a", optional
+            Access mode to h5dict (see h5dict manual)
+        """
+
+        inMemory = (storageFile == "inMemory")
+
+        self._h5dict = h5dict(storageFile, mode=mode, in_memory=inMemory)
+
+        if type(genome) == str:
+            genome = Genome(genome, readChrms=["#", "X"])
+        assert isinstance(genome, Genome)
+        self.genome = genome
+
+        self.resolution = resolution
+        self.genome.setResolution(resolution)
+
+        if self.genome.numBins < 7000:
+            print "Total number of bins in the genome is just %d" % self.genome.numBins
+            warnings.warn("For low-resolution analysis use binnedData, as it provides "
+                          "more analysis tools")
+
+        M = self.genome.chrmCount
+        self.cisKeys = [(i, i) for i in xrange(M)]
+        self.transKeys = [(i, j) for i in range(M) for j in range(M) if j > i]
+        self.allKeys = self.cisKeys + self.transKeys
+
+        self.data = {}
+
+    def _checkConsistency(self):
+        "Checks if shapes of all datasets are correct"
+
+        for i in self.data:
+            shape = self.data[i].shape
+            chr1, chr2 = i
+            expected = (self.genome.chrmLensBin[chr1],
+                        self.genome.chrmLensBin[chr2])
+            if (expected[0] != shape[0]) or (expected[1] != shape[1]):
+                raise ValueError("Wrong dimensions for chromosome pair {2}: expected {0}, "
+                                 "got {1}".format(expected, shape, i))
+
+    def _marginalError(self, marginals=None):
+        "Checks after each pass of IC, if marginals are close enough to 1"
+        if marginals is None:
+            if not hasattr(self, "marginals"):
+                return 99999
+            marginals = self.marginals
+        marginals = np.concatenate(marginals)
+        marginals = marginals[marginals != 0]
+        error = np.max(np.abs(marginals - marginals.mean()))
+        return error / marginals.mean()
+
+    def getCombinedMatrix(self, force=False):
+        """Combines all matrices into one giant matrix.
+
+        Should be used for testing only, as it will run out of RAM at high resolution.
+        """
+        if self.genome.numBins > 15000:
+            warnings.warn("Resolution is too high. Be careful with RAM!")
+        if (self.genome.numBins > 30000) and (force is False):
+            raise RuntimeError("Matrix will take more than 8GB RAM. Use force to override")
+        t = np.zeros((self.genome.numBins, self.genome.numBins), float)
+        for chr1, chr2 in self.data:
+            data = self.data[(chr1, chr2)].getData()
+            beg1, end1 = self.genome.chrmStartsBinCont[
+                chr1], self.genome.chrmEndsBinCont[chr1]
+            beg2, end2 = self.genome.chrmStartsBinCont[
+                chr2], self.genome.chrmEndsBinCont[chr2]
+            t[beg1:end1, beg2:end2] = data
+            t[beg2:end2, beg1:end1] = data.T
+        return t
+
+    def setRowsToZero(self, rows):
+        """Sets selected rows and the corresponding columns to zero
+
+        Parameters
+        ----------
+        rows : list of arrays/lists
+            List of per-chromosome row numbers to be removed.
+            Counting starts from 0 for each chromosome.
+        """
+        rows = [np.array(i) for i in rows]
+        for chr1, chr2 in self.data:
+            self.data[(chr1, chr2)].clearRows((rows[chr1], rows[chr2]))
+
+    def loadData(self, dictLike,
+                 keyFunction=lambda x: ("%d %d" % x),
+                 mode="All",
+                 cisProtocol=h5dictMatrix,
+                 transProtocol=h5dictSparseMatrix):
+        """
+        Parameters
+        ----------
+
+        dictLike : dictionary-like structure or str
+            either by-chromosome h5dict, generated by fragmentHiC, or
+            any dictionary-like object with by-chromosome heatmaps
+        keyFunction : function tuple->string
+            Function to convert chromosome pairs (chr1, chr2) to a
+            key in a dictLike. Default: "chr1 chr2"
+        mode : "all" or "cis"
+            Use or not use trans chromosome maps
+        cisProtocol : class similar to defaultMatrix
+            see below
+        transProtocol : class similar to defaultMatirx
+            see below
+
+
+        cisProtocol should implement all functions currently defined in the
+        defaultMatrix protocol. If inherited from defaultMatrix, it should
+        implement proper get and set functions. It cannot store the matrix itself
+        in memory, and should forget it after any function call.
+        """
+
+        if mode.lower() not in ["cis", "all"]:
+            raise ValueError("Mode can be only 'cis' or 'all'")
+
+        if type(dictLike) == str:
+            try:
+                dictLike = h5dict(dictLike, 'r')
+            except Exception:
+                raise ValueError("Cannot open h5dict at filename %s" %
+                                 dictLike)
+
+        for myKey in self.cisKeys:
+            try:
+                data = dictLike[keyFunction(myKey)]
+            except KeyError:
+                raise KeyError("Key {0} not found in h5dict".format(
+                    keyFunction(myKey)))
+
+            self.data[myKey] = cisProtocol(
+                data, dictToSave=self._h5dict, key=myKey)
+
+        if mode.lower() == "all":
+            for myKey in self.transKeys:
+                try:
+                    data = dictLike[keyFunction(myKey)]
+                except KeyError:
+                    raise KeyError("Key {0} not found in h5dict".format(
+                        keyFunction(myKey)))
+                self.data[myKey] = transProtocol(
+                    data, dictToSave=self._h5dict, key=myKey)
+
+        self._checkConsistency()
+
+    def getMarginals(self, normalizeForIC=False):
+        """
+        Returns a sum over each row/column, and saves them to self.marginals
+
+        normalizeForIC=True will normalize them to mean 1.
+        """
+
+        marginals = [np.zeros(i, float) for i in self.genome.chrmLensBin]
+        for chr1, chr2 in self.data:
+            m2, m1 = self.data[(chr1, chr2)].getSums()
+            marginals[chr1] += m1
+            if chr1 != chr2:
+                marginals[chr2] += m2
+
+        self.marginals = marginals
+        if normalizeForIC is True:
+            allMarginals = np.concatenate(marginals)
+            divide = allMarginals[allMarginals > 0].mean()
+            for i in marginals:
+                i /= divide
+                i[i == 0] = 1
+                i -= 1
+                i *= 0.6
+                i += 1
+
+        return marginals
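The in-place steps in getMarginals implement a damped update: each marginal is pulled only 60% of the way toward its mean, which stabilizes IC convergence. A standalone numpy illustration with toy values:

```python
import numpy as np

# Damped renormalization of marginals before each IC division step
# (mirrors the i -= 1; i *= 0.6; i += 1 sequence; 0.6 is the damping factor).
marg = np.array([0.5, 2.0, 0.0, 1.5])   # toy marginals; one zero (masked) bin

m = marg / marg[marg > 0].mean()   # normalize nonzero bins to mean 1
m[m == 0] = 1                      # zero bins get a neutral divisor of 1
m -= 1
m *= 0.6                           # pull only 60% of the way toward the mean
m += 1
```

Dividing by these damped marginals takes a smaller correction step per pass than full normalization would, trading a few extra passes for smoother convergence.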
+
+    def divideByMarginals(self, marginals=None):
+        """Divides the matrix by marginals.
+        If marginals are not provided, divides by the
+        marginals calculated previously.
+        """
+        if marginals is None:
+            marginals = self.marginals
+
+        for chr1, chr2 in self.data:
+            m2 = marginals[chr1]
+            m1 = marginals[chr2]
+            self.data[(chr1, chr2)].divideByVectors((m1, m2))
+
+    def iterativeCorrection(self, tolerance=1e-2):
+        """
+        Performs iterative correction in place.
+        Tolerance is the maximum allowed relative deviation of the marginals.
+
+        Tries to set biases to self.biases after IC.
+        """
+
+        curPass = 0
+        marginals = np.ones(self.genome.numBins, float)
+        while self._marginalError() > tolerance:
+            m = self.getMarginals(normalizeForIC=True)
+            marginals *= np.concatenate(m)
+            self.divideByMarginals()
+            print "Pass = %d, Error = %lf" % (curPass, self._marginalError())
+            curPass += 1
+        self.biases = marginals
+
+    def removeDiagonal(self, m=1):
+        """
+        Removes diagonal from the data.
+        m=0 is main diagonal, m=1 is main +-1, etc.
+        """
+        for i in self.cisKeys:
+            data = self.data[i].getData()
+            removeDiagonals(data, m)
+            self.data[i].setData(data)
+
+    def removePoorRegions(self, percent=0.5):
+        """
+        Removes "percent" percent of regions with low coverage.
+        """
+        marg = self.getMarginals(normalizeForIC=False)
+        allMarg = np.concatenate(marg)
+        cut = np.percentile(allMarg[allMarg > 0], percent)
+        nonZeroMarg = allMarg[allMarg > 0]
+        numRemoved = (nonZeroMarg < cut).sum()
+        toRemove = [np.nonzero(i < cut)[0] for i in marg]
+        self.setRowsToZero(toRemove)
+        print "Removed %d poor bins" % numRemoved
+
+    def export(self, filename):
+        "Exports all loaded heatmaps and the resolution to an h5dict file"
+        mydict = h5dict(filename)
+        for i in self.data:
+            data = self.data[i].getData()
+            mydict["heatmap%d %d" % i] = data
+        mydict["resolution"] = self.resolution
+
+
+