# TigerFISH

Fluorescence in situ hybridization (FISH) is among the most promising methods for quantifying the location and number of nucleic acids in single cells, but its applications are limited by the lack of reliable, automated image processing for objective quantification.

TigerFISH is a software package that automates all steps in the analysis of FISH images, objectively quantifies and summarizes the data extracted from the images, and generates interactive, web-browser-viewable images of the identified cells, their phases in the cell division cycle (CDC), all spots that were identified within each cell, and the spots that were assigned to nucleic acids. The results of the software are validated in FISH experiments with fluorescently labeled mRNAs previously identified as predominantly expressed in each phase of the CDC. We hope that this software package, together with the FISH images, will facilitate the wider use of FISH and will be further improved by the community.

The MATLAB source code is available on Bitbucket at: https://bitbucket.org/lance_parsons/tigerfish


## Prerequisites

• TigerFISH is implemented in MATLAB and requires the Image Processing Toolbox.
• The HTML viewer is implemented in Python and requires the Mako template module as well as GhostScript (for the generation of PNG files from PDF source).

## Basic Usage

1. Prepare a tab-delimited file listing the input files for each experiment. The columns of the file should be:

Experiment, Region, Cy3_label, Cy3_file, Cy3.5_label, Cy3.5_file, Cy5_label, Cy5_file, DAPI_label, DAPI_file

1. Use parse_experiemnt_direcotory when files for a given experiment are in a single directory:

    EXPERIMENT_SET/
        EXPERIMENT/
            EXPERIMENT_NUMBER-EXPERIMENT_NAME - Position #_CDYE.tiff

DYE should be one of CY3, CY3.5, CY5, DAPI
e.g. NS1-POL30_SUR4_OM45 - Position 1_CCY3.tiff

2. Use generate_experiment_list if the files for each experiment are divided into subdirectories:

    EXPERIMENT_SET/
        EXPERIMENT/
            1/
                EXPERIMENT_NUMBER-EXPERIMENT_NAME - Position #_CDYE.tiff

DYE should be one of CY3, CY3.5, CY5, DAPI
e.g. NS1-POL30_SUR4_OM45 - Position 1_CCY3.tiff

generate_experiment_list(path,
[experiment_numbers=1:1000],
[output_filename='experiment_list.txt'],

2. Analyze experiment set:

main( experiment_list_file, output_dir, [ini_file='my_parameters.ini'], [load_results=False] )


The my_parameters.ini file is an ini-style configuration file specifying the parameters for the analysis. See default_parameters.ini for a list of parameters and documentation.

3. Run generateFishView.py to generate HTML pages to view the output:

python generateFishView.py path/to/results -n 'Experiment Set Name'
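
The file naming convention and the ten-column experiment list from step 1 can be illustrated with a short Python sketch. This is not part of TigerFISH: the function names `parse_image_name`, `build_rows`, and `write_experiment_list` are hypothetical, and the sketch assumes that the Experiment column holds the part of the file name before " - Position", that the list has no header row, and that dye labels are supplied by the user.

```python
import csv
import io
import re

# Pattern for "EXPERIMENT_NUMBER-EXPERIMENT_NAME - Position #_CDYE.tiff".
# CY3.5 is listed before CY3 so the longer dye name is tried first.
FILENAME_RE = re.compile(
    r"^(?P<number>[^-]+)-(?P<name>.+) - Position (?P<region>\d+)"
    r"_C(?P<dye>CY3\.5|CY3|CY5|DAPI)\.tiff$"
)
DYES = ("CY3", "CY3.5", "CY5", "DAPI")


def parse_image_name(filename):
    """Return experiment, region, and dye for a file name, or None if it does not match."""
    m = FILENAME_RE.match(filename)
    if m is None:
        return None
    return {
        "experiment": f"{m['number']}-{m['name']}",  # assumption: used as the Experiment column
        "region": int(m["region"]),
        "dye": m["dye"],
    }


def build_rows(filenames, labels):
    """Group files by (experiment, region) into ten-column rows.

    labels maps a dye name to the label for that channel (user supplied).
    """
    grouped = {}
    for filename in filenames:
        info = parse_image_name(filename)
        if info is None:
            continue  # skip files that do not follow the naming convention
        key = (info["experiment"], info["region"])
        grouped.setdefault(key, {})[info["dye"]] = filename
    rows = []
    for (experiment, region), files in sorted(grouped.items()):
        row = [experiment, region]
        for dye in DYES:
            row += [labels.get(dye, dye), files.get(dye, "")]
        rows.append(row)
    return rows


def write_experiment_list(rows, handle):
    """Write rows as tab-delimited text (no header row, by assumption)."""
    csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)


example = build_rows(
    ["NS1-POL30_SUR4_OM45 - Position 1_CCY3.tiff",
     "NS1-POL30_SUR4_OM45 - Position 1_CDAPI.tiff"],
    {"CY3": "POL30"},
)
buffer = io.StringIO()
write_experiment_list(example, buffer)
```

For real data, the list produced by parse_experiemnt_direcotory or generate_experiment_list should be preferred; the sketch only shows how the naming convention maps onto the columns.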


## Documentation

### Inputs

1. DAPI images - DAPI-stained images are used to identify and segment the cells.
2. Fluorescent images - Separate fluorescence channels are used to identify and quantify the FISH spots.

### Outputs

1. spot_counts.tsv - Tab delimited file containing the spot counts for each cell. The fields (in order) are:

1. Experiment - The name of the experiment (as defined in the experiment_list.txt file)

2. Region - The region number within the experiment (also defined in the experiment_list.txt file)

3. Cell - The cell number within the region. Cell number zero (0) is actually not a cell, but the background.

4. DYE1 - The number of spots above the threshold found in the DYE1 image.

5. DYE2 - The number of spots above the threshold found in the DYE2 image.

6. DYE3 - The number of spots above the threshold found in the DYE3 image.

7. Cell Phase - The phase of the cell division cycle determined by the DAPI intensity:

-1 -> No phase, background, not cell

0 -> other

1 -> G1

2 -> S

3 -> G2

4 -> other

5 -> other

2. experiment_counts.mat - MATLAB data file containing the spot count data

3. EXPERIMENT/ - Directory for each experiment

   1. DYE_spot_intensity_histogram.pdf - A histogram of the spot intensities for each dye
   2. DNA_Content.pdf - A histogram of DNA content per cell along with model fits for each phase
   3. experiment_data.mat - MATLAB data file containing the detailed cell and spot data for this experiment
   4. Joint Distributions: Probabilities - Joint distributions based on probabilistic spot identification
      1. joint_dist_prob_DYE1_DYE2.pdf - PDF plot of probabilistic joint distributions between each dye pair
      2. joint_dist_prob_DYE1_DYE2.mat - Probabilistic joint distribution data (MATLAB format)
   5. Joint Distributions: Thresholds - Joint distributions of spot counts based on thresholding
      1. joint_dist_thresh_DYE1_DYE2.pdf - PDF plot of joint distributions between each dye pair
      2. joint_dist_thresh_DYE1_DYE2.mat - Joint distribution data (MATLAB format)
   6. endCountsProbs.mat - Probabilistic count data
   7. REGION/ - Directory for each region
      1. cell_map.png - Transparent PNG image of cell borders (suitable for use as an overlay)
      2. cell_map_phases.png - Same as cell_map.png except color coded by cell division cycle phase
      3. DAPI_projection.png - Transparent PNG of maximum projection DAPI image (colored in blue)
      4. DYE#_projection.png - Transparent PNGs of each dye channel in green, red, and white
      5. DYE#_spot_image.png - Transparent PNG with a circle indicating each spot. Grey circles are below the intensity threshold. Colored circles are above the intensity threshold.

NOTE: Running generateFishView.py creates an index.html page in the main output directory. Open that file in any web browser to view the results.
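
As an illustration of how the spot_counts.tsv fields can be consumed downstream, the following Python sketch averages the per-dye spot counts by cell division cycle phase. It assumes the file has no header row and that the seven fields appear in the order listed above; the function name `mean_counts_by_phase` and the sample rows are made up for the example.

```python
import csv
import io
from collections import defaultdict

# Phase codes as documented for the "Cell Phase" field.
PHASE_NAMES = {-1: "not a cell", 0: "other", 1: "G1", 2: "S", 3: "G2", 4: "other", 5: "other"}


def mean_counts_by_phase(handle):
    """Return {phase name: [mean DYE1, mean DYE2, mean DYE3]} over real cells."""
    totals = defaultdict(lambda: [0, 0, 0])
    n_cells = defaultdict(int)
    for row in csv.reader(handle, delimiter="\t"):
        experiment, region, cell, dye1, dye2, dye3, phase = row[:7]
        # Cell number 0 is the background, and phase -1 marks "not a cell".
        if int(cell) == 0 or int(phase) == -1:
            continue
        name = PHASE_NAMES.get(int(phase), "other")
        for i, count in enumerate((dye1, dye2, dye3)):
            totals[name][i] += int(count)
        n_cells[name] += 1
    return {name: [t / n_cells[name] for t in totals[name]] for name in totals}


# Made-up rows in the documented column order.
sample = io.StringIO(
    "exp1\t1\t0\t5\t2\t1\t-1\n"
    "exp1\t1\t1\t2\t0\t4\t1\n"
    "exp1\t1\t2\t4\t2\t0\t1\n"
    "exp1\t1\t3\t1\t1\t1\t2\n"
)
summary = mean_counts_by_phase(sample)
```

If the real file turns out to contain a header row, the first row would need to be skipped before the loop.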

## Approach

Cell nuclei are identified first and watershed segmentation is used to separate individual cells. Potential spots are identified by looking for regional maxima and then filtered based on intensity pattern. When using "Singer" probes, many of the identified spots are due to unbound single probes. To separate these cases, TigerFISH estimates spot intensity using one of three methods (described below) and uses a threshold to separate the lower intensity spots.

### Cell Segmentation

Cell identification and segmentation use watershed segmentation of a DAPI-stained image. First, we separate cells from the background using Otsu's method. Then, cell nuclei are identified by an extended-maxima operator that identifies groups of pixels that are significantly brighter than their immediate surroundings. The pixels forming the core of the cell nuclei are further processed by morphological dilation followed by erosion and filling of small holes. The nuclei are then used as seeds in a watershed algorithm to identify the cell borders, based on the autofluorescence of the cytoplasm in the DAPI channel. These simple operations are all implemented by MATLAB functions and result in cell identification consistent with the cells identified visually, as can be seen from the interactive HTML interface.
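
The first step, separating cells from background with Otsu's method, can be sketched in a few lines of plain Python. This is an illustrative re-implementation of the general technique, not the TigerFISH code (which relies on MATLAB functions); the function name `otsu_threshold` and the assumption of integer pixel values in the range 0 to 255 are ours.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes the between-class variance.

    Pixels with value <= t are treated as background, pixels > t as foreground.
    pixels is a flat sequence of integers in range(levels).
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the background class so far
    sum0 = 0    # sum of intensities in the background class so far
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mu0 - mu1) ** 2  # proportional to between-class variance
        if between > best_var:
            best_var, best_t = between, t
    return best_t


# Toy image: dim background around 10-12, bright cells around 200-210.
toy_pixels = [10] * 50 + [12] * 50 + [200] * 20 + [210] * 20
threshold = otsu_threshold(toy_pixels)
cell_mask = [p > threshold for p in toy_pixels]
```

The later steps (extended maxima, morphological cleanup, and the seeded watershed) need an image-processing library and are not reproduced here.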

### Spot Identification

To identify spots in the FISH images, we start by enhancing the image and equalizing the background with a top-hat transformation. We then identify regions with intensity greater than twice that of the background (non-cell region) and treat those as potential spots. To further separate nearby spots, regional maxima are found for each of those regions. For each spot, we then find the layer in the z-stack where the spot is brightest and use it as the best focal layer for that spot. This also helps to separate spots that are close to each other in the xy-plane but not along the z-axis.
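
The focal-layer step can be sketched as follows. The example assumes the z-stack is stored as nested lists indexed as `stack[z][row][col]` and judges brightness by the single pixel at the spot position; both the layout and the function name `best_focal_layer` are assumptions made for illustration.

```python
def best_focal_layer(stack, row, col):
    """Return the index of the z-layer where the pixel at (row, col) is brightest."""
    values = [layer[row][col] for layer in stack]
    return max(range(len(values)), key=values.__getitem__)


# Three 2x2 layers; the pixel at (0, 1) is brightest in the middle layer.
toy_stack = [
    [[1, 2], [3, 4]],
    [[1, 9], [3, 4]],
    [[1, 5], [3, 8]],
]
layer_index = best_focal_layer(toy_stack, 0, 1)
```

Two spots with nearly the same (row, col) position but different best layers would be kept apart by this step.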

Identified potential spots are then filtered based on the expectation that the intensity of a bona fide spot should be highest at its center and decrease for pixels further away from its center. This expectation is implemented by computing a contrast for each spot, which is the ratio of the mean intensity of pixels at the center of the spot to the mean intensity of pixels forming a concentric square with a side of 10 pixels centered on the identified spot. In the absence of noise the contrast should be 1. The noisier a channel is, the higher the contrast threshold has to be; the exact value can be determined from the noise level in that channel in control images. A simple approach is to use the default contrast, which works very well with most images, and adjust it if necessary based on visual inspection of the images and the identified spots in the interactive interface.
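
A minimal sketch of the contrast calculation is given below. It is not the TigerFISH implementation: the function name `spot_contrast`, the choice of a 3x3 block as "the center", and the use of the perimeter of a square with half-side 5 (11 pixels across, approximating the 10-pixel square) are assumptions made to keep the example concrete.

```python
def spot_contrast(image, row, col, center_radius=1, ring_radius=5):
    """Ratio of mean center intensity to mean intensity on a concentric square.

    image is a 2D list of numbers. Pixels falling outside the image are ignored.
    """
    n_rows, n_cols = len(image), len(image[0])
    center, ring = [], []
    for r in range(row - ring_radius, row + ring_radius + 1):
        for c in range(col - ring_radius, col + ring_radius + 1):
            if not (0 <= r < n_rows and 0 <= c < n_cols):
                continue
            dist = max(abs(r - row), abs(c - col))  # Chebyshev distance to the spot center
            if dist <= center_radius:
                center.append(image[r][c])
            elif dist == ring_radius:
                ring.append(image[r][c])
    if not ring or sum(ring) == 0:
        return float("inf")
    return (sum(center) / len(center)) / (sum(ring) / len(ring))


# Flat 15x15 background with a bright 3x3 blob centered at (7, 7).
toy_image = [[10.0] * 15 for _ in range(15)]
for r in range(6, 9):
    for c in range(6, 9):
        toy_image[r][c] = 50.0
contrast = spot_contrast(toy_image, 7, 7)
```

A candidate would then be kept only if its contrast exceeds the configured threshold; a featureless region gives a contrast of 1.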

### Spot Measurement

Once the spots are identified, the software computes their intensity, which is then used to separate dimmer spots, corresponding to noise and single probes, from the brighter spots, corresponding to multiple probes bound to the nucleic acid of interest. We implemented three algorithms for the estimation of spot intensity: (i) Fitting a 2-dimensional Gaussian (in the maximum projection image) to each spot with the subtraction of global background as described by. (ii) Using the same fitting procedure as in (i) but subtracting the local background. (iii) A nonparametric 3-dimensional estimate based on the empirical distribution of pixel intensity. For the third method, the background of a spot is first estimated as the mean intensity of the pixels from the concentric cubes 5 pixels away from the center of the spot identified by the spot finding algorithm. This background is subtracted from the intensity of the pixels enclosed by the sphere. Then the probability that a spot is centered at a pixel is assumed to be proportional to the intensity of that pixel and is estimated as the intensity divided by the sum of all pixel intensities, which normalizes the probability density function to 1. Finally, the spot intensity is estimated as the sum of the pixel intensities multiplied by their corresponding probabilities.
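
The arithmetic of the nonparametric estimate (iii) can be sketched as follows. The function name `nonparametric_intensity` is hypothetical, the pixel lists are assumed to have been extracted already, and clamping negative background-subtracted values to zero is our assumption so that the probabilities stay non-negative.

```python
def nonparametric_intensity(spot_pixels, background_pixels):
    """Probability-weighted spot intensity after local background subtraction.

    spot_pixels: intensities of the pixels enclosed by the spot region.
    background_pixels: intensities of the surrounding shell used as background.
    """
    background = sum(background_pixels) / len(background_pixels)
    # Assumption: values below background are clamped to zero.
    corrected = [max(p - background, 0.0) for p in spot_pixels]
    total = sum(corrected)
    if total == 0:
        return 0.0
    # Each pixel's probability is its share of the total corrected intensity.
    probabilities = [v / total for v in corrected]
    # Expected intensity under that distribution: sum of I_i * p_i.
    return sum(v * p for v, p in zip(corrected, probabilities))


estimate = nonparametric_intensity([12.0, 14.0, 12.0], [10.0, 10.0])
```

Because each pixel is weighted by its own share of the total, the estimate equals the sum of squared corrected intensities divided by their sum, which emphasizes the bright core of the spot.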

Identified spots may be noise, single probes, or multiple hybridized probes. It is assumed that most single probes are non-hybridized probes that were not washed out, and thus it is desirable to separate the brighter, multi-probe signals from the single probes and noise. Once the spot intensities are estimated, the spots corresponding to noise and single probes can be separated from the spots corresponding to nucleic acids by thresholding the estimated spot intensities. For highly abundant mRNAs, the distribution of spot intensities is multi-modal, with the first mode corresponding to single probes, the second mode to nucleic acids with two bound probes, and so on, as shown in figure 2A. For less abundant mRNAs, however, the vast majority of identified spots correspond to single probes, so the modes for nucleic acids with multiple probes are very small and often indistinct. In all cases, one may choose a suitable threshold based on visual inspection of the spots in our interactive image viewer.

### Robust Probabilistic Spot Analysis

In addition to hard thresholding, our package provides functionality for assigning to each identified spot a probability of corresponding either to noise/single probe or to a nucleic acid. This probabilistic approach relies on a null distribution, which may be computed in two different ways: (i) Hybridizing the probes in cells in which the target gene has been knocked out. The distribution of spot intensities for that gene within the cytoplasm of such cells can provide a well-controlled null distribution. (ii) As an easier alternative, the null distribution can be computed from the spots outside of cells, as those should correspond only to noise and single probes. This second alternative works well only when enough spots are detected outside of the cells, and it can be undermined by environment-dependent changes in the fluorescent properties of the fluorophores used to label the probes.

As mentioned before, the distributions of spot intensities ideally should have well-defined modes corresponding to (i) single probes and noise and (ii) nucleic acids labeled with multiple probes. If the separation between these modes is not complete, using a hard threshold is likely to introduce false positive and false negative assignments (single non-hybridized probes will be counted as mRNAs, or mRNAs will not be counted). The bigger the overlap between the two modes, the bigger the error in mRNA quantification. A second problem with a hard threshold for the intensity of spots is that the position of the threshold can depend strongly on numerous parameters such as incident light intensity, efficiency of probe labeling, spectral filters, fluorophore quantum efficiency, and even sample preparation. Therefore, establishing a good threshold might require a set of control experiments specific to the equipment and to every set of samples, or a human decision (with its potential bias) about the threshold position in every single experiment.

To mitigate these problems, we developed a simple approach based on the conditional probability that the $$j^{th}$$ spot is an mRNA given its intensity $$I_j$$, $$p(X_j=1|I_j)$$. The key assumption behind our approach is that an empirical null distribution can be computed. Assuming that there are no mRNAs (or very few mRNAs) outside of cells, such a distribution can be computed from the distribution of spot intensities outside of cells. This assumption is supported by the data in most experiments; in the experiments where it is violated (because of cell bursting during cell wall digestion and immobilization), the problem can be avoided by using extracellular spots from the experiments that worked well. When the assumption is correct, all spots outside of cells correspond to single probes, and the empirical cumulative distribution of their intensities characterizes the probability for a spot with a given intensity to correspond to a single probe. For example, a spot within a cell whose intensity is higher than the intensities of all spots outside of cells has a probability of being a single probe equal to $$1/N$$, where $$N$$ is the number of spots outside of cells. If many experiments are performed using the same equipment and sample preparation, all extracellular spots (from all experiments) for a dye (such as Cy3) can be pooled and used as the null distribution of intensities of single probes. Formally, the conditional probability for the $$j^{th}$$ spot to be an mRNA can be written as $$p(X_j=1|I_j) = \mathcal D(I_j|I_{\omega})$$. Here $$\mathcal D(I_{\omega})$$ is the empirical cumulative distribution for the set of spots $$(\omega)$$ that are outside of cell boundaries. Using $$p(X_j=1|I_j)$$ we compute both Bonferroni-corrected p-values and q-values that can be used to select the spots likely to correspond to mRNAs while keeping the false discovery rate (FDR) below a defined level, such as 5%. The FDR threshold can be set at the desired level through the input parameters of our software.
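
The idea of scoring each spot against an empirical null can be sketched in plain Python. This is a hedged illustration rather than the package's code: the function names are hypothetical, the single-probe probability is floored at $$1/N$$ to match the example above, and the Benjamini-Hochberg adjustment is swapped in as one common way to obtain q-values, since the text does not say which procedure is used.

```python
from bisect import bisect_left


def single_probe_pvalues(intensities, null_intensities):
    """Probability that each spot is a single probe, from the empirical null.

    The value is the fraction of null (extracellular) spots at least as bright
    as the spot, floored at 1/N for spots brighter than every null spot.
    """
    null_sorted = sorted(null_intensities)
    n = len(null_sorted)
    pvalues = []
    for x in intensities:
        n_at_least = n - bisect_left(null_sorted, x)  # null spots with intensity >= x
        pvalues.append(max(n_at_least, 1) / n)
    return pvalues


def bonferroni(pvalues):
    """Bonferroni-corrected p-values, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed stand-in for q-values)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


null_spots = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]    # made-up extracellular intensities
cell_spots = [11, 5.5, 0.5]                     # made-up intracellular intensities
pvals = single_probe_pvalues(cell_spots, null_spots)
is_mrna = [q <= 0.05 for q in bh_qvalues(pvals)]
```

With only ten null spots the smallest achievable probability is 0.1, so no spot passes a 5% FDR cut in this toy example; this illustrates why pooling extracellular spots across experiments helps.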

### Quantifying the Number of mRNAs per Cell

In the previous subsection, we outlined an approach for quantifying the probability that the $$j^{th}$$ spot is an mRNA, $$p(X_j=1|I_j)$$. Next, we use these per-spot probabilities to compute the marginal distribution of the number of mRNAs of the $$k^{th}$$ gene in the $$i^{th}$$ cell, that is, the probability that the $$i^{th}$$ cell contains $$n$$ mRNAs from the $$k^{th}$$ gene, $$p(Y_{ik}=n)$$. Assuming that the spots are independent of each other, $$Y_{ik}$$ is a sum of independent Bernoulli variables with success probabilities $$p(X_j=1|I_j)$$, so $$p(Y_{ik}=n)$$ follows a Poisson binomial distribution. Given independence of the error in identifying the mRNAs for different genes, the joint probability for the $$k^{th}$$ and the $$l^{th}$$ genes can be computed as the product of the corresponding marginal probabilities, $$p(Y_{ik}=n, Y_{il}=m) = p(Y_{ik}=n) p(Y_{il}=m)$$.
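
Under the independence assumption, the number of mRNAs in a cell is a sum of independent 0/1 indicators, one per spot, so its distribution can be built up one spot at a time. The sketch below shows that calculation and the product form of the joint distribution; the function names `count_distribution` and `joint_distribution` are hypothetical and the probabilities are made up.

```python
def count_distribution(spot_probs):
    """Distribution of the number of mRNAs given per-spot mRNA probabilities.

    Returns a list dist where dist[n] is the probability of exactly n mRNAs.
    """
    dist = [1.0]  # with no spots, the count is 0 with probability 1
    for p in spot_probs:
        new = [0.0] * (len(dist) + 1)
        for n, mass in enumerate(dist):
            new[n] += mass * (1.0 - p)   # this spot is not an mRNA
            new[n + 1] += mass * p       # this spot is an mRNA
        dist = new
    return dist


def joint_distribution(dist_k, dist_l):
    """Joint distribution of two genes assuming independent errors.

    joint[n][m] is the probability of n mRNAs of gene k and m mRNAs of gene l.
    """
    return [[pk * pl for pl in dist_l] for pk in dist_k]


gene_k = count_distribution([0.5, 0.5])   # two ambiguous spots
gene_l = count_distribution([0.9])        # one likely mRNA
joint = joint_distribution(gene_k, gene_l)
```

The expected count is simply the sum of the per-spot probabilities, which gives a quick sanity check on the computed distribution.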
