How to use GEMINI

GEMINI is an open-source bioinformatics tool and website, written in Python, that makes nearest-neighbor search over gene-expression data fast and practical.

Vantage-point trees

GEMINI uses a vantage-point tree (vp-tree) to speed up search. A vantage-point tree is a binary search tree in which all data instances within a certain distance tau of the root node are placed in the left sub-tree, and all data instances farther than tau are placed in the right sub-tree. Each sub-tree is partitioned recursively in the same way until each node contains one data point.
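
The partitioning rule above can be sketched in a few lines of Python. This is a minimal illustration, not GEMINI's own implementation: the names `Node`, `nearest`, and `euclidean` are hypothetical, and choosing a random vantage point with tau set to the median distance is a common convention assumed here rather than something taken from gemini.py.

```python
import heapq
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class Node:
    """One vantage point, its threshold tau, and the two sub-trees."""

    def __init__(self, points, rng):
        points = list(points)
        # Assumption: pick a random vantage point and remove it from the pool.
        self.vp = points.pop(rng.randrange(len(points)))
        self.tau = 0.0
        self.left = self.right = None
        if not points:
            return
        # Assumption: tau is the median distance from the vantage point.
        dists = [euclidean(self.vp, p) for p in points]
        self.tau = sorted(dists)[len(dists) // 2]
        inside = [p for p, d in zip(points, dists) if d <= self.tau]
        outside = [p for p, d in zip(points, dists) if d > self.tau]
        if inside:
            self.left = Node(inside, rng)    # within tau of the vantage point
        if outside:
            self.right = Node(outside, rng)  # farther than tau


def nearest(root, query, k):
    """Return the k closest points as (distance, point) pairs, nearest first."""
    best = []  # max-heap via negated distances, holds at most k entries
    stack = [root]
    while stack:
        node = stack.pop()
        if node is None:
            continue
        d = euclidean(query, node.vp)
        if len(best) < k:
            heapq.heappush(best, (-d, node.vp))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, node.vp))
        # Search radius: unbounded until k candidates have been collected.
        radius = -best[0][0] if len(best) == k else math.inf
        # Triangle inequality: skip a side that cannot hold a closer point.
        if d - radius <= node.tau:
            stack.append(node.left)
        if d + radius > node.tau:
            stack.append(node.right)
    return sorted((-neg_d, p) for neg_d, p in best)


items = [(0, 0), (12, 3), (5, 4), (25, 25)]
root = Node(items, random.Random(0))
for dist, point in nearest(root, (1, 1), k=2):
    print(point, round(dist, 3))
```

The pruning test is what makes the structure faster than a linear scan: a sub-tree is visited only when the ball around the query with the current search radius can overlap the region that sub-tree covers.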

Using the VPTree class

GEMINI uses a Python implementation of vantage-point trees for efficient querying. gemini.py contains the node and tree classes that support the data structure.

To construct a tree, use the VPTree constructor:

tree = VPTree(points)

where points is a list of NDPoint objects. To search the tree, call get_nearest_neighbors(tree, query, k), which returns a list of the k nearest neighbors of query.

Example: using a vp-tree on 2-dimensional data points:

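>>> from gemini import VPTree, NDPoint, get_nearest_neighbors
>>> # Wrap each coordinate tuple in an NDPoint, passing its list index as the second argument
>>> items = [(0, 0), (12, 3), (5, 4), (25, 25)]
>>> points = [NDPoint(x, i) for i, x in enumerate(items)]
>>> tree = VPTree(points)
>>> # Ask for the 4 nearest neighbors of the point (1, 1)
>>> query = NDPoint((1, 1))
>>> neighbors = get_nearest_neighbors(tree, query, k=4)
>>> for _, n in neighbors:
...     print(n.x)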





HDF5 File Format

GEMINI stores raw gene expression profiles in the HDF5 file format (.hdf5) and uses the h5py Python library to read and write files. Data in an HDF5 file is organized into datasets, each of which is stored and accessed by keyword. Data tables in GEMINI have three datasets:

  • 'Samples' stores an array of sample ID numbers (e.g. "TCGA-AR-0076...")
  • 'Feature' stores an array of gene IDs (e.g. "BRCA1")
  • 'Data' stores a Samples-by-Feature matrix of gene expression values (one row per sample, one column per gene)
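
As a rough sketch of this layout, the snippet below writes a toy table with h5py and reads it back by keyword. The sample IDs, the second gene ID, the expression values, and the file name are made up for illustration, and storing the IDs as fixed-length byte strings is an assumption about the encoding, not something confirmed for GEMINI's own files.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical toy table: 3 samples by 2 genes (IDs and values are made up).
samples = np.array(["sample-1", "sample-2", "sample-3"], dtype="S")
features = np.array(["BRCA1", "TP53"], dtype="S")
data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

path = os.path.join(tempfile.mkdtemp(), "toy_table.hdf5")

# Write the three datasets under the keywords described above.
with h5py.File(path, "w") as f:
    f.create_dataset("Samples", data=samples)
    f.create_dataset("Feature", data=features)
    f.create_dataset("Data", data=data)

# Read them back by keyword; [...] loads a dataset into a NumPy array.
with h5py.File(path, "r") as f:
    sample_ids = [s.decode() for s in f["Samples"][...]]
    gene_ids = [g.decode() for g in f["Feature"][...]]
    values = f["Data"][...]

print(sample_ids, gene_ids, values.shape)
```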

GEMINI Web Interface

To search for nearest neighbors, create an .hdf5 query file using the same dataset names as above ('Samples', 'Feature', and 'Data'). A query should contain just one sample ID and one row in the 'Data' dataset. Example dataset files for building the vp-tree and example query files are provided in the repository. GEMINI uses the first item in the query file and searches it against a vp-tree constructed from the original dataset.

From the main page, select a database, then upload your query file. Click search, and the website will show the 10 nearest neighbors in the database along with their numeric distances, and a heatmap of the top 10 principal components for each of the results.
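
A one-sample query file could be produced along these lines. This is only a sketch: `write_query` is a hypothetical helper, not part of GEMINI, the IDs and values are made up, and the code follows the 'Samples' spelling used in the dataset list, so adjust the key if your database uses a different name.

```python
import os
import tempfile

import h5py
import numpy as np


def write_query(path, sample_id, gene_ids, values):
    """Write a one-sample query file with 'Samples', 'Feature', and 'Data'."""
    # Force a single row: shape (1, number_of_genes).
    values = np.asarray(values, dtype=float).reshape(1, -1)
    if values.shape[1] != len(gene_ids):
        raise ValueError("need exactly one expression value per gene ID")
    with h5py.File(path, "w") as f:
        f.create_dataset("Samples", data=np.array([sample_id], dtype="S"))
        f.create_dataset("Feature", data=np.array(gene_ids, dtype="S"))
        f.create_dataset("Data", data=values)


# Hypothetical query: one sample measured on two genes.
query_path = os.path.join(tempfile.mkdtemp(), "toy_query.hdf5")
write_query(query_path, "query-1", ["BRCA1", "TP53"], [2.5, 3.5])

with h5py.File(query_path, "r") as f:
    query_shape = f["Data"].shape
    n_samples = f["Samples"].shape[0]

print(query_shape, n_samples)
```

The gene IDs in a query presumably need to match the features of the database being searched; the helper above only checks that the row length agrees with the ID list.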


Command Line Builder

The command-line interface to GEMINI uses subcommands and options. For inline help, type python gemini.py at the command prompt.

GEMINI options:

- `--version` prints the version number
- `-v` increases verbosity to INFO level
- `-vv` increases verbosity to DEBUG level
- `build` builds a vantage-point tree from a dataset
- `search` searches a tree built with the `build` command

build options:

- `hdfFilename` the hdf5 data file containing the dataset (required)
- `-o outputFile` saves the pickled vantage-point tree in `outputFile` rather than the default `vptree.p` (optional)

search options:

- `queryFilename` the hdf5 query file (required)
- `treeFilename` the pickle file containing the vp-tree (required)
- `K` the number of nearest neighbors returned (optional, default is 5)
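
To make the shape of this interface concrete, here is a sketch of how such subcommands and options could be declared with argparse. It is not GEMINI's actual parser: `make_parser` is a hypothetical name, the version string is a placeholder, and treating `K` as an optional trailing positional argument is an assumption, since the list above does not say how it is passed.

```python
import argparse


def make_parser():
    """Build a parser mirroring the options listed above (illustrative only)."""
    parser = argparse.ArgumentParser(prog="gemini.py")
    parser.add_argument("--version", action="version",
                        version="%(prog)s 0.0 (placeholder)")
    # -v gives 1 (INFO), -vv gives 2 (DEBUG).
    parser.add_argument("-v", action="count", default=0, dest="verbosity")
    sub = parser.add_subparsers(dest="command")

    build = sub.add_parser("build", help="build a vantage-point tree")
    build.add_argument("hdfFilename")
    build.add_argument("-o", dest="outputFile", default="vptree.p")

    search = sub.add_parser("search", help="search a previously built tree")
    search.add_argument("queryFilename")
    search.add_argument("treeFilename")
    # Assumption: K is an optional trailing positional argument.
    search.add_argument("K", nargs="?", type=int, default=5)
    return parser


parser = make_parser()
build_args = parser.parse_args(["-vv", "build", "data.hdf5", "-o", "tree.p"])
search_args = parser.parse_args(["search", "query.hdf5", "tree.p"])
print(build_args)
print(search_args)
```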


The easiest way to run GEMINI is to use our virtualenv environment, which has each of the required modules installed. The list of requirements is in the requirements.txt file, which was auto-generated by pip freeze.

Contact us

For questions or issues, contact Patrick Flaherty.


This work is licensed under a Creative Commons Attribution 4.0 International License.