Commits

Qian Zhu committed d484156

Added SeekPrep usage doc
Cleaned up SeekPrep.cpp

  • Parent commits 918b587

Files changed (9)

File src/seeknetwork.h

  * otherwise the size of the array)
  * \li Byte #9 and onward: \a S times \a N bytes specifying the array content
  *
+ * IMPORTANT:
+ * <b>Outgoing messages are always encoded using bytes in the Little Endian order.</b>
+ *
+ *
  * \section sec_in Incoming messages
  *
  * On the receiving end, CSeekNetwork also supports the receiving of a \c char array (or a \c string) or a \c float array.
  * \li Byte #5 and onward: \a NF times 4 bytes specifying the \c float array.
  *
  * IMPORTANT:
- * <b>
- * Outgoing messages are always encoded using bytes in the Little Endian order.
- *
- * For an incoming message to be properly recognized, the message should also be encoded with bytes in the Little Endian order.
+ * <b>For an incoming message to be properly recognized, the incoming message should also be encoded with bytes in the Little Endian order.
  * </b>
  */
 class CSeekNetwork{
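
The byte-order note added above can be illustrated with a small sketch. It is not taken from CSeekNetwork: the helper names (AppendLE32, AppendFloatLE, ReadLE32, ReadFloatLE) are hypothetical, and it assumes a 32-bit float. It shows one way to write and read values in Little Endian order regardless of the host's native byte order.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not part of CSeekNetwork): append a 32-bit unsigned
// integer to a buffer, least significant byte first.
void AppendLE32(std::vector<unsigned char>& out, std::uint32_t v) {
    for (int i = 0; i < 4; ++i)
        out.push_back(static_cast<unsigned char>((v >> (8 * i)) & 0xFFu));
}

// Hypothetical helper: append a float as its 4-byte bit pattern, little-endian.
void AppendFloatLE(std::vector<unsigned char>& out, float f) {
    static_assert(sizeof(float) == 4, "sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    AppendLE32(out, bits);
}

// Hypothetical helper: decode four little-endian bytes into an integer.
std::uint32_t ReadLE32(const unsigned char* p) {
    return static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Hypothetical helper: decode four little-endian bytes into a float.
float ReadFloatLE(const unsigned char* p) {
    std::uint32_t bits = ReadLE32(p);
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

Building the bytes with shifts rather than casting pointers keeps the result the same on big-endian and little-endian hosts.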

File tools/SeekMiner/SeekMiner.cpp

 	bool bOutputWeightComponent = !!sArgs.output_w_comp_flag;
 	bool bSimulateWeight = !!sArgs.simulate_w_flag;
 
-	bool bOutputWeightComponent = !!sArgs.output_w_comp_flag;
-	bool bSimulateWeight = !!sArgs.simulate_w_flag;
-
 	// Random Number Generator Initializations
 	gsl_rng_env_setup();
 

File tools/SeekMiner/stdafx.cpp

 
 /*!
  * \page SeekMiner SeekMiner
- * 
- * SeekMiner returns a gene-ranking based on the coexpressions to the user-specified
- * query genes. It finds relevant datasets by using one of the many dataset weighting
- * algorithms, including the query-coexpression weighting, the order statistics 
- * weighting, etc. Afterward, it performs a weighted integration of coexpressions
- * using the computed dataset weights.
- * The search algorithms employed by Seek are designed to be quick and efficient, and
- * they support the real-time weight calculations for thousands of microarray
- * datasets.
+ *
+ * SeekMiner accepts a set of genes as query in order to perform a weighted compendium 
+ * search for additional genes that are coexpressed with the query genes. 
+ * SeekMiner finds and integrates relevant datasets by using one of the many dataset weighting
+ * algorithms, including the cross-validated query-coexpression weighting, the order statistics 
+ * weighting, etc. 
+ * These search algorithms are designed to be quick and efficient, and enable fast weight computations
+ * for thousands of microarray datasets.
  *
  * \section sec_usage Usage
  * 
  * SeekMiner -x <dset_platform_map> -i <gene_map> -q <query> -P <platform_dir> -p <prep_dir> -n <num_db>
  * -d <db_dir> -Q <quant> -o <output_dir> -V <weight_method> -z <distance_measure> -m [-D <search_dset>]
  * \endcode
- * This performs coexpression mining for a list of queries in the file \c query,
+ * This performs the coexpression search for a list of queries,
  * and outputs the gene-ranking and the dataset weights in the \c output_dir.
  *
- * \subsubsection sec_output Output
- *
- * The output files are divided according to queries.
- * Starting with the first query (with a file name 0), its final results
- * will consist of three files: \c 0.query, 0.dweight, 0.gscore.
- * \li The file base name (0) indicates the query index in the list.
- * \li The \c 0.query stores the space-delimited query gene-set in text.
- * \li The \c 0.dweight stores the weightings of datasets as a binary one-dimensional float vector
- * (see SeekEvaluator for displaying a DWEIGHT extension file).
- * \li The \c 0.gscore stores the gene scores as a binary one-dimensional float vector
- * (see SeekEvaluator for displaying a GSCORE extension file).
  *
  * \subsubsection sec_weight Weighting Datasets
  *
  * \li Equal weighting (\c EQUAL), where all datasets are weighted equally.
  * \li Order statistics integration (\c ORDER_STAT), which is outlined in Adler et al (2009).
  * This method computes a P-value statistics by comparing the rank of correlation across datasets to the
- * ranks that would have been generated a null distribution (where correlations are randomly scattered
- * and all ranks appear equally likely).
+ * ranks that would have been generated by a null distribution (where correlations are assumed to be
+ * randomly scattered and all ranks are equally likely).
+ * 
+ * The use of \c -V \c CV is highly recommended.
  *
  * \subsubsection sec_distance Distance Measure and Transformations
  *
  * Users can select between Pearson correlations (\c -z \c pearson) or z-scores of Pearson (\c -z \c z_score).
  * Z-scores is the recommended choice because it normalizes the correlation distribution to a standard normal
  * distribution that can be compared across datasets. In addition, SeekMiner provides the following
- * transformations on z-scores to allow further boosting of signals:
+ * transformations on z-scores to further allow boosting of signals:
  *
  * \li \c --score_cutoff. Cuts off z-scores at a specified value. Z-scores that fall below the cut-off are assigned zero.
- * \li \c --norm_subavg. Subtracts each gene's average z-score to prevent highly connected genes from influencing the z-score of a gene pair
+ * \li \c --norm_subavg. Subtracts each gene's average z-score. This prevents highly connected genes from being constantly returned with top ranks in the ranking.
  * \li \c --norm_subavg_plat. Normalizes z-score by subtracting the average across the platform and dividing by its standard deviation.
  * This is designed to handle potential platform biases on the z-scores.
- * \li \c --square_z. Squaring the z-score to further boost highly correlated gene-pairs.
+ * \li \c --square_z. Squaring the z-score. This is another way to boost the highly correlated gene-pairs.
+ * 
  * It is highly recommended to enable \c --norm_subavg.
  *
  * \subsubsection sec_search Search Datasets
  * If \c -D is used, the search datasets must be selected from the available
  * datasets defined in \c dset_platform_map.
  *
+ * \subsubsection sec_output Output
+ *
+ * The output files are divided according to queries.
+ * Starting with the first query (with a file name 0), its final results
+ * will consist of three files: \c 0.query, \c 0.dweight, \c 0.gscore.
+ * \li The file base name (0) indicates the query index in the list.
+ * \li The \c 0.query stores the space-delimited query gene-set in text.
+ * \li The \c 0.dweight stores the weightings of datasets as a binary one-dimensional float vector
+ * (see \ref SeekEvaluator for displaying a DWEIGHT extension file).
+ * \li The \c 0.gscore stores the gene scores as a binary one-dimensional float vector
+ * (see \ref SeekEvaluator for displaying a GSCORE extension file).
+ *
  * \subsubsection sec_files Query-independent search setting files and directories
  *
  * \c -x \c dset_platform_map
  * \li \c all_platforms.gplatstdev. the platform z-score standard deviation
  * \li \c all_platforms.gplatorder. the order of platforms
  *
- * These binary files are generated by SeekPrep. The specification of this directory is
+ * These binary files are generated by \ref SeekPrep. The specification of this directory is
  * necessary for \c --norm_subavg_plat.
  *
  * \c -p \c prep_dir
  * \li Gene average (GAVG files): indicates the average z-score of each gene in a dataset
  *
  * There should be one pair of these files for <b>every</b> dataset that is specified
- * in \c dset_platform_map. Generated by SeekPrep.
+ * in \c dset_platform_map. Generated by \ref SeekPrep.
  *
  * \c -d \c db_dir
  *
  *
  * Directory that contains the SINFO files, which list a dataset's average z-score between all pairs of genes
  * and the standard deviation. If this directory is provided, there should be one SINFO file for <b>
- * every</b> dataset in \c dset_platform_map. Generated by SeekPrep.
+ * every</b> dataset in \c dset_platform_map. Generated by \ref SeekPrep.
  *
  * 
  * \subsection ssec_usage_detailed Detailed Usage
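
The z-score transformations described in the documentation above (--score_cutoff, --norm_subavg, --square_z) can be sketched roughly as follows. This is not SeekMiner's code: the ZTransform struct and TransformZ function are hypothetical, and the order in which the steps are applied is an assumption made only for illustration.

```cpp
#include <vector>

// Hypothetical stand-ins for three of the documented options.
struct ZTransform {
    bool subtractAvg = false;  // mirrors --norm_subavg
    bool useCutoff = false;    // mirrors --score_cutoff
    float cutoff = 0.0f;       // cut-off value used when useCutoff is set
    bool square = false;       // mirrors --square_z
};

// Applies the enabled transformations to one gene's z-scores.
// geneAvg is that gene's average z-score (what a GAVG file would supply).
// Assumed order: subtract average, then cut off, then square.
std::vector<float> TransformZ(const std::vector<float>& z, float geneAvg,
                              const ZTransform& t) {
    std::vector<float> out;
    out.reserve(z.size());
    for (float v : z) {
        if (t.subtractAvg) v -= geneAvg;            // remove the gene's baseline
        if (t.useCutoff && v < t.cutoff) v = 0.0f;  // scores below cut-off become zero
        if (t.square) v *= v;                       // boost strongly correlated pairs
        out.push_back(v);
    }
    return out;
}
```

With subtractAvg, useCutoff (cut-off 0) and square all enabled and a gene average of 1, the inputs 3.0 and 0.5 become 4.0 and 0.0 respectively.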

File tools/SeekPrep/SeekPrep.cpp

 	int numThreads = omp_get_max_threads();
 
 	/* PCL mode */
-	if(sArgs.pcl_flag==1){
+	if(sArgs.pclbin_flag==1){
 
 		if(sArgs.sinfo_flag==1){
-			string pcl_dir = sArgs.pcl_dir_arg;
-			vector<string> pcl_list;
-			CSeekTools::ReadListOneColumn(sArgs.pcl_list_arg, pcl_list);
+			string fileName = CMeta::Basename(sArgs.pclinput_arg);
+			string fileStem = CMeta::Deextension(fileName);
+			char outFile[125];
+			sprintf(outFile, "%s/%s.sinfo", sArgs.dir_out_arg, fileStem.c_str());
 
-			for(i=0; i<pcl_list.size(); i++){
-				string pclfile = pcl_dir + "/" + pcl_list[i] + ".bin";
+			string pclfile = sArgs.pclinput_arg;
+			if(!CMeta::IsExtension(pclfile, ".bin")){
+				fprintf(stderr, "Input file is not bin type!\n");
+				return 1;
+			}
+			
+			CPCL pcl;
+			if(!pcl.Open(pclfile.c_str())){
+				fprintf(stderr, "Error opening file\n");
+			}
+			CMeasurePearNorm pn;
+			CDat Dat;
+			int numG = pcl.GetGeneNames().size();
 
-				char outFile[125];
-				sprintf(outFile, "%s/%s.sinfo", sArgs.dir_out_arg, pcl_list[i].c_str());
-				fprintf(stderr, "H0\n");
+			vector<int> veciGenes;
+			veciGenes.resize(numG);
+			for (j = 0; j < veciGenes.size(); ++j)
+				veciGenes[j] = j;
+			int iOne, iTwo;
+			Dat.Open(pcl.GetGeneNames());
 
-				CPCL pcl;
-				//if(!pcl.Open(pclfile.c_str(), 2, false, false)){
-				if(!pcl.Open(pclfile.c_str())){
-					fprintf(stderr, "Error opening file\n");
-				}
-				/*
-				int totNumExperiments = pcl.GetExperiments() - 2;
-				if(totNumExperiments<=2){
-					vector<float> vv;
-					vv.resize(2);
-					vv[0] = CMeta::GetNaN();
-					vv[1] = CMeta::GetNaN();
-					CSeekTools::WriteArray(outFile, vv);
+			for (k = 0; k < Dat.GetGenes(); ++k)
+				for (j = (k + 1); j < Dat.GetGenes(); ++j)
+					Dat.Set(k, j, CMeta::GetNaN());
+			for (k = 0; k < numG; ++k) {
+				if ((iOne = veciGenes[k]) == -1)
 					continue;
-				}
-
-				vector<ushort> presentIndex;
-				for(j=0; j<vecstrGenes.size(); j++){
-					ushort g = pcl.GetGene(vecstrGenes[j]);
-					if(CSeekTools::IsNaN(g)) continue; //gene does not exist in the dataset
-					presentIndex.push_back(g);
-				}
-				map<ushort, float> mean_d;
-				map<ushort, float> stdev_d;
-				
-				vector<float> all_correlations;
-
-				*/
-				CMeasurePearNorm pn;
-				CDat Dat;
-				int numG = pcl.GetGeneNames().size();
-
-				fprintf(stderr, "H1\n");
-
-				vector<int> veciGenes;
-				veciGenes.resize(numG);
-				for (j = 0; j < veciGenes.size(); ++j)
-					veciGenes[j] = j;
-				int iOne, iTwo;
-				fprintf(stderr, "H2\n");
-				Dat.Open(pcl.GetGeneNames());
-
-				for (k = 0; k < Dat.GetGenes(); ++k)
-					for (j = (k + 1); j < Dat.GetGenes(); ++j)
-						Dat.Set(k, j, CMeta::GetNaN());
-				fprintf(stderr, "H3\n");
-				for (k = 0; k < numG; ++k) {
-					if ((iOne = veciGenes[k]) == -1)
-						continue;
-					float *adOne = &pcl.Get(iOne)[2];
-					for (j = (k + 1); j < numG; ++j){
-						if ((iTwo = veciGenes[j]) != -1){
-							float x = (float) pn.Measure(adOne,
-								pcl.GetExperiments()-2, &pcl.Get(iTwo)[2],
-								pcl.GetExperiments()-2);
-							Dat.Set(k, j, x);
-							//fprintf(stderr, "%d %d %.5f\n", k, j, x);
-						}
+				float *adOne = &pcl.Get(iOne)[0];
+				for (j = (k + 1); j < numG; ++j){
+					if ((iTwo = veciGenes[j]) != -1){
+						float x = (float) pn.Measure(adOne,
+							pcl.GetExperiments(), &pcl.Get(iTwo)[0],
+							pcl.GetExperiments());
+						Dat.Set(k, j, x);
+						//fprintf(stderr, "%d %d %.5f\n", k, j, x);
 					}
 				}
-				fprintf(stderr, "H4\n");
+			}
 			
-				double gmean, gstdev;
-				size_t iN;
+			double gmean, gstdev;
+			size_t iN;
 
-				Dat.AveStd(gmean, gstdev, iN);
-				fprintf(stderr, "%.5f %.5f\n", (float) gmean, (float) gstdev);
-				
-				/*
-				for(j=0; j<presentIndex.size(); j++){
-					float *val = pcl.Get(presentIndex[j]);
-					vector<float> rowVal;
-					for(k=2; k<pcl.GetExperiments(); k++)
-						rowVal.push_back(val[k]);
-					float mean = 0;
-					float stdev = 0;
-					for(k=0; k<rowVal.size(); k++)
-						mean+=rowVal[k];
-					mean/=rowVal.size();
-					for(k=0; k<rowVal.size(); k++)
-						stdev += (rowVal[k] - mean) * (rowVal[k] - mean);
-					stdev /= rowVal.size();
-					stdev = sqrt(stdev);
-					mean_d[presentIndex[j]] = mean;
-					stdev_d[presentIndex[j]] = stdev;
-				}
-
-				for(j=0; j<presentIndex.size(); j++){
-					float *val = pcl.Get(presentIndex[j]);
-					vector<float> rowVal;
-					for(k=2; k<pcl.GetExperiments(); k++)
-						rowVal.push_back(val[k]);
-					float mean = mean_d[presentIndex[j]];
-					float stdev = stdev_d[presentIndex[j]];
-
-					for(k=j+1; k<presentIndex.size(); k++){
-						float *val2 = pcl.Get(presentIndex[k]);
-						vector<float> rowVal2;
-						for(l=2; l<pcl.GetExperiments(); l++)
-							rowVal2.push_back(val2[l]);
-						float mean2 = mean_d[presentIndex[k]];
-						float stdev2 = stdev_d[presentIndex[k]];
-						float r = 0;
-						for(l=0; l<rowVal.size(); l++)
-							r+=(rowVal[l] - mean)*(rowVal2[l] - mean2);
-						r /= stdev*stdev2;
-						r /= rowVal.size();
-						if(isinf(r) || isnan(r))
-							continue;
-
-						if(fabs(r) >= 0.9999f){
-							r*=0.9999f;
-						}
-						
-						r = 0.5 * log((1.0+r) / (1.0-r));
-
-						if(!(r<=5.0 && r>=-5.0))
-							continue;
-
-
-						fprintf(stderr, "%d %d %.5f\n", j, k, r);
-						all_correlations.push_back(r);
-					}
-				}
-
-				vector<float>::const_iterator iterF;
-				double global_mean = 0;
-				double global_stdev = 0;
-				for(iterF = all_correlations.begin(); iterF!=all_correlations.end(); iterF++)
-					global_mean+=(double) (*iterF);
-				global_mean /= (double) all_correlations.size();
-				for(iterF = all_correlations.begin(); iterF!=all_correlations.end(); iterF++)
-					global_stdev+=((double) (*iterF) - global_mean) * ((double) (*iterF) - global_mean);
-				global_stdev /= (double) (all_correlations.size() - 1);
-				global_stdev = sqrt(global_stdev);
-				float gstdev = (float) global_stdev;
-				float gmean = (float) global_mean;
-				if(all_correlations.size()==0){
-					gstdev = CMeta::GetNaN();
-					gmean = CMeta::GetNaN();
-				}
-				fprintf(stderr, "%.5f %.5f\n", gmean, gstdev);
-				*/
-				vector<float> vv;
-				vv.resize(2);
-				vv[0] = (float) gmean;
-				vv[1] = (float) gstdev;
-				CSeekTools::WriteArray(outFile, vv);
-			}
+			Dat.AveStd(gmean, gstdev, iN);
+			fprintf(stderr, "%.5f %.5f\n", (float) gmean, (float) gstdev);
+			vector<float> vv;
+			vv.resize(2);
+			vv[0] = (float) gmean;
+			vv[1] = (float) gstdev;
+			CSeekTools::WriteArray(outFile, vv);
 
 		}
 
 		//if calculating gene variance per dataset
 		else if(sArgs.gexpvarmean_flag==1){
-			string pcl_dir = sArgs.pcl_dir_arg;
-			vector<string> pcl_list;
-			CSeekTools::ReadListOneColumn(sArgs.pcl_list_arg, pcl_list);
+			string fileName = CMeta::Basename(sArgs.pclinput_arg);
+			string fileStem = CMeta::Deextension(fileName);
+			char outFile[125];
 
-			vector<vector<float> > var;
-			var.resize(pcl_list.size());
-			vector<vector<float> > avg;
-			avg.resize(pcl_list.size());
+			string pclfile = sArgs.pclinput_arg;
+			if(!CMeta::IsExtension(pclfile, ".bin")){
+				fprintf(stderr, "Input file is not bin type!\n");
+				return 1;
+			}
+			
+			CPCL pcl;
+			if(!pcl.Open(pclfile.c_str())){
+				fprintf(stderr, "Error opening file\n");
+				return 1;
+			}
 
-			for(i=0; i<pcl_list.size(); i++){
-				string pclfile = pcl_dir + "/" + pcl_list[i] + ".bin";
-				CPCL pcl;
-				pcl.Open(pclfile.c_str());
-				//cerr << pclfile << endl;
+			vector<float> var;
+			vector<float> avg;
 
-				var[i] = vector<float>();
-				avg[i] = vector<float>();
-				CSeekTools::InitVector(var[i], vecstrGenes.size(), (float) CMeta::GetNaN());
-				CSeekTools::InitVector(avg[i], vecstrGenes.size(), (float) CMeta::GetNaN());
-				int totNumExperiments = pcl.GetExperiments() - 2;
-				if(totNumExperiments<=2) continue;
+			CSeekTools::InitVector(var, vecstrGenes.size(), (float) CMeta::GetNaN());
+			CSeekTools::InitVector(avg, vecstrGenes.size(), (float) CMeta::GetNaN());
 
+			int totNumExperiments = pcl.GetExperiments();
+			if(totNumExperiments<=2){
+				fprintf(stderr, "This dataset is skipped because it contains <=2 columns\n");
+				fprintf(stderr, "An empty vector will be returned\n");
+			}else{
 				for(j=0; j<vecstrGenes.size(); j++){
 					ushort g = pcl.GetGene(vecstrGenes[j]);
 					if(CSeekTools::IsNaN(g)) continue; //gene does not exist in the dataset
 					float *val = pcl.Get(g);
 					vector<float> rowVal;
-					for(k=2; k<pcl.GetExperiments(); k++)
+					for(k=0; k<totNumExperiments; k++)
 						rowVal.push_back(val[k]);
-
 					float mean = 0;
 					float variance = 0;
 					for(k=0; k<rowVal.size(); k++)
 					for(k=0; k<rowVal.size(); k++)
 						variance += (rowVal[k] - mean) * (rowVal[k] - mean);
 					variance /= rowVal.size();
-					var[i][j] = variance;
-					avg[i][j] = mean;
+					var[j] = variance;
+					avg[j] = mean;
 					//fprintf(stderr, "%.5f %.5f\n", mean, variance);
 				}
 				//fprintf(stderr, "done\n"); 
 			}
 			//fprintf(stderr, "G\n"); 
-			for(i=0; i<pcl_list.size(); i++){
-				string dirout = sArgs.dir_out_arg;
-				string outfile = dirout + "/" + pcl_list[i] + ".gexpvar";
-				CSeekTools::WriteArray(outfile.c_str(), var[i]);
-				outfile = dirout + "/" + pcl_list[i] + ".gexpmean";
-				CSeekTools::WriteArray(outfile.c_str(), avg[i]);
-			}
+			sprintf(outFile, "%s/%s.gexpvar", sArgs.dir_out_arg, fileStem.c_str());
+			CSeekTools::WriteArray(outFile, var);
+			sprintf(outFile, "%s/%s.gexpmean", sArgs.dir_out_arg, fileStem.c_str());
+			CSeekTools::WriteArray(outFile, avg);
 		}
 
 	}
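
The per-gene statistics written to the .gexpmean and .gexpvar files in the gexpvarmean branch above can be sketched as below. The MeanAndVariance helper is hypothetical; it assumes the mean is the ordinary arithmetic mean (that part of the loop is not shown in the hunk) and that the row is non-empty with no missing values. Like the diff, it divides the squared deviations by N rather than N - 1.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: returns {mean, population variance} of one gene's
// expression row. Assumes row is non-empty and contains no NaN entries.
std::pair<float, float> MeanAndVariance(const std::vector<float>& row) {
    float mean = 0.0f;
    for (std::size_t k = 0; k < row.size(); ++k)
        mean += row[k];
    mean /= row.size();

    float variance = 0.0f;
    for (std::size_t k = 0; k < row.size(); ++k)
        variance += (row[k] - mean) * (row[k] - mean);
    variance /= row.size();  // population variance: divide by N, not N - 1

    return std::make_pair(mean, variance);
}
```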
 			//printf("Dataset initialized"); getchar();
 			vector<string> vecstrQuery;
 
-			#pragma omp parallel for \
+			//#pragma omp parallel for \
 			shared(vc, dblist, iDatasets, m_iGenes, vecstrGenes, mapiPlatform, quant, \
 			platform_avg_threads, platform_stdev_threads, vecstrQuery, logit) \
 			private(i) firstprivate(useNibble) schedule(dynamic)
 				}
 			}
 
-			for(i=0; i<numPlatforms; i++){
+			//for(i=0; i<numPlatforms; i++){
 				//printf("Platform %s\n", mapistrPlatform[i].c_str());
 				/*for(j=0; j<vecstrQuery.size(); j++){
 					size_t iGene = mapstriGenes[vecstrQuery[j]];
 					printf("Gene %s %.5f %.5f\n", vecstrQuery[j].c_str(), platform_avg.Get(i, iGene),
 						platform_stdev.Get(i,iGene));
 				}*/
-			}
+			//}
 
 			char outFile[125];
 			sprintf(outFile, "%s/all_platforms.gplatavg", sArgs.dir_out_arg);
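
The output paths in this file are formatted into a fixed 125-byte buffer with sprintf, which can overflow if the output directory name is long. A possible alternative, shown only as a sketch and not as the author's method, is to build the path as a std::string. The JoinPath helper is hypothetical.

```cpp
#include <string>

// Hypothetical helper: join an output directory and a file name without a
// fixed-size buffer. Avoids a doubled '/' when dir already ends with one.
std::string JoinPath(const std::string& dir, const std::string& name) {
    if (dir.empty())
        return name;
    if (dir[dir.size() - 1] == '/')
        return dir + name;
    return dir + "/" + name;
}
```

The resulting string's c_str() could then be passed wherever the code currently passes outFile, assuming the callee only needs a const char pointer, as the removed outfile.c_str() calls in this diff suggest.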

File tools/SeekPrep/SeekPrep.ggo

 section "Mode"
 option	"dab"				d	"DAB mode, suitable for dataset wide gene average and stdev calculation"
 								flag	off
-option	"pcl"				e	"PCL mode, suitable for dataset gene variance calculation"
+option	"pclbin"			e	"PCL BIN mode, suitable for dataset gene variance calculation"
 								flag	off
 option	"db"				f	"DB mode, suitable for platform wide gene average and stdev calculation"
 								flag	off
 								float	default="1.0"
 
 section "PCL mode"
-option	"pcl_list"			V	"PCL list"
+option	"pclinput"			V	"PCL BIN file"
 								string typestr="filename"
-option	"pcl_dir"			F	"PCL directory"
-								string typestr="directory"
 option	"gexpvarmean"		v	"Generates gene expression variance and mean files (.gexpvar, .gexpmean)"
 								flag	off
 option	"sinfo"				s	"Generates sinfo file (dataset z score mean and stdev)"

File tools/SeekPrep/cmdline.c

 /*
-  File autogenerated by gengetopt version 2.22
+  File autogenerated by gengetopt version 2.22.5
   generated with the following command:
-  /memex/qzhu/usr/bin/gengetopt -iSeekPrep.ggo --default-optional -u -N -e 
+  /usr/bin/gengetopt -iSeekPrep.ggo --default-optional -u -N -e 
 
   The developers of gengetopt consider the fixed text that goes in all
   gengetopt output files to be in the public domain:
 #include <stdlib.h>
 #include <string.h>
 
-#include "getopt.h"
+#ifndef FIX_UNUSED
+#define FIX_UNUSED(X) (void) (X) /* avoid warnings for unused params */
+#endif
+
+#include <getopt.h>
 
 #include "cmdline.h"
 
   "      --version                Print version and exit",
   "\nMode:",
   "  -d, --dab                    DAB mode, suitable for dataset wide gene average \n                                 and stdev calculation  (default=off)",
-  "  -e, --pcl                    PCL mode, suitable for dataset gene variance \n                                 calculation  (default=off)",
+  "  -e, --pclbin                 PCL BIN mode, suitable for dataset gene variance \n                                 calculation  (default=off)",
   "  -f, --db                     DB mode, suitable for platform wide gene average \n                                 and stdev calculation  (default=off)",
   "\nDAB mode:",
   "  -a, --gavg                   Generates gene average file  (default=off)",
   "  -B, --dabinput=filename      DAB dataset file",
   "  -C, --top_avg_percent=FLOAT  For gene average, top X percent of the values to \n                                 take average (0 - 1.0)  (default=`1.0')",
   "\nPCL mode:",
-  "  -V, --pcl_list=filename      PCL list",
-  "  -F, --pcl_dir=directory      PCL directory",
+  "  -V, --pclinput=filename      PCL BIN file",
   "  -v, --gexpvarmean            Generates gene expression variance and mean \n                                 files (.gexpvar, .gexpmean)  (default=off)",
   "  -s, --sinfo                  Generates sinfo file (dataset z score mean and \n                                 stdev)  (default=off)",
   "\nDB mode:",
 void clear_args (struct gengetopt_args_info *args_info);
 
 static int
-cmdline_parser_internal (int argc, char * const *argv, struct gengetopt_args_info *args_info,
+cmdline_parser_internal (int argc, char **argv, struct gengetopt_args_info *args_info,
                         struct cmdline_parser_params *params, const char *additional_error);
 
 static int
   args_info->help_given = 0 ;
   args_info->version_given = 0 ;
   args_info->dab_given = 0 ;
-  args_info->pcl_given = 0 ;
+  args_info->pclbin_given = 0 ;
   args_info->db_given = 0 ;
   args_info->gavg_given = 0 ;
   args_info->gpres_given = 0 ;
   args_info->dabinput_given = 0 ;
   args_info->top_avg_percent_given = 0 ;
-  args_info->pcl_list_given = 0 ;
-  args_info->pcl_dir_given = 0 ;
+  args_info->pclinput_given = 0 ;
   args_info->gexpvarmean_given = 0 ;
   args_info->sinfo_given = 0 ;
   args_info->gplat_given = 0 ;
 static
 void clear_args (struct gengetopt_args_info *args_info)
 {
+  FIX_UNUSED (args_info);
   args_info->dab_flag = 0;
-  args_info->pcl_flag = 0;
+  args_info->pclbin_flag = 0;
   args_info->db_flag = 0;
   args_info->gavg_flag = 0;
   args_info->gpres_flag = 0;
   args_info->dabinput_orig = NULL;
   args_info->top_avg_percent_arg = 1.0;
   args_info->top_avg_percent_orig = NULL;
-  args_info->pcl_list_arg = NULL;
-  args_info->pcl_list_orig = NULL;
-  args_info->pcl_dir_arg = NULL;
-  args_info->pcl_dir_orig = NULL;
+  args_info->pclinput_arg = NULL;
+  args_info->pclinput_orig = NULL;
   args_info->gexpvarmean_flag = 0;
   args_info->sinfo_flag = 0;
   args_info->gplat_flag = 0;
   args_info->help_help = gengetopt_args_info_help[0] ;
   args_info->version_help = gengetopt_args_info_help[1] ;
   args_info->dab_help = gengetopt_args_info_help[3] ;
-  args_info->pcl_help = gengetopt_args_info_help[4] ;
+  args_info->pclbin_help = gengetopt_args_info_help[4] ;
   args_info->db_help = gengetopt_args_info_help[5] ;
   args_info->gavg_help = gengetopt_args_info_help[7] ;
   args_info->gpres_help = gengetopt_args_info_help[8] ;
   args_info->dabinput_help = gengetopt_args_info_help[9] ;
   args_info->top_avg_percent_help = gengetopt_args_info_help[10] ;
-  args_info->pcl_list_help = gengetopt_args_info_help[12] ;
-  args_info->pcl_dir_help = gengetopt_args_info_help[13] ;
-  args_info->gexpvarmean_help = gengetopt_args_info_help[14] ;
-  args_info->sinfo_help = gengetopt_args_info_help[15] ;
-  args_info->gplat_help = gengetopt_args_info_help[17] ;
-  args_info->dblist_help = gengetopt_args_info_help[18] ;
-  args_info->dir_prep_in_help = gengetopt_args_info_help[19] ;
-  args_info->dset_help = gengetopt_args_info_help[20] ;
-  args_info->useNibble_help = gengetopt_args_info_help[21] ;
-  args_info->quant_help = gengetopt_args_info_help[22] ;
-  args_info->logit_help = gengetopt_args_info_help[24] ;
-  args_info->input_help = gengetopt_args_info_help[26] ;
-  args_info->dir_out_help = gengetopt_args_info_help[28] ;
+  args_info->pclinput_help = gengetopt_args_info_help[12] ;
+  args_info->gexpvarmean_help = gengetopt_args_info_help[13] ;
+  args_info->sinfo_help = gengetopt_args_info_help[14] ;
+  args_info->gplat_help = gengetopt_args_info_help[16] ;
+  args_info->dblist_help = gengetopt_args_info_help[17] ;
+  args_info->dir_prep_in_help = gengetopt_args_info_help[18] ;
+  args_info->dset_help = gengetopt_args_info_help[19] ;
+  args_info->useNibble_help = gengetopt_args_info_help[20] ;
+  args_info->quant_help = gengetopt_args_info_help[21] ;
+  args_info->logit_help = gengetopt_args_info_help[23] ;
+  args_info->input_help = gengetopt_args_info_help[25] ;
+  args_info->dir_out_help = gengetopt_args_info_help[27] ;
   
 }
 
 void
 cmdline_parser_print_version (void)
 {
-  printf ("%s %s\n", CMDLINE_PARSER_PACKAGE, CMDLINE_PARSER_VERSION);
+  printf ("%s %s\n",
+     (strlen(CMDLINE_PARSER_PACKAGE_NAME) ? CMDLINE_PARSER_PACKAGE_NAME : CMDLINE_PARSER_PACKAGE),
+     CMDLINE_PARSER_VERSION);
 }
 
 static void print_help_common(void) {
   printf("\n");
 
   if (strlen(gengetopt_args_info_description) > 0)
-    printf("%s\n", gengetopt_args_info_description);
+    printf("%s\n\n", gengetopt_args_info_description);
 }
 
 void
   clear_args (args_info);
   init_args_info (args_info);
 
-  args_info->inputs = NULL;
+  args_info->inputs = 0;
   args_info->inputs_num = 0;
 }
 
   free_string_field (&(args_info->dabinput_arg));
   free_string_field (&(args_info->dabinput_orig));
   free_string_field (&(args_info->top_avg_percent_orig));
-  free_string_field (&(args_info->pcl_list_arg));
-  free_string_field (&(args_info->pcl_list_orig));
-  free_string_field (&(args_info->pcl_dir_arg));
-  free_string_field (&(args_info->pcl_dir_orig));
+  free_string_field (&(args_info->pclinput_arg));
+  free_string_field (&(args_info->pclinput_orig));
   free_string_field (&(args_info->dblist_arg));
   free_string_field (&(args_info->dblist_orig));
   free_string_field (&(args_info->dir_prep_in_arg));
 
 
 static void
-write_into_file(FILE *outfile, const char *opt, const char *arg, char *values[])
+write_into_file(FILE *outfile, const char *opt, const char *arg, const char *values[])
 {
+  FIX_UNUSED (values);
   if (arg) {
     fprintf(outfile, "%s=\"%s\"\n", opt, arg);
   } else {
     write_into_file(outfile, "version", 0, 0 );
   if (args_info->dab_given)
     write_into_file(outfile, "dab", 0, 0 );
-  if (args_info->pcl_given)
-    write_into_file(outfile, "pcl", 0, 0 );
+  if (args_info->pclbin_given)
+    write_into_file(outfile, "pclbin", 0, 0 );
   if (args_info->db_given)
     write_into_file(outfile, "db", 0, 0 );
   if (args_info->gavg_given)
     write_into_file(outfile, "dabinput", args_info->dabinput_orig, 0);
   if (args_info->top_avg_percent_given)
     write_into_file(outfile, "top_avg_percent", args_info->top_avg_percent_orig, 0);
-  if (args_info->pcl_list_given)
-    write_into_file(outfile, "pcl_list", args_info->pcl_list_orig, 0);
-  if (args_info->pcl_dir_given)
-    write_into_file(outfile, "pcl_dir", args_info->pcl_dir_orig, 0);
+  if (args_info->pclinput_given)
+    write_into_file(outfile, "pclinput", args_info->pclinput_orig, 0);
   if (args_info->gexpvarmean_given)
     write_into_file(outfile, "gexpvarmean", 0, 0 );
   if (args_info->sinfo_given)
 char *
 gengetopt_strdup (const char *s)
 {
-  char *result = NULL;
+  char *result = 0;
   if (!s)
     return result;
 
 }
 
 int
-cmdline_parser (int argc, char * const *argv, struct gengetopt_args_info *args_info)
+cmdline_parser (int argc, char **argv, struct gengetopt_args_info *args_info)
 {
   return cmdline_parser2 (argc, argv, args_info, 0, 1, 1);
 }
 
 int
-cmdline_parser_ext (int argc, char * const *argv, struct gengetopt_args_info *args_info,
+cmdline_parser_ext (int argc, char **argv, struct gengetopt_args_info *args_info,
                    struct cmdline_parser_params *params)
 {
   int result;
-  result = cmdline_parser_internal (argc, argv, args_info, params, NULL);
+  result = cmdline_parser_internal (argc, argv, args_info, params, 0);
 
   return result;
 }
 
 int
-cmdline_parser2 (int argc, char * const *argv, struct gengetopt_args_info *args_info, int override, int initialize, int check_required)
+cmdline_parser2 (int argc, char **argv, struct gengetopt_args_info *args_info, int override, int initialize, int check_required)
 {
   int result;
   struct cmdline_parser_params params;
   params.check_ambiguity = 0;
   params.print_errors = 1;
 
-  result = cmdline_parser_internal (argc, argv, args_info, &params, NULL);
+  result = cmdline_parser_internal (argc, argv, args_info, &params, 0);
 
   return result;
 }
 {
   int result = EXIT_SUCCESS;
 
-  if (cmdline_parser_required2(args_info, prog_name, NULL) > 0)
+  if (cmdline_parser_required2(args_info, prog_name, 0) > 0)
     result = EXIT_FAILURE;
 
   return result;
 cmdline_parser_required2 (struct gengetopt_args_info *args_info, const char *prog_name, const char *additional_error)
 {
   int error = 0;
+  FIX_UNUSED (additional_error);
 
   /* checks for required options */
   if (! args_info->input_given)
 static
 int update_arg(void *field, char **orig_field,
                unsigned int *field_given, unsigned int *prev_given, 
-               char *value, char *possible_values[], const char *default_value,
+               char *value, const char *possible_values[],
+               const char *default_value,
                cmdline_parser_arg_type arg_type,
                int check_ambiguity, int override,
                int no_free, int multiple_option,
   const char *val = value;
   int found;
   char **string_field;
+  FIX_UNUSED (field);
 
   stop_char = 0;
   found = 0;
       return 1; /* failure */
     }
 
+  FIX_UNUSED (default_value);
     
   if (field_given && *field_given && ! override)
     return 0;
 
 
 int
-cmdline_parser_internal (int argc, char * const *argv, struct gengetopt_args_info *args_info,
+cmdline_parser_internal (
+  int argc, char **argv, struct gengetopt_args_info *args_info,
                         struct cmdline_parser_params *params, const char *additional_error)
 {
   int c;	/* Character of the parsed option.  */
         { "help",	0, NULL, 'h' },
         { "version",	0, NULL, 0 },
         { "dab",	0, NULL, 'd' },
-        { "pcl",	0, NULL, 'e' },
+        { "pclbin",	0, NULL, 'e' },
         { "db",	0, NULL, 'f' },
         { "gavg",	0, NULL, 'a' },
         { "gpres",	0, NULL, 'p' },
         { "dabinput",	1, NULL, 'B' },
         { "top_avg_percent",	1, NULL, 'C' },
-        { "pcl_list",	1, NULL, 'V' },
-        { "pcl_dir",	1, NULL, 'F' },
+        { "pclinput",	1, NULL, 'V' },
         { "gexpvarmean",	0, NULL, 'v' },
         { "sinfo",	0, NULL, 's' },
         { "gplat",	0, NULL, 'P' },
         { "logit",	0, NULL, 'l' },
         { "input",	1, NULL, 'i' },
         { "dir_out",	1, NULL, 'D' },
-        { NULL,	0, NULL, 0 }
+        { 0,  0, 0, 0 }
       };
 
-      c = getopt_long (argc, argv, "hdefapB:C:V:F:vsPb:I:A:NQ:li:D:", long_options, &option_index);
+      c = getopt_long (argc, argv, "hdefapB:C:V:vsPb:I:A:NQ:li:D:", long_options, &option_index);
 
       if (c == -1) break;	/* Exit from `while (1)' loop.  */
 
             goto failure;
         
           break;
-        case 'e':	/* PCL mode, suitable for dataset gene variance calculation.  */
+        case 'e':	/* PCL BIN mode, suitable for dataset gene variance calculation.  */
         
         
-          if (update_arg((void *)&(args_info->pcl_flag), 0, &(args_info->pcl_given),
-              &(local_args_info.pcl_given), optarg, 0, 0, ARG_FLAG,
-              check_ambiguity, override, 1, 0, "pcl", 'e',
+          if (update_arg((void *)&(args_info->pclbin_flag), 0, &(args_info->pclbin_given),
+              &(local_args_info.pclbin_given), optarg, 0, 0, ARG_FLAG,
+              check_ambiguity, override, 1, 0, "pclbin", 'e',
               additional_error))
             goto failure;
         
             goto failure;
         
           break;
-        case 'V':	/* PCL list.  */
+        case 'V':	/* PCL BIN file.  */
         
         
-          if (update_arg( (void *)&(args_info->pcl_list_arg), 
-               &(args_info->pcl_list_orig), &(args_info->pcl_list_given),
-              &(local_args_info.pcl_list_given), optarg, 0, 0, ARG_STRING,
+          if (update_arg( (void *)&(args_info->pclinput_arg), 
+               &(args_info->pclinput_orig), &(args_info->pclinput_given),
+              &(local_args_info.pclinput_given), optarg, 0, 0, ARG_STRING,
               check_ambiguity, override, 0, 0,
-              "pcl_list", 'V',
-              additional_error))
-            goto failure;
-        
-          break;
-        case 'F':	/* PCL directory.  */
-        
-        
-          if (update_arg( (void *)&(args_info->pcl_dir_arg), 
-               &(args_info->pcl_dir_orig), &(args_info->pcl_dir_given),
-              &(local_args_info.pcl_dir_given), optarg, 0, 0, ARG_STRING,
-              check_ambiguity, override, 0, 0,
-              "pcl_dir", 'F',
+              "pclinput", 'V',
               additional_error))
             goto failure;
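
The renamed options in this hunk can be tried out in isolation with a reduced option table. The sketch below is not the Gengetopt-generated parser: it keeps only `--pclbin`/`-e` and `--pclinput`/`-V` from the diff, and the `PrepOptions` struct and `parse_prep_options` function are hypothetical names introduced for illustration. It assumes a platform that provides `getopt_long` in `<getopt.h>`.

```cpp
#include <getopt.h>
#include <string>

// Hypothetical container for the two options touched by this commit.
struct PrepOptions {
    bool pclbin = false;      // -e / --pclbin   (flag, no argument)
    std::string pclinput;     // -V / --pclinput (requires an argument)
    bool ok = true;           // false if an unknown option or missing argument was seen
};

// Parse only the renamed options; the real parser handles many more.
PrepOptions parse_prep_options(int argc, char **argv) {
    static const struct option long_options[] = {
        {"pclbin",   no_argument,       nullptr, 'e'},
        {"pclinput", required_argument, nullptr, 'V'},
        {nullptr, 0, nullptr, 0}
    };

    PrepOptions opts;
    optind = 1;  // restart scanning in case getopt was used earlier
    int c;
    while ((c = getopt_long(argc, argv, "eV:", long_options, nullptr)) != -1) {
        switch (c) {
        case 'e': opts.pclbin = true; break;
        case 'V': opts.pclinput = optarg; break;
        default:  opts.ok = false; break;  // '?' is returned on errors
        }
    }
    return opts;
}
```

Calling `parse_prep_options` with `SeekPrep --pclbin --pclinput x.bin` would set `pclbin` and store `x.bin` in `pclinput`; the removed `-F`/`--pcl_dir` option is intentionally absent from the table.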
         

File tools/SeekPrep/cmdline.h

 /** @file cmdline.h
  *  @brief The header file for the command line option parser
- *  generated by GNU Gengetopt version 2.22
+ *  generated by GNU Gengetopt version 2.22.5
  *  http://www.gnu.org/software/gengetopt.
  *  DO NOT modify this file, since it can be overwritten
  *  @author GNU Gengetopt by Lorenzo Bettini */
 #endif /* __cplusplus */
 
 #ifndef CMDLINE_PARSER_PACKAGE
-/** @brief the program name */
+/** @brief the program name (used for printing errors) */
 #define CMDLINE_PARSER_PACKAGE "SeekPrep"
 #endif
 
+#ifndef CMDLINE_PARSER_PACKAGE_NAME
+/** @brief the complete program name (used for help and version) */
+#define CMDLINE_PARSER_PACKAGE_NAME "SeekPrep"
+#endif
+
 #ifndef CMDLINE_PARSER_VERSION
 /** @brief the program version */
 #define CMDLINE_PARSER_VERSION "1.0"
   const char *version_help; /**< @brief Print version and exit help description.  */
   int dab_flag;	/**< @brief DAB mode, suitable for dataset wide gene average and stdev calculation (default=off).  */
   const char *dab_help; /**< @brief DAB mode, suitable for dataset wide gene average and stdev calculation help description.  */
-  int pcl_flag;	/**< @brief PCL mode, suitable for dataset gene variance calculation (default=off).  */
-  const char *pcl_help; /**< @brief PCL mode, suitable for dataset gene variance calculation help description.  */
+  int pclbin_flag;	/**< @brief PCL BIN mode, suitable for dataset gene variance calculation (default=off).  */
+  const char *pclbin_help; /**< @brief PCL BIN mode, suitable for dataset gene variance calculation help description.  */
   int db_flag;	/**< @brief DB mode, suitable for platform wide gene average and stdev calculation (default=off).  */
   const char *db_help; /**< @brief DB mode, suitable for platform wide gene average and stdev calculation help description.  */
   int gavg_flag;	/**< @brief Generates gene average file (default=off).  */
   float top_avg_percent_arg;	/**< @brief For gene average, top X percent of the values to take average (0 - 1.0) (default='1.0').  */
   char * top_avg_percent_orig;	/**< @brief For gene average, top X percent of the values to take average (0 - 1.0) original value given at command line.  */
   const char *top_avg_percent_help; /**< @brief For gene average, top X percent of the values to take average (0 - 1.0) help description.  */
-  char * pcl_list_arg;	/**< @brief PCL list.  */
-  char * pcl_list_orig;	/**< @brief PCL list original value given at command line.  */
-  const char *pcl_list_help; /**< @brief PCL list help description.  */
-  char * pcl_dir_arg;	/**< @brief PCL directory.  */
-  char * pcl_dir_orig;	/**< @brief PCL directory original value given at command line.  */
-  const char *pcl_dir_help; /**< @brief PCL directory help description.  */
+  char * pclinput_arg;	/**< @brief PCL BIN file.  */
+  char * pclinput_orig;	/**< @brief PCL BIN file original value given at command line.  */
+  const char *pclinput_help; /**< @brief PCL BIN file help description.  */
   int gexpvarmean_flag;	/**< @brief Generates gene expression variance and mean files (.gexpvar, .gexpmean) (default=off).  */
   const char *gexpvarmean_help; /**< @brief Generates gene expression variance and mean files (.gexpvar, .gexpmean) help description.  */
   int sinfo_flag;	/**< @brief Generates sinfo file (dataset z score mean and stdev) (default=off).  */
   unsigned int help_given ;	/**< @brief Whether help was given.  */
   unsigned int version_given ;	/**< @brief Whether version was given.  */
   unsigned int dab_given ;	/**< @brief Whether dab was given.  */
-  unsigned int pcl_given ;	/**< @brief Whether pcl was given.  */
+  unsigned int pclbin_given ;	/**< @brief Whether pclbin was given.  */
   unsigned int db_given ;	/**< @brief Whether db was given.  */
   unsigned int gavg_given ;	/**< @brief Whether gavg was given.  */
   unsigned int gpres_given ;	/**< @brief Whether gpres was given.  */
   unsigned int dabinput_given ;	/**< @brief Whether dabinput was given.  */
   unsigned int top_avg_percent_given ;	/**< @brief Whether top_avg_percent was given.  */
-  unsigned int pcl_list_given ;	/**< @brief Whether pcl_list was given.  */
-  unsigned int pcl_dir_given ;	/**< @brief Whether pcl_dir was given.  */
+  unsigned int pclinput_given ;	/**< @brief Whether pclinput was given.  */
   unsigned int gexpvarmean_given ;	/**< @brief Whether gexpvarmean was given.  */
   unsigned int sinfo_given ;	/**< @brief Whether sinfo was given.  */
   unsigned int gplat_given ;	/**< @brief Whether gplat was given.  */
  * @param args_info the structure where option information will be stored
  * @return 0 if everything went fine, NON 0 if an error took place
  */
-int cmdline_parser (int argc, char * const *argv,
+int cmdline_parser (int argc, char **argv,
   struct gengetopt_args_info *args_info);
 
 /**
  * @return 0 if everything went fine, NON 0 if an error took place
  * @deprecated use cmdline_parser_ext() instead
  */
-int cmdline_parser2 (int argc, char * const *argv,
+int cmdline_parser2 (int argc, char **argv,
   struct gengetopt_args_info *args_info,
   int override, int initialize, int check_required);
 
  * @param params additional parameters for the parser
  * @return 0 if everything went fine, NON 0 if an error took place
  */
-int cmdline_parser_ext (int argc, char * const *argv,
+int cmdline_parser_ext (int argc, char **argv,
   struct gengetopt_args_info *args_info,
   struct cmdline_parser_params *params);
 

File tools/SeekPrep/stdafx.cpp

 /*!
  * \page SeekPrep SeekPrep
  * 
+ * Prepares the prerequisite files needed for the efficient integration
+ * of coexpressions in \ref SeekMiner and \ref SeekServer.
+ * The file preparation tasks that SeekPrep performs include
+ * preparing the gene-presence file, calculating the gene average correlation,
+ * and calculating the gene expression variances for each dataset.
+ *
  * 
  * \section sec_usage Usage
  * 
  * \subsection ssec_usage_basic Basic Usage
  * 
+ * \subsubsection ssec_usage_avg Prepare Gene Average File (GAVG)
  * \code
- * SeekPrep -i <genes.txt> -x <db list> -d <input directory> -D <output_dir>
+ * SeekPrep -i <gene_map> -d -B <dab_file> -a -D <output_dir>
  * \endcode
+ * Calculates the average z-score for each gene in a given DAB matrix and stores the results
+ * as a vector of floats in the GAVG file. The index of a gene in the vector is determined by \c gene_map.
  * 
+ * \subsubsection ssec_usage_pres Prepare Gene Presence File (GPRES)
+ * \code
+ * SeekPrep -i <gene_map> -d -B <dab_file> -p -D <output_dir>
+ * \endcode
+ * Stores the gene presence vector for a given DAB matrix, where each value is either
+ * 1 if the gene is present, or 0 if the gene is absent in the dataset.
+ *
+ * \subsubsection ssec_usage_sinfo Prepare Dataset Sinfo file (SINFO)
+ * \code
+ * SeekPrep -i <gene_map> -e -V <pclbin_file> -s -D <output_dir>
+ * \endcode
+ * Calculates the average Fisher-transformed correlation between all gene pairs in an input dataset.
+ * The input dataset needs to be a binary PCL file with the extension BIN (generated by \ref PCL2Bin).
+ *
+ * \subsubsection ssec_usage_gexpvar Prepare Dataset Gene Expression Variance file (GEXPVAR)
+ * \code
+ * SeekPrep -i <gene_map> -e -V <pclbin_file> -v -D <output_dir>
+ * \endcode
+ * Calculates the gene expression variance for each gene in an input dataset.
+ *
+ * \subsubsection ssec_usage_plat Prepare Platform average z-scores and their standard deviation (GPLAT)
+ * \code
+ * SeekPrep -i <gene_map> -f -P -b <db_file_list> -I <prep_dir> -A <dset_platform_map> -Q <quant>
+ * \endcode
+ * Calculates the platform-wide average of z-scores (\f$z_{p,avg}\f$) using the following algorithm: <br>
+ * For each dataset \f$d\f$: <br>
+ * &nbsp;&nbsp;&nbsp; For each gene \f$g\f$ in the genome \f$G\f$: <br>
+ * &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Compute \f$z_{d, avg}(g) = (\sum_{i \in G}{z_{d}(g, i)}) / |G|\f$ <br>
+ * &nbsp;&nbsp;&nbsp; For each gene \f$k\f$ in the genome \f$G\f$: <br>
+ * &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Compute \f$z_{d, avg, corrected}(k) = (\sum_{g \in G}{(z_{d}(k, g) - z_{d, avg}(g))}) / |G|\f$ <br>
+ * For each platform \f$p\f$ and its set of datasets \f$D_p\f$: <br>
+ * &nbsp;&nbsp;&nbsp; For each gene \f$k\f$ in the genome \f$G\f$: <br>
+ * &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Compute \f$z_{p, avg}(k) = (\sum_{d \in D_p}{z_{d,avg,corrected}(k)}) / |D_p| \f$ <br>
+ *
+ * The \c prep_dir contains the GPRES and GAVG files for all datasets defined in \c dset_platform_map. (Users should generate these files with SeekPrep first.)
+ *
+ * The \c dset_platform_map is a tab-delimited file that looks something like:
+ * \code
+ * GSE15913.GPL570.pcl  GPL570
+ * GSE16122.GPL2005.pcl GPL2005
+ * GSE16797.GPL570.pcl  GPL570
+ * GSE16836.GPL570.pcl  GPL570
+ * GSE17351.GPL570.pcl  GPL570
+ * GSE17537.GPL570.pcl  GPL570
+ * \endcode
+ * where the 1st column is the dataset name and the 2nd column is the corresponding platform.
  * 
+ * The \c quant file is a space-delimited file that specifies how the z-scores are binned:
+ * \code
+ * -5.00 -4.96 -4.92 -4.88 -4.84 -4.80 -4.76 -4.72 -4.68 -4.64 -4.60 -4.56 -4.52 ...
+ * \endcode
+ *
+ * The \c db_file_list file lists the file paths of all DB files in the collection:
+ * \code
+ * /x/y/z/00000001.db 
+ * /x/y/z/00000002.db 
+ * /x/y/z/00000003.db 
+ * /x/y/z/00000004.db 
+ * /x/y/z/00000005.db
+ * ...
+ * \endcode 
+ *
  * \subsection ssec_usage_detailed Detailed Usage
  * 
  * \include SeekPrep/SeekPrep.ggo
  * 
- * <table><tr>
- *	<th>Flag</th>
- *	<th>Default</th>
- *	<th>Type</th>
- *	<th>Description</th>
- * </tr><tr>
- *	<td>-i</td>
- *	<td>stdin</td>
- *	<td>Text file</td>
- *	<td>Tab-delimited text file containing two columns, numerical gene IDs (one-based) and unique gene
- *		names (matching those in the input DAT/DAB files).</td>
- * </tr><tr>
- *	<td>-d</td>
- *	<td>.</td>
- *	<td>Directory</td>
- *	<td>Input directory containing DB files</td>
- * </tr><tr>
- *	<td>-D</td>
- *	<td>.</td>
- *	<td>Directory</td>
- *	<td>Output directory in which database files will be stored.</td>
- * </tr><tr>
- *	<td>-x</td>
- *	<td>.</td>
- *	<td>Text file</td>
- *	<td>Input file containing list of CDatabaselets to combine</td>
- * </tr></table>
  */
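
The platform-average steps described in the GPLAT section above can be sketched in C++ as follows. This is only an illustration of the three formulas, not SeekPrep's implementation: it assumes each dataset is a dense, square z-score matrix over the same genome with no missing values, and the names `Matrix`, `gene_average`, `corrected_gene_average`, and `platform_average` are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Assumption: z[g][i] holds z_d(g, i) for one dataset, |G| x |G|, no missing values.
using Matrix = std::vector<std::vector<float>>;

// z_{d,avg}(g) = (sum over i of z_d(g, i)) / |G|
std::vector<float> gene_average(const Matrix &z) {
    std::vector<float> avg(z.size(), 0.0f);
    for (std::size_t g = 0; g < z.size(); ++g) {
        float sum = 0.0f;
        for (float v : z[g]) sum += v;
        avg[g] = sum / static_cast<float>(z.size());
    }
    return avg;
}

// z_{d,avg,corrected}(k) = (sum over g of (z_d(k, g) - z_{d,avg}(g))) / |G|
std::vector<float> corrected_gene_average(const Matrix &z) {
    const std::vector<float> avg = gene_average(z);
    std::vector<float> corrected(z.size(), 0.0f);
    for (std::size_t k = 0; k < z.size(); ++k) {
        float sum = 0.0f;
        for (std::size_t g = 0; g < z.size(); ++g) sum += z[k][g] - avg[g];
        corrected[k] = sum / static_cast<float>(z.size());
    }
    return corrected;
}

// z_{p,avg}(k) = (sum over d in D_p of z_{d,avg,corrected}(k)) / |D_p|
// platform_of[d] names the platform of datasets[d], as in dset_platform_map.
std::map<std::string, std::vector<float>> platform_average(
        const std::vector<Matrix> &datasets,
        const std::vector<std::string> &platform_of) {
    std::map<std::string, std::vector<float>> sum;
    std::map<std::string, std::size_t> count;
    for (std::size_t d = 0; d < datasets.size(); ++d) {
        const std::vector<float> c = corrected_gene_average(datasets[d]);
        std::vector<float> &acc = sum[platform_of[d]];
        if (acc.empty()) acc.assign(c.size(), 0.0f);
        for (std::size_t k = 0; k < c.size(); ++k) acc[k] += c[k];
        ++count[platform_of[d]];
    }
    for (auto &entry : sum)
        for (float &v : entry.second)
            v /= static_cast<float>(count[entry.first]);
    return sum;
}
```

In practice SeekPrep reads these values from the DB, GPRES, and GAVG files rather than from in-memory matrices, so the sketch is useful mainly for checking the arithmetic on a small example.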

File tools/SeekServer/stdafx.cpp

  * \page SeekServer SeekServer
  * 
  * SeekServer runs the coexpression mining algorithm using a multithreaded TCP/IP interface.
- * When it is running, SeekServer services requests over the network from multiple connected clients
- * for genes that co-express with the client's query genes.
- * A list of genes that are found by the algorithm to be coexpressed with the query genes and a list of datasets
- * where this coexpression with the query is found to be occurring are sent back to the client.
+ * While running, SeekServer serves multiple connected clients over the network, handling requests for
+ * genes that coexpress with the client's query genes and for datasets that are related to the query genes.
  * 
  * \section sec_usage Usage
  * 
  *
  * \subsubsection ssec_cl Client Request Format
  *
- * When a client request comes in, SeekServer looks for the following sequence of 4 strings that are sent by the client:
+ * When a client request comes in, SeekServer looks for the following sequence of 4 strings in the request message:
  *
  * \li \c strSearchDataset. Dataset names, as listed in the \c dset_platform_map, to be used for the search.
  * Delimited by " ".
  * \li \c strOutputDir. Output directory where intermediate results are generated. Must be a directory that the running user of
  * SeekServer has access to. \c /tmp is recommended.
  *
- * \li \c strSearchParameter. A string of the form "1_2_3_4" where each number denotes the following:
- * 1 - the search method, one of \c RBP, \c OrderStatistics, \c EqualWeighting <br>
- * 2 - rbp parameter p (a float 0.90 - 0.99). Recommended 0.99. <br>
+ * \li \c strSearchParameter. A string of the form "1_2_3_4" where each number denotes the following: <br>
+ * 1 - the search method, one of \c RBP, \c OrderStatistics, \c EqualWeighting. Recommended \c RBP (also known as the CV weighting). <br>
+ * 2 - RBP parameter \a p (a \c float in the range 0.90 - 0.99). Recommended 0.99. <br>
  * 3 - minimum fraction of query required to score each dataset (0 - 1.0). Recommended 0 (no minimum). <br>
- * 4 - distance measure, one of \c Correlation, \c Zscore, \c ZscoreHubbinessCorrected. <br>
+ * 4 - distance measure, one of \c Correlation, \c Zscore, \c ZscoreHubbinessCorrected. Recommended \c ZscoreHubbinessCorrected.<br>
  *
- * See Sleipnir::CSeekNetwork for the specification of the format of an incoming string message.
+ * See Sleipnir::CSeekNetwork for the specification of an incoming string message.
  *
  * Once SeekServer correctly receives the above 4 strings, a search instance using the provided search parameters will
  * be initiated on the server side.
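
A client or test harness may want to check a \c strSearchParameter value before sending it. The following C++ sketch splits such a string into its four fields. It assumes the fields are separated by underscores and that the method and distance names are written out literally (for example `RBP_0.99_0_ZscoreHubbinessCorrected`); the `SearchParameter` struct and `parse_search_parameter` function are hypothetical and are not part of SeekServer.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical holder for the four fields of strSearchParameter.
struct SearchParameter {
    std::string method;        // RBP, OrderStatistics, or EqualWeighting
    float rbp_p;               // RBP parameter p
    float min_query_fraction;  // minimum fraction of query required per dataset
    std::string distance;      // Correlation, Zscore, or ZscoreHubbinessCorrected
};

// Split on '_' and convert the two numeric fields.
// Throws std::invalid_argument if the string does not have exactly four fields
// or if a numeric field cannot be converted by std::stof.
SearchParameter parse_search_parameter(const std::string &s) {
    std::vector<std::string> fields;
    std::stringstream ss(s);
    std::string token;
    while (std::getline(ss, token, '_')) fields.push_back(token);
    if (fields.size() != 4)
        throw std::invalid_argument("expected 4 underscore-separated fields");
    return SearchParameter{fields[0], std::stof(fields[1]),
                           std::stof(fields[2]), fields[3]};
}
```

The sketch does not validate that the method and distance names belong to the allowed sets; a stricter check could compare them against the values listed above.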
  * \subsubsection ssec_out Outgoing Message Format
  *
  * Each outgoing message is generated upon finishing searching the client's query. In general, if the search is successful,
- * the client expects two arrays from the SeekServer in sequence: a binary float array of dataset weights, and a binary float array
- * of gene scores. An element at index \a i in the dataset array represents the weight of the dataset with ID = \a i.
- * An element at index \a j in the gene array represents the score of the gene with ID = \a j.
+ * SeekServer sends the client the following two arrays in sequence:
+ * \li a binary \c float array of <b>dataset weights</b>, indicating how datasets are related to the query.
+ * \li a binary \c float array of <b>gene scores</b>, indicating how genes are coexpressed with the query. 
  *
- * See Sleipnir::CSeekNetwork for the specification of the format of an outgoing float array.
+ * See Sleipnir::CSeekNetwork for the specification of an outgoing float array.
  *
  *
  * \subsubsection ssec_search Query-independent search setting files and directories
  *
  * These include the following: \c dset_platform_map, \c gene_map, \c db_dir, \c prep_dir, \c platform_dir, \c quant,
  * \c sinfo_dir.
- * For a discussion of these files and directories, please refer to the SeekMiner page in section:
+ * For a discussion of these files and directories, please refer to the \ref SeekMiner page in section:
  * Query-independent search setting files and directories.
  *
  *