Water Skimmer: Natural Language Processing and Entity Extraction for Understanding Water Issues from Public Discourse

This file should be accompanied by; expanding this file should give a single directory with 
8 subdirectories:


Using the code:

1) The easiest way to use this code is to use the Eclipse development environment, available at The
code has been tested with eclipse 'Kepler' version, but should work with others. Obtain a version of Eclipse and make
sure that the 'Groovy- Eclipse' and 'Groovy- SDK' plugins are installed. You can install these by going to Eclipse's
"Help... Install New Software.. " menu, adding as a site, and
installing the appropriate items from the list.

2) In Eclipse, 'Import' the entire collection of projects into the IDE by using the 'File... Import...' selecting the
'Existing Projects Into Workspace' option, then browsing to the 'CODE' directory in this distribution, and then
selecting all projects and choosing 'Finish'.

The result is that eight projects corresponding to the eight directories should be visible in the Eclipse 'Package
Explorer' and/or 'Navigator' windows.

Eclipse may indicate 'warnings', but there should be no errors; these projects are collectively self-contained and
should be buildable and executable.

3) You will need two database servers: 

A MongoDB that contains (unstructured) source data, and
An SQL-based database that can contain the (structured) output data and perform the analyses

You will also need a MongoDB instance to receive output; you may elect to use a different MongoDB database for this,
or you may use the same MongoDB database in which source data are stored.  Please note that the routines save output
to both the SQL and the MongoDB databases, but the MongoDB output is obsoleted and is not discussed in publication.

The MongoDB source database must contain a collection that includes data with fields shown in this example:

	"_id" : ObjectId("123456789abcde"),
	"title" : "Article Title",
	"location" : "http://articleURL.html",
	"date" : ISODate("2000-01-01T00:00:00Z"),
	"source" : "Name Of Newspaper",
	"authors" : [
			"firstName" : "",
			"lastName" : "",
			"initials" : ""
	"contents" : "Article Contents Go Here\nSeparate Lines with backslash-n\n",
	"length" : 57
The 'length' entry is the number of characters in 'contents', counting \n as 1. The 'authors' fields are not used and
can remain blank.

The destination database server must be SQL-Based; the analysis routine will create a new database on this server to
contain the analysis results.

4) You must add information to the following files:

File: CNHMiner/src/cnh/cnhminer/runner/
Line: 69
Add the name of the source collection in MongoDB; you may add a list of sources by defining elements of the static
String array named 'COLLECTIONS'

File: CNHMiner/desc/CPE_Data_To_Analysis.xml
Line: 12
Add the database name of the MongoDB source database

Line: 18
Add the IP Address of the MongoDB source database

Line: 22
Add the collection name of the MongoDB source database (optional- overridden by CNHMinerRunner)

Line: 106
Add the database name of the MongoDB output database

Line: 112
Add the collection name of the MongoDB output database (optional- overridden by CNHMinerRunner)

File: SQLComponents/desc/SQLOutputter.xml
Line: 54
Insert the class name of the JDBC driver you will be using to connect with your SQL database server

Line: 58
Insert some or all of the connection string that your JDBC driver will use to establish the connection with the SQL
server; often this is a connection type, e.g. "jdbc:sqlserver"

File: SQLComponents/src/cnh/sql/
Line: 15
Add the completion of the connection string (usually a host name or IP address) of the SQL database that will receive
the initial analysis results and perform the additional analyses

File: MongoDBComponents/desc/MongoDBOutAndScorer.xml
Line: 82
Insert the IP or Host address of the MongoDB that will receive the analysis output; this may be the same as the input

File: MongoDBComponents/desc/MongoDBSentimentOutputter.xml
Line: 63
Insert the IP or Host address of the MongoDB that will receive the analysis output; this may be the same as the input

File: MongoDBComponents/desc/ScorerAnnotater.xml
Line: 59
Insert the IP or Host address of the MongoDB that will receive the analysis output; this may be the same as the input

5) You must add a JDBC-Compliant library to your classpath; it should provide a class that matches the class that you
entered in SQLOutputter.xml at line 54.

To add it to your classpath, copy it into the lib directory of the CNHMiner project. Then open Eclipse's "Run... Run
Configurations..." dialog box, select 'CNHMinerRunner' from the list at the left, choose the 'Classpath' tab, and use
'Add Jars...' to select this library and add it.

Salient Code:

Java code that identifies potential water authority entities is contained in:


SQL code that performs Term Frequency (TF) and extraction of relevant water authority entities based TF is found in
two files:


Running Water Skimmer

To launch the analysis, use Eclipse's 'Run... Run Configurations...' menu and select 'CNHMinerRunner'.

You will be prompted with a dialog box that asks for a username and password for the MongoDB database. Then a second
box will open; you will be allowed to select from a list of options corresponding to those you added on line 15. You will also be asked for the username and password, and a prefix and suffix. The
prefix and suffix are added to the collection name (from line 71); the SQL database will DELETE
(yes, DELETE!) any existing database with the name PREFIX+collectionname+SUFFIX and then create a new one with that
name, which it will then populate with the analysis results.

Rebuilding TF scores

The basic analyses resulting in our customized TF scores are performed in the SQL database using routines that are
created as stored 
procedures in the database as part of the mining process (they can also be created manually from the source files in
SQLComponents/sql/sprocs directory). The procedure that does the work is:


The arguments are:

     Arg1: Global Weighting System used. This is the 'IDF' portion of the TF_IDF score, which (because it is a scalar
has no effect on the analyses presented here. However, for other analyses it may be relevant, and several options are
available, selectable via a numeric argument:

         0 =     LOG( N / (Ni + 1)) + 1                                                              AS CNH_IDF,
         1 =     SQRT( 1 / (SUM(LOCAL_FREQ * LOCAL_FREQ)) )                                          AS DUMAIS_NORMAL,
         2 =     GLOBAL_FREQ / Ni                                                                    AS DUMAIS_GFIDF,
         3 =     LOG( N / Ni) + 1                                                                    AS DUMAIS_IDF,
         4 =     1 - SUM( (LOCAL_FREQ / GLOBAL_FREQ) * LOG(LOCAL_FREQ/GLOBAL_FREQ) / LOG(N) )        AS
         5 =     LOG( N / (Ni))                                                                      AS CK_IDFB,
         6 =     LOG( ( N - Ni ) / Ni )                                                              AS CK_IDFP,
         7 =     1 + SUM( (LOCAL_FREQ / GLOBAL_FREQ) * LOG(LOCAL_FREQ/GLOBAL_FREQ) / LOG(N) )        AS CK_ENPY,
         8 =     GLOBAL_FREQ / Ni                                                                    AS CK_IGFF,
         9 =     LOG( (GLOBAL_FREQ / LOCAL_FREQ) + 1)                                                AS CK_IGFL,
         10 =    GLOBAL_FREQ / Ni + 1                                                                AS CK_IGFI,
         11 =    SQRT( ( GLOBAL_FREQ / Ni ) - 0.9)                                                   AS CK_IGFS

If any other value is specified, '1' is used.

     Arg2: Local Weighting System used. This is the 'TF' portion of the TF_IDF score. Several options are available,
selected via a numeric argument:

         0 =     SQRT(LOCAL_FREQ) * TITLE_BOOST                                                      AS CNH_TF,
         1 =     1                                                                                   AS DUMAIS_BINARY,
         2 =     LOCAL_FREQ                                                                          AS DUMAIS_FREQ,
         3 =     LOG(LOCAL_FREQ + 1)                                                                 AS DUMAIS_LOG,
         4 =     1                                                                                   AS CK_BINARY,
         5 =     LOCAL_FREQ                                                                          AS CK_FREQ,
         6 =     LOG(LOCAL_FREQ)                                                                     AS CK_LOG,
         7 =     (1 + LOG(LOCAL_FREQ))/( 1 + LOG(AVERAGE_FREQ) )                                     AS CK_LOGN,
         8 =     0.5 + 0.5 * LOCAL_FREQ / AVERAGE_FREQ                                               AS CK_AUGNORM

If any other value is specified, '1' is used.

     Arg3: A normalization method (used to correct for different document lengths). Selected as the previous two

         0 =     1/ SQRT(tokenCount)                                                              AS
If any other value is specified, '1' is used.

     Arg4: Boost value. This is the amount by which a keyword appearing in the title boosts a documents TF score.

     Arg5: Minimum number of sentences that a document must have to be considered a valid document.

     Arg6: Query Angle Fraction. Not used in the code (commented out), but refers to the vector-space model for
the relatedness of a document to two terms (instead of just one); if this is done, IDF is also a relevant variable. 

Performing Network Analyses

After the CNHMinerRunner routine is complete, a series of post-processing steps can be undertaken within the sql
database to arrive at the
network of interrelated water management authorities. The basic sequence is:

         call baseline();

This re-sets all analyses and clears all analysis tables.

         call water_authorities('WATER', 0, 2, 10, 100, 0);

This repeats all analyses, using the following arguments:

     Arg1: The keyword category to be used. For all analyses in published work we have used the general 'WATER'
category, but others (e.g. 'IRRIGATION' can also be used. Perform a SELECT DISTINCT WaterCategory FROM
water_category_tokens; to see the list of categories that appear in
your data set.

     Arg2: Uses a 'topicality' score instead of TF to determine document's relation to water. This should not be used
unless the definition of
topicality is understood and it is what is needed. A zero means use TF; any positive value means use topicality (1 is

     Arg3: The minimum number of documents in which a term must appear in order to be considered valid. Many terms are
the products of spurious
parsing and are actually not valid terms; however, these occur because of unique circumstances, and thus tend to
happen only once in the corpus.
By requiring that terms appear in at least two documents, these errors are omitted from further analysis. However,
experience has shown that for our
corpuses (on the order of 10-40K documents), requiring terms to appear in 3 documents eliminates a number of genuine
terms. Hence 2 is the
optimum for our data set; larger data sets may be different.

     Arg4: The number of top-water-related (TF) documents that should be averaged together to determine the weight of
a given term. In general
the relatedness of a given candidate water authority to water is determined by whether it appears in water-related
documents. It is tempting,
then, to simply average the water-relatedness of all the documents in which that entity exists to determine the
entity's score. However, an
entity such as a local government institution (e.g. City Council) might be highly related to water (it is involved in
many water-related decisions)
and also highly related to many other areas alongside water, and thus appear both in water-related and
non-water-related documents. To allow
for this, the routine averages the top N TF scores of documents in which the term appears. (If the term appears in
fewer than N documents, all of them are averaged.)

     Arg5: The number of nodes to include in the network. This is set manually, but if there are fewer valid nodes,
all of them are included.

     Arg6: Include a water keyword. If set (any positive number), only entities that have a non-zero value in the
'includesWaterKeyword' field in candidate_water_authorities table will be included in the analysis. If false, all
entities will be included.

Several supporting routines are also provided:

     call synonymize("Grand Valley Irrigation", "Grand Valley Irrigation Company");

This considers all instances of the first argument as if they were actually instances of the second. Calling the
'baseline()' routine re-sets all synonyms. It is also possible to re-set a single instance using:

     call desynonymize("Grand Valley Irrigation");

To leave the rest of the analyses intact but remove all synonyms, use:

     call clear_synonymy();

You can stipulate that a candidate water authority be left out of the analysis by using:

     call ignoreCandidate("Grand Valley Irrigation");

As with synonymy, calling baseline() re-sets this so that all candidates are included. This can be restored using:

     call includeCandidate("Grand Valley Irrigation");


     call includeAllCandidates();

You can see a list of the synonymized candidates using:

     call showSynonyms();

And of all ignored candidates with:

     call showIgnored();

You can manually specify that a term has a water keyword (even if it doesn't) using:

     call specifyHasWaterKeyword("HydroCo Inc.");

You can also specify that a candidate authority does not have a water keyword (even if it does) using:

     call specifyHasNoWaterKeyword("Grand Valley Irrigation");

You can clear all of the manually specified water keyword flags using:

     call clearSpecifiedWaterKeywords();