ElasticSearch

We're using a hosted instance of ElasticSearch, with an interactive UI here: http://e8deb49618b89a11489dab0b0e23711c-us-east-1.foundcluster.com:9200/_plugin/head/

JSON REST via GET

Some client technologies struggle to pass a JSON body in a GET request. curl seems to manage it OK, but my Eclipse REST plugin won't do it.

The workaround is to pass it in as a source parameter. The problem is discussed at: http://stackoverflow.com/questions/7046166/search-elasticsearch-via-get-using-json

With the reference for the solution at the end of: http://www.elasticsearch.org/guide/reference/api/

CouchDb River

Use this command to configure ES with our CouchDb river. Note: data is the name of the index we're creating. We'll query against it later.

curl -d @create_river.js -XPUT http://localhost:9200/_river/data/_meta

And here are the contents of that configuration file:

{
    "type" : "couchdb",
    "couchdb" : {
        "host" : "localhost",
        "port" : 5984,
        "db" : "tracks",
        "filter" : null,
        "ignore_attachments" : true,
        "view" : "tracks/_view/metadata"
    },
    "index" : {
        "index" : "data",
        "type" : "dataset",
        "bulk_size" : "100",
        "bulk_timeout" : "10ms"
    }
}

Note that we're using the "view" attribute. That's still not a core part of ES. Its development home is at: https://github.com/elasticsearch/elasticsearch-river-couchdb/pull/2

Found.No river

So, here's the config we'll use for the Found river:

{
    "type" : "couchdb",
    "couchdb" : {
        "host" : "gnd.iriscouch.com",
        "port" : 5984,
        "db" : "tracks",
        "filter" : null
    },
    "index" : {
        "index" : "data",
        "type" : "dataset",
        "bulk_size" : "100",
        "bulk_timeout" : "10ms"
    }
}

We'll plug it into ES via:

curl -d @found_river.js -XPUT http://e8deb49618b89a11489dab0b0e23711c-us-east-1.foundcluster.com:9200/_river/data/_meta
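
For clients that can't shell out to curl, the same registration could be sketched in Python using only the standard library. This is a minimal sketch, not something we have run against Found: the helper name build_river_request and the localhost base URL are assumptions made for illustration, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_river_request(es_base_url, river_name, config):
    """Build (but do not send) the PUT that registers a river's _meta document.

    Hypothetical helper: it mirrors `curl -d @file -XPUT .../_river/<name>/_meta`.
    """
    url = f"{es_base_url.rstrip('/')}/_river/{river_name}/_meta"
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Same settings as the Found river config above.
found_river = {
    "type": "couchdb",
    "couchdb": {
        "host": "gnd.iriscouch.com",
        "port": 5984,
        "db": "tracks",
        "filter": None,
    },
    "index": {
        "index": "data",
        "type": "dataset",
        "bulk_size": "100",
        "bulk_timeout": "10ms",
    },
}

# The base URL is a placeholder; substitute the real cluster address.
request = build_river_request("http://localhost:9200", "data", found_river)
# To actually register the river: urllib.request.urlopen(request)
```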

Mapping in ES

Here's the mapping file that Devendra produced. We need it in order to direct ES in how to parse the incoming documents.

{
  "dataset" : {
    "dynamic" : "false",
    "_source" : {
      "enabled" : false
    },
    "properties" : {
      "metadata" : {
        "properties" : {
          "data_type" : {
            "type" : "string",
            "store" : "yes"
          },
          "name" : {
            "type" : "string",
            "store" : "yes"
          },
          "platform" : {
            "type" : "string",
            "index" : "not_analyzed",
            "store" : "yes"
          },
          "platform_type" : {
            "type" : "string",
            "index" : "not_analyzed",
            "store" : "yes"
          },
          "sensor" : {
            "type" : "string",
            "index" : "not_analyzed",
            "store" : "yes"
          },
          "sensor_type" : {
            "type" : "string",
            "index" : "not_analyzed",
            "store" : "yes"
          },
          "time_bounds" : {
            "properties" : {
              "end" : {
                "type" : "date",
                "store" : "yes",
                "format" : "dateOptionalTime"
              },
              "start" : {
                "type" : "date",
                "store" : "yes",
                "format" : "dateOptionalTime"
              }
            }
          },
          "trial" : {
            "type" : "string",
            "index" : "not_analyzed",
            "store" : "yes"
          },
          "type" : {
            "type" : "string",
            "store" : "yes"
          }
        }
      }
    }
  }
}

ElasticSearch DSL

ElasticSearch comes with a Query DSL. This section contains findings about how to do/structure search queries.

A simple search can be expressed as a URI. Use this one to search for all documents for the platform LOYGA: http://e8deb49618b89a11489dab0b0e23711c-us-east-1.foundcluster.com:9200/data/dataset/_search?q=platform:LOYGA

The volume of data returned can be filtered using the fields parameter: http://e8deb49618b89a11489dab0b0e23711c-us-east-1.foundcluster.com:9200/data/dataset/_search?q=platform:LOYGA&fields=_id,metadata

The output can be made more human-readable by adding the pretty attribute: http://e8deb49618b89a11489dab0b0e23711c-us-east-1.foundcluster.com:9200/data/dataset/_search?q=platform:LOYGA&fields=_id,metadata&pretty=true

The previous examples have all used the URI mode of search, where a specific API is provided to support search. That API doesn't cover the whole of the ES API, however, so the rest is made accessible by allowing a block of JSON to be specified in the source parameter:

http://e8deb49618b89a11489dab0b0e23711c-us-east-1.foundcluster.com:9200/_search?source={"query":{"match_all":{}}}

The source parameter is necessary when the client app is only able to use GET requests (such as via JSONP), since many clients can't pass attachments in a GET request.

Note, when using the above URI via CURL, you need to provide the -g parameter to prevent CURL from falling over when globbing the request: curl -XGET -g 'http://e8deb49618b89a11489dab0b0e23711c-us-east-1.foundcluster.com:9200/_search?source={"query":{"match_all":{}}}'
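
A GET-only client can build that URL programmatically. The sketch below is one way to do it in Python; the function name search_url and the localhost base URL are assumptions for illustration. Because the JSON is percent-encoded, the braces never appear literally in the URL, so curl's globbing should not be triggered and -g should not be needed for URLs built this way.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def search_url(es_base_url, query, path="_search"):
    """Return a GET URL that carries the JSON query in the `source` parameter."""
    # Compact separators keep the URL as short as possible.
    source = json.dumps(query, separators=(",", ":"))
    return f"{es_base_url.rstrip('/')}/{path}?{urlencode({'source': source})}"


# Placeholder base URL; substitute the real cluster address.
url = search_url("http://localhost:9200", {"query": {"match_all": {}}})

# Decoding the parameter again gives back the original query.
decoded = json.loads(parse_qs(urlsplit(url).query)["source"][0])
```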

We may wish to search where there are multiple acceptable values:

{
	"size" : 60,
	"query" : 
	{
		"bool" : 
		{
			"should" : [
				{ "term" : { "platform" : "plat_b" } } ,		            
				{ "term" : { "platform" : "plat_c" } }	            
		    ],
		    "minimum_number_should_match" : 1
		}
	}
}

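Note: in experiments, the above returned the matches for plat_b, followed by the matches for plat_c. A "must" clause can be added alongside the "should" list to restrict the results further, here to a single platform type: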

{
	"size" : 6,
	"query" : 
	{
		"bool" : 
		{
			"must" :
			{
				"term" : { "platformType" : "van" }
			},
			"should" : [
				{ "term" : { "platform" : "plat_b" } } ,		            
				{ "term" : { "platform" : "plat_c" } }	            
		    ],
		    "minimum_number_should_match" : 1
		}
	}
}

Specifying fields

It is possible to indicate to ES which metadata fields you wish to be displayed in the search results. Here's an example of a search for a particular platform, indicating that sensor and platform data should be in the results:

{
 "fields" : ["sensor","platform"],
 "query": {
  "bool": {
  "must": [
  {
  "term": {
  "platform": "WIGHT SUN"
  }
  }
  ]
  }
 }
}

Multiple fields

Here's an example of how to search for a single word across multiple fields:

{"fields":["start","platform","name"],"query":	
{
    "dis_max" : {
        "queries" : [
            {
                "text" : { "platform" : "AASLI" }
            },
            {
                "text" : { "name" : "AASLI" }
            }
        ]
    }
}    }

Note that we're using "text" for both fields. Previously we were searching for "term" in the platform field and "text" in the name field, because platform is analysed as a set of terms. But there doesn't seem to be a performance cost in using text for both.

Note that we're using the dis_max construct: http://www.elasticsearch.org/guide/reference/query-dsl/dis-max-query.html

This can incur some performance penalties. Should performance doing a 'match any of these fields' query become a problem, then we should probably let ES revert to keeping the _all field enabled.
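
When the list of fields comes from the UI, the dis_max payload can be generated rather than hand-written. The following is a minimal sketch in Python; the function name any_field_query is an assumption, and it simply reproduces the shape of the hand-written "text" example above.

```python
def any_field_query(word, search_fields, result_fields=None):
    """Build a dis_max query that looks for `word` in each of `search_fields`."""
    body = {
        "query": {
            "dis_max": {
                # One "text" sub-query per field, as in the hand-written example.
                "queries": [{"text": {field: word}} for field in search_fields]
            }
        }
    }
    if result_fields:
        # Optionally restrict which fields come back in the hits.
        body["fields"] = list(result_fields)
    return body


payload = any_field_query(
    "AASLI", ["platform", "name"], result_fields=["start", "platform", "name"]
)
```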

User-formatted search strings

The field attribute lets you search for a query built up from parsing free text:

{
	"size" : 6,
	"query" : 
	{
	    "field" : { 
	        "platform" : "plat_a OR plat_b"
	    }
	}
}

This will return platforms called plat_a or plat_b. It will also accept a wildcard like plat_*

Simpler field searches

The query_string query lets you fire search strings against multiple categories. For example, this will find cars called plat_a:

{
        "size" : 60,
        "query" :
        {
                "query_string" :
                {
                    "query" : "plat_a AND car"
                }
        }
}

Specific attribute search

I guess in real-life we'll mostly be using the BOOL construct to build up a range of query blocks. As we loop through each multi-select list, if it has any items selected we'll add that attribute to the "must" part of the query:

{
        "size" : 60,
        "query" : 
        {
        	"bool" : 
        		{
        			"must" : 
        				[
	        				{
		                	    "terms" :
		                	    {
		                	        "platformType" : [ "car", "van" ],
		                	        "minimum_match" : 1
		                	    }
	        				},
	        				{
		                	    "terms" :
		                	    {
		                	        "platform" : [ "plat_a" ],
		                	        "minimum_match" : 1
		                	    }
	        				}
	        				,
	        				{
		                	    "terms" :
		                	    {
		                	        "sensorType" : [ "speed", "temp" ],
		                	        "minimum_match" : 1
		                	    }
	        				}
        				]
        		}
        }
}
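
The loop over the multi-select lists could be sketched as follows in Python. This is an illustration under assumptions: the function name build_attribute_query is hypothetical, the UI state is assumed to arrive as a dict of field name to selected values, and falling back to match_all when nothing is selected is our choice here, not something ES requires.

```python
def build_attribute_query(selections, size=60):
    """Turn {field: [selected values]} into a bool/must query of terms blocks."""
    must = [
        {"terms": {field: list(values), "minimum_match": 1}}
        for field, values in selections.items()
        if values  # lists with nothing selected add no constraint
    ]
    if not must:
        # Assumption: with no selections at all, return every document.
        return {"size": size, "query": {"match_all": {}}}
    return {"size": size, "query": {"bool": {"must": must}}}


query = build_attribute_query(
    {
        "platformType": ["car", "van"],
        "platform": ["plat_a"],
        "sensorType": [],  # nothing selected, so this list is skipped
    }
)
```

Each non-empty list becomes one "terms" block, so a document has to match at least one selected value in every list that has a selection.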

Date tests

It's a common requirement to find all datasets within a time period (although there's a strong chance that filtering by trial will also meet the requirement).

There is a range of ways of seeing if dates overlap: date overlaps

The right-hand side of the table shows which time relationships represent an overlap.

The fast form of testing if period1 overlaps with period2 is:

if((period1.start < period2.end) && (period1.end > period2.start))

Such a test is illustrated in:

{
        "query" : 
        {
        	"bool" : 
        		{
        			"must" : 
        				[
	        				{
	        				    "range" : {
	        				        "start" : { 
	        				            "lte" : "2012-03-14T11:40:00+0000" 
	        				        }
	        				    }
	        				},
	        				{
	        				    "range" : {
	        				        "end" : { 
	        				            "gte" : "2012-03-14T00:00:00+0000" 
	        				        }
	        				    }
	        				}
        				]
        		}
        }
}
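
The overlap test and the matching query can be sketched together in Python. The function names are assumptions for illustration. Note one difference carried over from the text above: the pseudo-code test is strict (< and >), whereas the JSON query uses lte and gte, so a dataset that merely touches the edge of the window is also returned by the query.

```python
from datetime import datetime


def periods_overlap(start1, end1, start2, end2):
    """Strict form of the fast overlap test from the pseudo-code above."""
    return start1 < end2 and end1 > start2


def overlap_query(window_start, window_end, start_field="start", end_field="end"):
    """Build the range query for datasets overlapping the given window.

    Uses lte/gte like the JSON example, so touching endpoints count as a match.
    """
    return {
        "query": {
            "bool": {
                "must": [
                    {"range": {start_field: {"lte": window_end}}},
                    {"range": {end_field: {"gte": window_start}}},
                ]
            }
        }
    }


morning = (datetime(2012, 3, 14, 0, 0), datetime(2012, 3, 14, 11, 40))
run = (datetime(2012, 3, 14, 9, 0), datetime(2012, 3, 14, 13, 0))
overlapping = periods_overlap(*morning, *run)

query = overlap_query("2012-03-14T00:00:00+0000", "2012-03-14T11:40:00+0000")
```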

Here's an example of a range search:

{
  "fields": [
    "start",
    "end",
    "platform"
  ],
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "start": {
              "from": "2012-02-07T07:00:00",
              "to": "2012-02-07T08:00:00"
            }
          }
        }
      ]
    }
  }
}

And here are the results:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "data",
      "_type" : "dataset",
      "_id" : "0d9500c05ef3a901397b5d7dda12b89f",
      "_score" : 1.0,
      "fields" : {
        "metadata.platform" : "HYUNDAI FORCE",
        "metadata.sensor" : ""
      }
    }, {
      "_index" : "data",
      "_type" : "dataset",
      "_id" : "0d9500c05ef3a901397b5d7dda12b165",
      "_score" : 1.0,
      "fields" : {
        "metadata.platform" : "236064263",
        "metadata.sensor" : ""
      }
    } ]
  }
}

Faceted Search

Here are some UI suggestions for faceted search: http://webusability-blog.com/faceted-search-4-design-tips/

Finding all available tags

Here is how to request all of the tags used in the specified field types:

{
  "size": 0,
  "query" : { "match_all" : {} },
  "facets": {
    "platform_type": {
      "terms": {
        "field": "platform_type",
        "size": 10000
      }
    },
    "platform": {
      "terms": {
        "field": "platform",
        "size": 10000
      }
    },
    "sensor": {
      "terms": {
        "field": "sensor",
        "size": 10000
      }
    },
    "sensor_type": {
      "terms": {
        "field": "sensor_type",
        "size": 10000
      }
    },
    "trial": {
      "terms": {
        "field": "trial",
        "size": 10000
      }
    },
    "type": {
      "terms": {
        "field": "type",
        "size": 10000
      }
    },
    "data_type": {
      "terms": {
        "field": "metadata.data_type",
        "size": 1000
      }
    }
  }
}

Note in the above how we retrieve all of the available data-types for the specified query (all documents in this case, but it also works for filtered queries).

Here's a live example of a 'show all tags' search - it just returns a list of the platform names: curl -XGET http://e8deb49618b89a11489dab0b0e23711c-us-east-1.foundcluster.com:9200/data/dataset/_search?pretty=true -d '{"size":0,"facets":{"platform":{"terms":{"field":"platform","size":1000}}}}'

Note: from JSONP it isn't possible to provide a payload. The workaround in JSON would be to POST the facets query. But, in JSONP we can only GET. So, this is resolved by using a GET and putting the payload into the source attribute as shown higher on this page.

If the query block is included in the payload, only tags matching the results in the search are shown.
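
To populate the multi-select lists, the facets payload can be generated and the tag names pulled out of the response. The sketch below is in Python; both function names are assumptions, and the response shape used in the example (a "facets" object whose entries each hold a "terms" list of term/count pairs) is what we expect from terms facets rather than something captured from our cluster.

```python
def facet_request(fields, size=10000, query=None):
    """Build a facets-only search body; `fields` maps facet name -> field name."""
    body = {
        "size": 0,  # no hits wanted, only the facet counts
        "facets": {
            name: {"terms": {"field": field, "size": size}}
            for name, field in fields.items()
        },
    }
    if query is not None:
        # With a query present, only tags from matching documents are counted.
        body["query"] = query
    return body


def facet_terms(response):
    """Extract {facet name: [term, ...]} from a search response (assumed shape)."""
    return {
        name: [entry["term"] for entry in facet.get("terms", [])]
        for name, facet in response.get("facets", {}).items()
    }


body = facet_request({"platform": "platform", "data_type": "metadata.data_type"})

# Hand-written stand-in for a response, not real output from the cluster.
sample_response = {
    "facets": {
        "platform": {
            "terms": [
                {"term": "plat_a", "count": 3},
                {"term": "plat_b", "count": 1},
            ]
        }
    }
}
tags = facet_terms(sample_response)
```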

Recommended Backend Structure

Since the number of indexed documents is not expected to grow at a fast rate, there's really not a lot to worry about: ES can handle millions of documents without issue with its default settings.

The biggest thing that would need to be considered would be durability. The Data index has 5 shards and 1 replica and is currently in "yellow" state because the replica is not allocated. You would need to add a 2nd machine to hold the replica to put the cluster in "green" state. That's probably not needed until you get closer to production. In the meantime, you can set replicas to 0 if you want "green" cluster state.

One other thing to possibly consider is that you could reduce the number of shards to 1 or 2. If you're not using routing, then ES will search all 5 shards in order to return search results. Having only 1 or 2 shards will make ES searches return faster, however, for the small number of documents, it probably makes no discernible difference.
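
Setting replicas to 0 goes through the index settings endpoint. The following Python sketch builds that request without sending it; the helper name replica_settings_request and the localhost base URL are assumptions. The replica count can be changed on a live index this way, whereas the shard count, as far as we know, is fixed when the index is created, so reducing it would mean recreating the index.

```python
import json
import urllib.request


def replica_settings_request(es_base_url, index, replicas):
    """Build (but do not send) a PUT to /<index>/_settings changing the replica count."""
    url = f"{es_base_url.rstrip('/')}/{index}/_settings"
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Placeholder base URL; substitute the real cluster address.
settings_request = replica_settings_request("http://localhost:9200", "data", 0)
# To apply the change: urllib.request.urlopen(settings_request)
```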
