ElasticSearch
We're using a hosted instance of ElasticSearch, with an interactive UI here: http://e8deb49618b89a11489dab0b0e23711c-us-east-1.foundcluster.com:9200/_plugin/head/
JSON REST via GET
Some client technologies struggle to pass a JSON body in a GET request. curl manages it fine, but my Eclipse REST plugin won't do it.
The workaround is to pass it in as a source parameter. The problem is discussed at: http://stackoverflow.com/questions/7046166/search-elasticsearch-via-get-using-json
The solution is referenced at the end of: http://www.elasticsearch.org/guide/reference/api/
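The workaround can be sketched in Python as below. This is only an illustration: BASE_URL and the function name search_url_with_source are made up for the example, and the only thing taken from the page is the idea of carrying the JSON in a source parameter.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical address; substitute the real cluster URL.
BASE_URL = "http://localhost:9200"


def search_url_with_source(query, path="_search"):
    """Return a GET URL that carries the JSON query in the `source` parameter."""
    # Compact separators keep the URL short; urlencode percent-escapes the JSON.
    payload = json.dumps(query, separators=(",", ":"))
    return f"{BASE_URL}/{path}?{urlencode({'source': payload})}"


url = search_url_with_source({"query": {"match_all": {}}})
print(url)
```

Because the braces and quotes end up percent-encoded, a URL built this way can be handed to any GET-only client.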
CouchDb River
Use this command to configure ES with our CouchDb river. Note: data is the name of the index we're creating. We'll query against it later.
curl -d @create_river.js -XPUT http://localhost:9200/_river/data/_meta
And here are the contents of that configuration file:
{
"type" : "couchdb",
"couchdb" : {
"host" : "localhost",
"port" : 5984,
"db" : "tracks",
"filter" : null,
"ignore_attachments" : true,
"view" : "tracks/_view/metadata"
},
"index" : {
"index" : "data",
"type" : "dataset",
"bulk_size" : "100",
"bulk_timeout" : "10ms"
}
}
Note that we're using the "view" attribute. That's still not a core part of ES. Its development home is at: https://github.com/elasticsearch/elasticsearch-river-couchdb/pull/2
Found.no river
So, here's the config we'll use for the Found river:
{
"type" : "couchdb",
"couchdb" : {
"host" : "gnd.iriscouch.com",
"port" : 5984,
"db" : "tracks",
"filter" : null
},
"index" : {
"index" : "data",
"type" : "dataset",
"bulk_size" : "100",
"bulk_timeout" : "10ms"
}
}
We'll plug it into ES via:
curl -d @found_river.js -XPUT http://e8deb49618b89a11489dab0b0e23711c-us-east-1.foundcluster.com:9200/_river/data/_meta
Mapping in ES
Here's the mapping file that Devendra produced. We need it in order to direct ES in how to parse the incoming documents.
{
"dataset" : {
"dynamic" : "false",
"_source" : {
"enabled" : false
},
"properties" : {
"metadata" : {
"properties" : {
"data_type" : {
"type" : "string",
"store" : "yes"
},
"name" : {
"type" : "string",
"store" : "yes"
},
"platform" : {
"type" : "string",
"index" : "not_analyzed",
"store" : "yes"
},
"platform_type" : {
"type" : "string",
"index" : "not_analyzed",
"store" : "yes"
},
"sensor" : {
"type" : "string",
"index" : "not_analyzed",
"store" : "yes"
},
"sensor_type" : {
"type" : "string",
"index" : "not_analyzed",
"store" : "yes"
},
"time_bounds" : {
"properties" : {
"end" : {
"type" : "date",
"store" : "yes",
"format" : "dateOptionalTime"
},
"start" : {
"type" : "date",
"store" : "yes",
"format" : "dateOptionalTime"
}
}
},
"trial" : {
"type" : "string",
"index" : "not_analyzed",
"store" : "yes"
},
"type" : {
"type" : "string",
"store" : "yes"
}
}
}
}
}
}
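To see at a glance which fields the mapping declares, and which of them are not_analyzed (and so hold each value as a single exact token), a small walker like the following may help. It is a sketch only: leaf_fields is a hypothetical helper, and the mapping literal is a trimmed copy of the one above.

```python
def leaf_fields(properties, prefix=""):
    """Yield (dotted_path, definition) for every leaf field in a mapping."""
    for name, definition in properties.items():
        path = f"{prefix}{name}"
        if "properties" in definition:
            # Object field: recurse into its children.
            yield from leaf_fields(definition["properties"], prefix=f"{path}.")
        else:
            yield path, definition


# Trimmed copy of the mapping above, just enough to show the shape.
mapping = {
    "dataset": {
        "properties": {
            "metadata": {
                "properties": {
                    "name": {"type": "string", "store": "yes"},
                    "platform": {
                        "type": "string",
                        "index": "not_analyzed",
                        "store": "yes",
                    },
                    "time_bounds": {
                        "properties": {
                            "start": {
                                "type": "date",
                                "store": "yes",
                                "format": "dateOptionalTime",
                            }
                        }
                    },
                }
            }
        }
    }
}

fields = dict(leaf_fields(mapping["dataset"]["properties"]))
exact_match = sorted(p for p, d in fields.items() if d.get("index") == "not_analyzed")
print(sorted(fields))
print(exact_match)
```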
ElasticSearch DSL
ElasticSearch comes with a Query DSL. This section contains findings about how to do/structure search queries.
URI Search
It's possible to express a simple search using a URI. Use this to search for all documents for the platform LOYGA: http://e8deb49618b89a11489dab0b0e23711c-us-east-1.foundcluster.com:9200/data/dataset/_search?q=platform:LOYGA
The volume of data returned can be filtered using the fields parameter: http://e8deb49618b89a11489dab0b0e23711c-us-east-1.foundcluster.com:9200/data/dataset/_search?q=platform:LOYGA&fields=_id,metadata
The output can be made more human-readable by adding the pretty attribute: http://e8deb49618b89a11489dab0b0e23711c-us-east-1.foundcluster.com:9200/data/dataset/_search?q=platform:LOYGA&fields=_id,metadata&pretty=true
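The three URI examples above differ only in their query parameters, so they can be assembled by one helper. This is a minimal sketch: SEARCH_URL and uri_search are names invented for the example, and the host is a placeholder.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint; substitute the real cluster URL.
SEARCH_URL = "http://localhost:9200/data/dataset/_search"


def uri_search(field, value, fields=None, pretty=False):
    """Build a URI-mode search URL using the q, fields and pretty parameters."""
    params = {"q": f"{field}:{value}"}
    if fields:
        # Limit the volume of data returned to the named fields.
        params["fields"] = ",".join(fields)
    if pretty:
        params["pretty"] = "true"
    return f"{SEARCH_URL}?{urlencode(params)}"


url = uri_search("platform", "LOYGA", fields=["_id", "metadata"], pretty=True)
print(url)
```

urlencode percent-escapes the colon and comma, which the server decodes back to the same values.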
The previous examples have all used the URI mode of search, where a dedicated set of URL parameters supports simple searches. That mode doesn't cover the whole of the ES search API, however, so the rest of it is reached by passing a block of JSON in the source parameter.
The source parameter is necessary when the client app is only able to use GET requests (such as via JSONP), since many clients can't pass attachments in a GET request.
Note: when passing a URI with a source parameter to curl, you need to provide the -g option to stop curl from falling over when it tries to glob the braces in the request: curl -XGET -g 'http://e8deb49618b89a11489dab0b0e23711c-us-east-1.foundcluster.com:9200/_search?source={"query":{"match_all":{}}}'
Composite search
We may wish to search where there are multiple acceptable values:
{
"size" : 60,
"query" :
{
"bool" :
{
"should" : [
{ "term" : { "platform" : "plat_b" } } ,
{ "term" : { "platform" : "plat_c" } }
],
"minimum_number_should_match" : 1
}
}
}
Note: in experiments, the above returns the matches for plat_b first, then the matches for plat_c. The should clauses can also be combined with a must clause:
{
"size" : 6,
"query" :
{
"bool" :
{
"must" :
{
"term" : { "platformType" : "van" }
},
"should" : [
{ "term" : { "platform" : "plat_b" } } ,
{ "term" : { "platform" : "plat_c" } }
],
"minimum_number_should_match" : 1
}
}
}
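Both composite queries above follow the same pattern, so they could be generated from a list of acceptable values. The helper below is a sketch; any_of and its parameters are hypothetical names, and it simply reproduces the JSON shown above.

```python
def any_of(field, values, must=None, size=60):
    """Build a bool query that matches documents whose field equals any value."""
    bool_query = {
        "should": [{"term": {field: value}} for value in values],
        "minimum_number_should_match": 1,
    }
    if must is not None:
        # Optional clause that every result has to satisfy as well.
        bool_query["must"] = must
    return {"size": size, "query": {"bool": bool_query}}


first = any_of("platform", ["plat_b", "plat_c"])
second = any_of(
    "platform",
    ["plat_b", "plat_c"],
    must={"term": {"platformType": "van"}},
    size=6,
)
print(first)
print(second)
```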
Specifying fields
It is possible to indicate to ES which metadata fields you wish to be displayed in the search results. Here's an example of a search for a particular platform, indicating that sensor and platform data should be in the results:
{
"fields" : ["sensor","platform"],
"query": {
"bool": {
"must": [
{
"term": {
"platform": "WIGHT SUN"
}
}
]
}
}
}
Multiple fields
Here's an example of how to search for a single word across multiple fields:
{"fields":["start","platform","name"],"query":
{
"dis_max" : {
"queries" : [
{
"text" : { "platform" : "AASLI" }
},
{
"text" : { "name" : "AASLI" }
}
]
}
} }
Note that we're using "text" for both fields. Previously we were searching for "term" in the platform field, and "text" in the name field. This was because platform is analysed as a set of terms. But there doesn't seem to be a performance cost in using text for both.
Note that we're using the dis_max construct: http://www.elasticsearch.org/guide/reference/query-dsl/dis-max-query.html
This can incur some performance penalties. Should performance doing a 'match any of these fields' query become a problem, then we should probably let ES revert to keeping the _all field enabled.
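A dis_max payload like the one above can be generated for any word and list of fields. This is a sketch under the assumption that every field is searched with the same "text" clause, as in the example; match_any_field is a hypothetical helper name.

```python
def match_any_field(word, search_fields, result_fields):
    """Build a dis_max query that looks for one word in several fields."""
    return {
        "fields": list(result_fields),
        "query": {
            "dis_max": {
                # One "text" clause per field, all searching for the same word.
                "queries": [{"text": {field: word}} for field in search_fields]
            }
        },
    }


payload = match_any_field(
    "AASLI",
    search_fields=["platform", "name"],
    result_fields=["start", "platform", "name"],
)
print(payload)
```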
User-formatted search strings
The field attribute lets you search for a query built up from parsing free text:
{
"size" : 6,
"query" :
{
"field" : {
"platform" : "plat_a OR plat_b"
}
}
}
This will return platforms called plat_a or plat_b. It will also accept a wildcard like plat_*
Simpler field searches
The query_string query lets you fire search strings against multiple categories. For example, this will find cars called plat_a:
{
"size" : 60,
"query" :
{
"query_string" :
{
"query" : "plat_a AND car"
}
}
}
Specific attribute search
I guess in real-life we'll mostly be using the BOOL construct to build up a range of query blocks. As we loop through each multi-select list, if it has any items selected we'll add that attribute to the "must" part of the query:
{
"size" : 60,
"query" :
{
"bool" :
{
"must" :
[
{
"terms" :
{
"platformType" : [ "car", "van" ],
"minimum_match" : 1
}
},
{
"terms" :
{
"platform" : [ "plat_a" ],
"minimum_match" : 1
}
}
,
{
"terms" :
{
"sensorType" : [ "speed", "temp" ],
"minimum_match" : 1
}
}
]
}
}
}
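The loop described above might look like the following. It is a sketch: build_attribute_query and the selections dictionary are hypothetical, and falling back to match_all when nothing is selected is an assumption rather than something the page specifies.

```python
def build_attribute_query(selections, size=60):
    """Turn {attribute: [selected values]} into a bool/must terms query."""
    must = [
        {"terms": {attribute: list(values), "minimum_match": 1}}
        for attribute, values in selections.items()
        if values  # skip multi-select lists with nothing selected
    ]
    if not must:
        # Assumption: with no selections at all, return every document.
        return {"size": size, "query": {"match_all": {}}}
    return {"size": size, "query": {"bool": {"must": must}}}


query = build_attribute_query(
    {
        "platformType": ["car", "van"],
        "platform": ["plat_a"],
        "sensorType": [],  # nothing selected, so no clause is added
    }
)
print(query)
```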
Date tests
It's a common requirement to find all datasets within a time period (although there's a strong chance that filtering by trial will also meet the requirement).
There are a range of ways of seeing if dates overlap:
The right-hand side of the table shows which time relationships represent an overlap.
The fast form of testing if period1 overlaps with period2 is:
if((period1.start < period2.end) && (period1.end > period2.start))
Such a test is illustrated in:
{
"query" :
{
"bool" :
{
"must" :
[
{
"range" : {
"start" : {
"lte" : "2012-03-14T11:40:00+0000"
}
}
},
{
"range" : {
"end" : {
"gte" : "2012-03-14T00:00:00+0000"
}
}
}
]
}
}
}
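The overlap test and the matching query can be sketched together in Python. The function names are hypothetical, and the timestamp format assumes naive datetimes that are already in UTC.

```python
from datetime import datetime


def overlaps(start1, end1, start2, end2):
    """Fast overlap test: each period starts before the other one ends."""
    return start1 < end2 and end1 > start2


def overlap_query(period_start, period_end):
    """Build the range query for datasets overlapping the given period."""
    # Assumption: naive datetimes in UTC, hence the literal +0000 suffix.
    fmt = "%Y-%m-%dT%H:%M:%S+0000"
    return {
        "query": {
            "bool": {
                "must": [
                    {"range": {"start": {"lte": period_end.strftime(fmt)}}},
                    {"range": {"end": {"gte": period_start.strftime(fmt)}}},
                ]
            }
        }
    }


window = (datetime(2012, 3, 14, 0, 0), datetime(2012, 3, 14, 11, 40))
query = overlap_query(*window)
print(query)
print(overlaps(datetime(2012, 3, 14, 9), datetime(2012, 3, 14, 10), *window))
```

Note that the query uses the inclusive lte and gte bounds, so a dataset that merely touches the edge of the period is returned, whereas the strict comparison in overlaps would reject it.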
Here's an example of a range search:
{
"fields": [
"start",
"end",
"platform"
],
"query": {
"bool": {
"must": [
{
"range": {
"start": {
"from": "2012-02-07T07:00:00",
"to": "2012-02-07T08:00:00"
}
}
}
]
}
}
}
And here are the results:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 1.0,
"hits" : [ {
"_index" : "data",
"_type" : "dataset",
"_id" : "0d9500c05ef3a901397b5d7dda12b89f",
"_score" : 1.0,
"fields" : {
"metadata.platform" : "HYUNDAI FORCE",
"metadata.sensor" : ""
}
}, {
"_index" : "data",
"_type" : "dataset",
"_id" : "0d9500c05ef3a901397b5d7dda12b165",
"_score" : 1.0,
"fields" : {
"metadata.platform" : "236064263",
"metadata.sensor" : ""
}
} ]
}
}
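A client will usually only want the document id and the returned fields from a response of this shape. The sketch below flattens the hits; summarise_hits is a hypothetical helper, and the sample response is a trimmed copy of the one above.

```python
def summarise_hits(response):
    """Return one flat dict per hit: the document id plus its returned fields."""
    return [
        {"_id": hit["_id"], **hit.get("fields", {})}
        for hit in response["hits"]["hits"]
    ]


# Trimmed copy of the response shown above.
sample = {
    "hits": {
        "total": 2,
        "hits": [
            {
                "_id": "0d9500c05ef3a901397b5d7dda12b89f",
                "fields": {"metadata.platform": "HYUNDAI FORCE", "metadata.sensor": ""},
            },
            {
                "_id": "0d9500c05ef3a901397b5d7dda12b165",
                "fields": {"metadata.platform": "236064263", "metadata.sensor": ""},
            },
        ],
    }
}

rows = summarise_hits(sample)
for row in rows:
    print(row["_id"], row["metadata.platform"])
```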
Faceted Search
Here are some UI suggestions for faceted search: http://webusability-blog.com/faceted-search-4-design-tips/
Finding all available tags
Here is how to request all of the tags used in the specified field types:
{
"size": 0,
"query" : { "match_all" : {} },
"facets": {
"platform_type": {
"terms": {
"field": "platform_type",
"size": 10000
}
},
"platform": {
"terms": {
"field": "platform",
"size": 10000
}
},
"sensor": {
"terms": {
"field": "sensor",
"size": 10000
}
},
"sensor_type": {
"terms": {
"field": "sensor_type",
"size": 10000
}
},
"trial": {
"terms": {
"field": "trial",
"size": 10000
}
},
"type": {
"terms": {
"field": "type",
"size": 10000
}
},
"data_type": {
"terms": {
"field": "metadata.data_type",
"size": 1000
}
}
}
}
Note in the above how we retrieve all of the available data-types for the specified query (all documents in this case, but it also works for filtered queries).
Here's a live example of a 'show all tags' search. It just returns a list of the platform names: curl -XGET 'http://e8deb49618b89a11489dab0b0e23711c-us-east-1.foundcluster.com:9200/data/dataset/_search?pretty=true' -d '{"size":0,"facets":{"platform":{"terms":{"field":"platform","size":1000}}}}'
Note: from JSONP it isn't possible to provide a payload. The workaround in JSON would be to POST the facets query. But, in JSONP we can only GET. So, this is resolved by using a GET and putting the payload into the source attribute as shown higher on this page.
Find tags that are applicable to current search
If the query block is included in the payload, only tags matching the results in the search are shown.
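Both tag requests, all tags and tags for the current search, can come from one builder that takes an optional query. This is a sketch: tag_request is a hypothetical helper, and it uses each field name as the facet name, which matches most but not all of the facets in the example above.

```python
def tag_request(facet_fields, query=None, size=10000):
    """Build a facets-only request; restrict it by passing the current query."""
    return {
        "size": 0,  # no hits wanted, only the facet counts
        "query": query if query is not None else {"match_all": {}},
        "facets": {
            field: {"terms": {"field": field, "size": size}}
            for field in facet_fields
        },
    }


all_tags = tag_request(["platform", "sensor", "trial"])
current_tags = tag_request(
    ["sensor"],
    query={"term": {"platform": "plat_a"}},
)
print(all_tags)
print(current_tags)
```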
Recommended Backend Structure
Since the number of indexed documents is not expected to grow at a fast rate, there's really not a lot to worry about: ES can handle millions of documents without issue with its default settings.
The biggest thing that would need to be considered would be durability. The data index has 5 shards and 1 replica and is currently in "yellow" state because the replica is not allocated. You would need to add a 2nd machine to hold the replica to put the cluster in "green" state. That's probably not needed until you get closer to production. In the meantime, you can set replicas to 0 if you want "green" cluster state.
One other thing to possibly consider is that you could reduce the number of shards to 1 or 2. If you're not using routing, then ES will search all 5 shards in order to return search results. Having only 1 or 2 shards will make ES searches return faster; for the small number of documents, however, it probably makes no discernible difference.
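Setting the replica count to 0 could be done with a request along these lines. This is a sketch that assumes the index settings endpoint (PUT on the index's _settings path with a number_of_replicas value) is available on the cluster; ES_URL and replica_settings_request are names made up for the example, and the request is built but not sent.

```python
import json
import urllib.request

# Hypothetical address; substitute the real cluster URL.
ES_URL = "http://localhost:9200"


def replica_settings_request(index, replicas):
    """Build (but do not send) a request that changes an index's replica count."""
    body = {"index": {"number_of_replicas": replicas}}
    return urllib.request.Request(
        f"{ES_URL}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
    )


req = replica_settings_request("data", 0)
print(req.get_method(), req.full_url)
# To apply it against a live cluster: urllib.request.urlopen(req)
```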