I have around 120 million documents in an index whose mapping and settings are as follows:
{
  "flink-index-deduplicated" : {
    "aliases" : { },
    "mappings" : {
      "properties" : {
        "arrayinstance" : {
          "type" : "keyword"
        },
        "bearing" : {
          "type" : "keyword",
          "fields" : {
            "value" : {
              "type" : "integer"
            }
          }
        },
        "bearingkey" : {
          "type" : "keyword"
        },
        "hashstring" : {
          "type" : "keyword",
          "fields" : {
            "partial" : {
              "type" : "text",
              "analyzer" : "lsh"
            }
          }
        },
        "priorrepeats" : {
          "type" : "integer"
        },
        "sample" : {
          "type" : "float"
        },
        "sampleindex" : {
          "type" : "long"
        },
        "sampleindexkey" : {
          "type" : "keyword"
        }
      }
    },
    "settings" : {
      "index" : {
        "number_of_shards" : "50",
        "provided_name" : "flink-index-deduplicated",
        "creation_date" : "1575272478393",
        "analysis" : {
          "analyzer" : {
            "lsh" : {
              "type" : "custom",
              "tokenizer" : "whitespace"
            }
          }
        },
        "number_of_replicas" : "5",
        "uuid" : "kt87Sc6_QMioT8P7l861iw",
        "version" : {
          "created" : "7030099"
        }
      }
    }
  }
}
A 'simple' search against the index similar to
curl -XGET "http://localhost:9200/flink-index-deduplicated/_search?timeout=3600s" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "bearingkey": "78"
          }
        }
      ],
      "must": [
        {
          "match": {
            "hashstring.partial": {
              "query": "0-94643816497506701514649903939963618 1-69391278870828555903775170996274541 2-129885322910133729901345651957790127 3-201068757842693162898835419479653087 4-181740299751079897526468213943750395 5-51290148435458682070888217917765859 6-19835801397405244496329174930048305 7-31300816465703694250589088597757893 8-1236789379445915357049976232520579 9-1795667984052509688357405517923499 10-20517836545050008078402267208376486 11-22081425342676136780901944721533469 12-83971774049707587801836139658713761 13-1902988773762100008282105632768357 14-27414036635134105973247175816551996 15-97751889667953482648815842490650711",
              "minimum_should_match": 2
            }
          }
        }
      ]
    }
  }
}'
takes around 20 seconds. Is this to be expected, or have I engineered the query badly?
The instance is spread over two USB disks, each holding around 200 GB. ES has 10 GB of RAM allocated on a 2.3 GHz i7 MacBook Pro. The matches are likely to be spread uniformly amongst the documents, which are identified uniquely by bearing and sampleindex. Typically a query returns up to 200 matches.
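As a rough sanity check on the sizes involved, the figures above work out as follows. This is a back-of-the-envelope sketch only; it assumes the two disks together hold about 400 GB of index data, which the description suggests but does not state outright.

```python
# Figures taken from the index settings and the description above.
total_docs = 120_000_000        # "around 120 million" documents
primary_shards = 50             # number_of_shards
replicas_per_primary = 5        # number_of_replicas
total_data_gb = 2 * 200         # ASSUMPTION: both disks hold ~200 GB of index data

# Average load per primary shard, assuming documents are spread evenly.
docs_per_primary = total_docs // primary_shards
gb_per_primary = total_data_gb / primary_shards

# Shard copies the settings ask for: each primary plus its replicas.
configured_shard_copies = primary_shards * (1 + replicas_per_primary)

print(docs_per_primary, gb_per_primary, configured_shard_copies)
```

That is roughly 2.4 million documents and 8 GB per primary shard, with 300 shard copies configured in total.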
If I repeat the query, the result is returned instantly, as I would expect from caching. I'm running the query against each of the sampleindex values for a given bearing in order, and I'm not guaranteed that the matches to successive queries will be correlated in any way. So a query on sampleindex=12345 need not cache the results for sampleindex=12346, for example.
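For concreteness, the per-sample loop described above could build each request body along these lines. This is only a sketch: `build_query` is a hypothetical helper, and it assumes each token in `hashstring.partial` has the form `<position>-<hash>`, as the example query suggests.

```python
import json


def build_query(bearing_key, hashes, minimum_should_match=2):
    """Assemble a search body shaped like the example query (hypothetical helper)."""
    # ASSUMPTION: tokens are "<position>-<hash>", separated by spaces so that
    # the whitespace tokenizer of the "lsh" analyzer splits them apart.
    tokens = " ".join(f"{i}-{h}" for i, h in enumerate(hashes))
    return {
        "query": {
            "bool": {
                # Exact match on the keyword field, used as a non-scoring filter.
                "filter": [{"term": {"bearingkey": bearing_key}}],
                # At least `minimum_should_match` of the tokens must match.
                "must": [
                    {
                        "match": {
                            "hashstring.partial": {
                                "query": tokens,
                                "minimum_should_match": minimum_should_match,
                            }
                        }
                    }
                ],
            }
        }
    }


# Only the first two hashes from the example are used, to keep this short.
body = build_query(
    "78",
    [
        "94643816497506701514649903939963618",
        "69391278870828555903775170996274541",
    ],
)
print(json.dumps(body, indent=2))
```

The printed JSON can be passed as the `-d` payload of the curl command shown earlier.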
What I find most confusing is that, according to the Activity Monitor readings, ES is doing very little during the twenty or so seconds, nor is there excessive disk I/O taking place.