Search optimisation

I have around 120 million records, with documents structured as follows:

{
  "flink-index-deduplicated" : {
    "aliases" : { },
    "mappings" : {
      "properties" : {
        "arrayinstance" : {
          "type" : "keyword"
        },
        "bearing" : {
          "type" : "keyword",
          "fields" : {
            "value" : {
              "type" : "integer"
            }
          }
        },
        "bearingkey" : {
          "type" : "keyword"
        },
        "hashstring" : {
          "type" : "keyword",
          "fields" : {
            "partial" : {
              "type" : "text",
              "analyzer" : "lsh"
            }
          }
        },
        "priorrepeats" : {
          "type" : "integer"
        },
        "sample" : {
          "type" : "float"
        },
        "sampleindex" : {
          "type" : "long"
        },
        "sampleindexkey" : {
          "type" : "keyword"
        }
      }
    },
    "settings" : {
      "index" : {
        "number_of_shards" : "50",
        "provided_name" : "flink-index-deduplicated",
        "creation_date" : "1575272478393",
        "analysis" : {
          "analyzer" : {
            "lsh" : {
              "type" : "custom",
              "tokenizer" : "whitespace"
            }
          }
        },
        "number_of_replicas" : "5",
        "uuid" : "kt87Sc6_QMioT8P7l861iw",
        "version" : {
          "created" : "7030099"
        }
      }
    }
  }
}
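For reference, the lsh analyzer is just a custom whitespace tokenizer, so each space-separated hash term in hashstring.partial becomes one token. If it helps to verify that, the standard _analyze API can show the tokens produced; a quick sketch using a shortened sample of the hash string from the query further down:

curl -XGET "http://localhost:9200/flink-index-deduplicated/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "lsh",
  "text": "0-94643816497506701514649903939963618 1-69391278870828555903775170996274541"
}'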

A 'simple' search against the index similar to

curl -XGET "http://localhost:9200/flink-index-deduplicated/_search?timeout=3600s" -H 'Content-Type: application/json' -d'
{
"query": {
"bool": {
"filter": [
{
"term": {
"bearingkey": "78"
}
}
],
"must": [
{
"match": {"hashstring.partial":
{
"query": "0-94643816497506701514649903939963618 1-69391278870828555903775170996274541 2-129885322910133729901345651957790127 3-201068757842693162898835419479653087 4-181740299751079897526468213943750395 5-51290148435458682070888217917765859 6-19835801397405244496329174930048305 7-31300816465703694250589088597757893 8-1236789379445915357049976232520579 9-1795667984052509688357405517923499 10-20517836545050008078402267208376486 11-22081425342676136780901944721533469 12-83971774049707587801836139658713761 13-1902988773762100008282105632768357 14-27414036635134105973247175816551996 15-97751889667953482648815842490650711",
"minimum_should_match": 2
}
}
}
]
}
}
}'

takes around 20 seconds. Is this to be expected, or have I engineered the query badly?
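One way to see where the time actually goes is Elasticsearch's profile API, which reports a per-shard breakdown of query and collection time. A minimal sketch, using the same request as above with only a "profile" flag added (the placeholder stands in for the full hash string shown earlier):

curl -XGET "http://localhost:9200/flink-index-deduplicated/_search?timeout=3600s" -H 'Content-Type: application/json' -d'
{
  "profile": true,
  "query": {
    "bool": {
      "filter": [
        { "term": { "bearingkey": "78" } }
      ],
      "must": [
        {
          "match": {
            "hashstring.partial": {
              "query": "<the full hash string from the query above>",
              "minimum_should_match": 2
            }
          }
        }
      ]
    }
  }
}'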

The instance is spread over two USB disks, each holding around 200 GB. ES has 10 GB of RAM allocated on a 2.3 GHz i7 MacBook Pro. The matches are likely to be spread uniformly among the documents, which are identified uniquely by bearing and sampleindex. Typically a query returns up to 200 matches.

If I repeat the query the result is returned instantly, as I might expect from caching. I'm running the query against each of the sample index values for a given bearing in order, and I'm not guaranteed that the matches to successive queries will be correlated in any way. So a query on sampleindex=12345 need not cache the results for sampleindex=12346, for example.
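To make cold-versus-warm timings easier to compare, the index-level caches can be cleared between runs with the standard clear-cache API (a sketch; note this does not empty the operating system's filesystem cache, which may also be behind the instant repeat):

curl -XPOST "http://localhost:9200/flink-index-deduplicated/_cache/clear?pretty"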

What I find most confusing is that, according to Activity Monitor, ES is doing very little during those twenty or so seconds, nor is there excessive disk I/O taking place.

Given your description of the data volume and setup, it sounds like disk I/O could be the bottleneck. Have you measured disk utilization and iowait?
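One way to sample disk statistics from Elasticsearch itself is the node stats API (a sketch; as far as I know the detailed io_stats block is only populated on Linux, so on macOS you may need to rely on Activity Monitor or iostat instead):

curl -XGET "http://localhost:9200/_nodes/stats/fs?pretty"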

50 primary shards for up to 400 GB of data also sounds excessive and could hurt performance. I would recommend reindexing into 8-10 primary shards and seeing how that affects query times.
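A rough sketch of that reindexing step, assuming a new index named flink-index-deduplicated-v2 (copy your full mappings across as well; replicas are set to 0 here only to speed up the reindex and can be raised afterwards):

curl -XPUT "http://localhost:9200/flink-index-deduplicated-v2" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 8,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "lsh": { "type": "custom", "tokenizer": "whitespace" }
      }
    }
  }
}'

curl -XPOST "http://localhost:9200/_reindex?wait_for_completion=false" -H 'Content-Type: application/json' -d'
{
  "source": { "index": "flink-index-deduplicated" },
  "dest": { "index": "flink-index-deduplicated-v2" }
}'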
