Hi all,
We have an ES cluster set up on AWS with 3 master nodes (c4.xlarge.elasticsearch) and 3 data nodes (c4.xlarge.elasticsearch), and a single index with 5 shards.
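For context, the index-level settings would look roughly like this (only the 5-shard count is stated above; the replica count shown here is just an illustrative assumption):
{
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 1
    }
  }
}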
Our mapping looks like this.
{
"mappings": {
"_doc": {
"_doc": {
"properties": {
"action_id": {
"type": "integer"
},
"amount_in_base_currency": {
"type": "scaled_float",
"scaling_factor": 100000
},
"client_id": {
"type": "integer"
},
"has_commission": {
"type": "boolean"
},
"created_at": {
"type": "integer"
},
"from_uuid": {
"type": "keyword",
"copy_to": "query_uuid"
},
"id": {
"type": "keyword"
},
"status": {
"type": "byte"
},
"to_uuid": {
"type": "keyword",
"copy_to": "query_uuid"
},
"transaction_hash": {
"type": "keyword"
},
"transfer_events": {
"properties": {
"amount_in_base_currency": {
"type": "scaled_float",
"scaling_factor": 100000
},
"from_address": {
"type": "keyword"
},
"from_uuid": {
"type": "keyword"
},
"to_address": {
"type": "keyword"
},
"to_uuid": {
"type": "keyword"
}
}
},
"type": {
"type": "byte"
},
"updated_at": {
"type": "integer"
},
"query_uuid": {
"type": "text",
"analyzer": "whitespace"
}
}
}
}
}
We have around 30 million records indexed in this cluster and have been working on optimising our select queries (we have written scripts that can fire 1000 - 1500 parallel queries at this ES cluster).
This is the query we are trying to optimise:
{"query":{"bool":{"filter":[{"term":{"type":"1"}},{"match":{"query_uuid":"5282b830-ab5b-48e0-aacc-7fbaafcd3179"}}]}},"from":0,"size":10,"sort":[{"created_at":"desc"}]}
We realised that under this heavy load selects were taking around 3300 milliseconds (as reported under the 'took' key in the ES response).
We then used the Profile API to dig deeper into the numbers.
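For reference, profiling simply means re-sending the same search body with "profile": true added, roughly like this:
{
  "profile": true,
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "1" } },
        { "match": { "query_uuid": "5282b830-ab5b-48e0-aacc-7fbaafcd3179" } }
      ]
    }
  },
  "from": 0,
  "size": 10,
  "sort": [ { "created_at": "desc" } ]
}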
The per-shard query times reported there are 10410196, 9969214, 33323904, 35199400 and 17725846 nanoseconds, i.e. roughly 10 to 35 milliseconds (e.g. 17725846 ns ≈ 17.7 ms). For example, the query profile for one of the shards looks like this:
{
"type": "BooleanQuery",
"description": "#type:[1 TO 1] #query_uuid:5282b830-ab5b-48e0-aacc-7fbaafcd3179",
"time_in_nanos": 17725846,
"breakdown": {
"score": 0,
"build_scorer_count": 40,
"match_count": 0,
"create_weight": 663288,
"next_doc": 3130748,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 498,
"score_count": 0,
"build_scorer": 13931271,
"advance": 0,
"advance_count": 0
}
}
Reading up on the documentation, we understand that total query time is also impacted by collector time:
"collector": [
{
"name": "CancellableCollector",
"reason": "search_cancelled",
"time_in_nanos": 25127530,
"children": [
{
"name": "MultiCollector",
"reason": "search_top_hits",
"time_in_nanos": 17963249
}
]
}
]
The collector times for each shard in this case are 31831976, 22071150, 56140508, 27768799 and 25127530 nanoseconds, i.e. roughly 22 to 56 milliseconds.
Roughly 22 to 56 milliseconds of collector time plus 10 to 35 milliseconds of query time does not come close to explaining the more than 3.3 seconds reported by the 'took' field in the response.
We have also configured slow query logs on AWS ES. Even for queries whose 'took' time is reported as 3 - 4 seconds, the per-shard time we see in the slow query logs is never more than 80 - 90 milliseconds.
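For reference, the slow log thresholds are set through the index settings API along these lines (the threshold values here are illustrative, not our exact production values):
{
  "index.search.slowlog.threshold.query.warn": "100ms",
  "index.search.slowlog.threshold.query.info": "50ms",
  "index.search.slowlog.threshold.fetch.warn": "100ms",
  "index.search.slowlog.threshold.fetch.info": "50ms"
}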
Our hunch here is that the master nodes, which aggregate the responses from each shard and return them to the client, are taking the remaining time. Can someone please help us understand the reason and reduce these times?