Optimising Elasticsearch Search Queries

Hi all,

We have an ES cluster set up on AWS with 3 master nodes (c4.xlarge.elasticsearch) and 3 data nodes (c4.xlarge.elasticsearch), with a single index using 5 shards.
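For context, the way the shards are laid out across the data nodes can be checked with the cat shards API (the index name "transactions" below is just a placeholder for our actual index):

GET _cat/shards/transactions?v&h=index,shard,prirep,state,docs,store,node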

Our mapping looks like this.

{
  "mappings": {
    "_doc": {
      "properties": {
        "action_id": {
          "type": "integer"
        },
        "amount_in_base_currency": {
          "type": "scaled_float",
          "scaling_factor": 100000
        },
        "client_id": {
          "type": "integer"
        },
        "has_commission": {
          "type": "boolean"
        },
        "created_at": {
          "type": "integer"
        },
        "from_uuid": {
          "type": "keyword",
          "copy_to": "query_uuid"
        },
        "id": {
          "type": "keyword"
        },
        "status": {
          "type": "byte"
        },
        "to_uuid": {
          "type": "keyword",
          "copy_to": "query_uuid"
        },
        "transaction_hash": {
          "type": "keyword"
        },
        "transfer_events": {
          "properties": {
            "amount_in_base_currency": {
              "type": "scaled_float",
              "scaling_factor": 100000
            },
            "from_address": {
              "type": "keyword"
            },
            "from_uuid": {
              "type": "keyword"
            },
            "to_address": {
              "type": "keyword"
            },
            "to_uuid": {
              "type": "keyword"
            }
          }
        },
        "type": {
          "type": "byte"
        },
        "updated_at": {
          "type": "integer"
        },
        "query_uuid": {
          "type": "text",
          "analyzer": "whitespace"
        }
      }
    }
  }
}

We have around 30 million records indexed in this cluster and have been working towards optimising our select queries (we have written scripts which are able to fire 1000 - 1500 parallel queries to this ES cluster).

This is the query we are trying to optimise:

{
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "1" } },
        { "match": { "query_uuid": "5282b830-ab5b-48e0-aacc-7fbaafcd3179" } }
      ]
    }
  },
  "from": 0,
  "size": 10,
  "sort": [
    { "created_at": "desc" }
  ]
}

We realised that under this heavy load selects were taking around 3300 milliseconds (as reported under the 'took' key in the ES response).

We then used the Profile API to dig deeper into the numbers.
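For reference, a minimal sketch of how we enable profiling: the same search body as above, with "profile": true added at the top level.

{
  "profile": true,
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "1" } },
        { "match": { "query_uuid": "5282b830-ab5b-48e0-aacc-7fbaafcd3179" } }
      ]
    }
  },
  "from": 0,
  "size": 10,
  "sort": [
    { "created_at": "desc" }
  ]
}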

For each shard, the query times as reported above are (10410196, 9969214, 33323904, 35199400, 17725846) nanoseconds, i.e. roughly 10 to 35 milliseconds.

{
  "type": "BooleanQuery",
  "description": "#type:[1 TO 1] #query_uuid:5282b830-ab5b-48e0-aacc-7fbaafcd3179",
  "time_in_nanos": 17725846,
  "breakdown": {
    "score": 0,
    "build_scorer_count": 40,
    "match_count": 0,
    "create_weight": 663288,
    "next_doc": 3130748,
    "match": 0,
    "create_weight_count": 1,
    "next_doc_count": 498,
    "score_count": 0,
    "build_scorer": 13931271,
    "advance": 0,
    "advance_count": 0
  }
}

Reading up on the documentation, we understand that total query time is also impacted by collector time:

"collector": [
{
"name": "CancellableCollector",
"reason": "search_cancelled",
"time_in_nanos": 25127530,
"children": [
{
"name": "MultiCollector",
"reason": "search_top_hits",
"time_in_nanos": 17963249
}
]
}
]

which for each shard in this case are 31831976, 22071150, 56140508, 27768799, 25127530 nanoseconds, i.e. roughly 22 to 56 milliseconds.

Roughly 22 to 56 milliseconds of collector time plus 10 to 35 milliseconds of query time doesn't come close to the more than 3.3 seconds reported by the 'took' field in the response.

We have also configured slow query logs on the AWS ES service. Even for queries whose 'took' times are reported as 3 - 4 seconds, the per-shard time we see in the slow query logs isn't more than 80 - 90 milliseconds.
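For reference, a sketch of the kind of index-level slow log settings involved (the index name and thresholds below are illustrative, not our exact values):

PUT /transactions/_settings
{
  "index.search.slowlog.threshold.query.warn": "1s",
  "index.search.slowlog.threshold.query.info": "500ms",
  "index.search.slowlog.threshold.fetch.warn": "500ms",
  "index.search.slowlog.level": "info"
}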

Our hunch here is that the master nodes, which are aggregating responses from each shard and returning them to the client, are taking the remaining time. Can someone please help us understand the reason and reduce these times?

What version are you on? What JVM, OS?
What does your query actually look like? What about the mapping? How are you querying Elasticsearch? How are you monitoring Elasticsearch?

For someone to help, I think you need to provide a bit more information.

We are using the Elasticsearch service by AWS.

We have also updated the question with our mapping and a sample query we are facing issues with.

How does query latency vary if you start with a smaller number of concurrent queries and then gradually increase it to the levels you are trying now? What does CPU load look like on the nodes?

Hi Christian, please find below the CPU utilization graph. We tried the following:

  1. We increased the data node instance type to c4.4xlarge and now the CPU utilization is under 60%; previously it was going to around 80%. The query time has improved but we still see the above pattern.

Are you performing any indexing or updates during this time?

Nope, indexing was already done.

Did you try to gradually increase the query concurrency? If so, how does this correspond to the graphs shown?

We went from running 700 parallel queries up to 2100 parallel queries, and the CPU graph increased from 40% to 60%.

Are you monitoring the responses to the queries you are sending to see how many of them are actually successful? Given that you are sending so many concurrent queries, I would expect the internal queues to fill up and some of the queries therefore to be rejected. The fact that accepted queries end up sitting in the queue for a while may be why the measured latency is longer than what is reported. You should be able to see the size of the queues using e.g. the cat thread pool API.
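For example, something along the following lines shows the search thread pool's active threads, queue size and rejection count per node (the column selection here is just one option):

GET _cat/thread_pool/search?v&h=node_name,name,active,queue,rejected,completed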

I would recommend you watch this Elastic{ON} talk, as I believe it talks about the difference between latency and service time.

Once you are monitoring the responses to the queries (if you are not already), you can start at a lower concurrency level and see how the proportion of successful queries and the latency change as concurrency increases. At the same time, track how the size of the queues in Elasticsearch changes.

If your data set is small, you may get better performance by switching to a single primary shard with 2 replicas instead, so that all nodes hold all the data. This will also have the benefit of taking up fewer slots on the internal queues compared to your current configuration.
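A sketch of what that could look like when creating a new index (the index name here is a placeholder; an existing index would need to be reindexed or shrunk, since the number of primary shards cannot be changed in place):

PUT /transactions_v2
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 2
  }
}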

I would also recommend using our benchmarking tool Rally, as it does a lot of this out of the box.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.