Optimising Elasticsearch Search Queries

Hi all,

We have an ES cluster set up on AWS with 3 master nodes (c4.xlarge.elasticsearch) and 3 data nodes (c4.xlarge.elasticsearch), with a single index using 5 shards.
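For context, the way the shards are laid out across the data nodes can be checked with the cat shards API (the index name "transactions" below is just a placeholder for our actual index):

GET _cat/shards/transactions?v&h=index,shard,prirep,state,docs,store,node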

Our mapping looks like this.

{
  "mappings": {
    "_doc": {
      "properties": {
        "action_id": {
          "type": "integer"
        },
        "amount_in_base_currency": {
          "type": "scaled_float",
          "scaling_factor": 100000
        },
        "client_id": {
          "type": "integer"
        },
        "has_commission": {
          "type": "boolean"
        },
        "created_at": {
          "type": "integer"
        },
        "from_uuid": {
          "type": "keyword",
          "copy_to": "query_uuid"
        },
        "id": {
          "type": "keyword"
        },
        "status": {
          "type": "byte"
        },
        "to_uuid": {
          "type": "keyword",
          "copy_to": "query_uuid"
        },
        "transaction_hash": {
          "type": "keyword"
        },
        "transfer_events": {
          "properties": {
            "amount_in_base_currency": {
              "type": "scaled_float",
              "scaling_factor": 100000
            },
            "from_address": {
              "type": "keyword"
            },
            "from_uuid": {
              "type": "keyword"
            },
            "to_address": {
              "type": "keyword"
            },
            "to_uuid": {
              "type": "keyword"
            }
          }
        },
        "type": {
          "type": "byte"
        },
        "updated_at": {
          "type": "integer"
        },
        "query_uuid": {
          "type": "text",
          "analyzer": "whitespace"
        }
      }
    }
  }
}

We have around 30 million records indexed in this cluster and have been working towards optimising our select queries (we have written scripts which are able to fire 1000 - 1500 parallel queries to this ES cluster).

This is the query we are trying to optimise:

{
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "1" } },
        { "match": { "query_uuid": "5282b830-ab5b-48e0-aacc-7fbaafcd3179" } }
      ]
    }
  },
  "from": 0,
  "size": 10,
  "sort": [
    { "created_at": "desc" }
  ]
}

We realised that under this heavy load selects were taking around 3300 milliseconds (as reported under the 'took' key in the ES response).

We then used the Profile API to dig deeper into the numbers.
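For reference, a minimal sketch of how we enable profiling: the same search body as above, with "profile": true added at the top level.

{
  "profile": true,
  "query": {
    "bool": {
      "filter": [
        { "term": { "type": "1" } },
        { "match": { "query_uuid": "5282b830-ab5b-48e0-aacc-7fbaafcd3179" } }
      ]
    }
  },
  "from": 0,
  "size": 10,
  "sort": [
    { "created_at": "desc" }
  ]
}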

For each shard, the query times as reported above are (10410196, 9969214, 33323904, 35199400, 17725846) nanoseconds, i.e. roughly 10 to 35 milliseconds.

{
  "type": "BooleanQuery",
  "description": "#type:[1 TO 1] #query_uuid:5282b830-ab5b-48e0-aacc-7fbaafcd3179",
  "time_in_nanos": 17725846,
  "breakdown": {
    "score": 0,
    "build_scorer_count": 40,
    "match_count": 0,
    "create_weight": 663288,
    "next_doc": 3130748,
    "match": 0,
    "create_weight_count": 1,
    "next_doc_count": 498,
    "score_count": 0,
    "build_scorer": 13931271,
    "advance": 0,
    "advance_count": 0
  }
}

Reading up on the documentation, we understand that total query time is also impacted by collector time:

"collector": [
{
"name": "CancellableCollector",
"reason": "search_cancelled",
"time_in_nanos": 25127530,
"children": [
{
"name": "MultiCollector",
"reason": "search_top_hits",
"time_in_nanos": 17963249
}
]
}
]

which for each shard in this case are 31831976, 22071150, 56140508, 27768799, 25127530 nanoseconds, i.e. roughly 22 to 56 milliseconds.

Roughly 22 to 56 milliseconds of collector time plus 10 to 35 milliseconds of query time doesn't come close to the more than 3.3 seconds reported by the 'took' field in the response.

We have also configured slow query logs on the AWS ES service. Even for queries whose 'took' times are reported as 3 - 4 seconds, the per-shard time we see in the slow query logs isn't more than 80 - 90 milliseconds.
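For reference, a sketch of the kind of index-level slow log settings involved (the index name and thresholds below are illustrative, not our exact values):

PUT /transactions/_settings
{
  "index.search.slowlog.threshold.query.warn": "1s",
  "index.search.slowlog.threshold.query.info": "500ms",
  "index.search.slowlog.threshold.fetch.warn": "500ms",
  "index.search.slowlog.level": "info"
}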

Our hunch here is that the master nodes, which are aggregating responses from each shard and returning them to the client, are taking the remaining time. Can someone please help us understand the reason and reduce these times?

What version are you on? What JVM, OS?
What does your query actually look like? What about the mapping? How are you querying Elasticsearch? How are you monitoring Elasticsearch?

For someone to help, I think you need to provide a bit more information.

We are using the Elasticsearch service by AWS.

We have also updated the question with our mapping and a sample query we are facing issues with.

How does query latency vary if you start with a smaller number of concurrent queries and then gradually increase it to the levels you are trying now? What does CPU load look like on the nodes?

Hi Christian, please find below the CPU utilization graph. We tried the following:

  1. We increased the data node instance type to c4.4xlarge and now the CPU utilization is under 60%; previously it was going to around 80%. The query time has improved but we still see the above pattern.

Are you performing any indexing or updates during this time?

Nope, indexing was already done.

Did you try to gradually increase the query concurrency? If so, how does this correspond to the graphs shown?

We went from running 700 parallel queries up to 2100 parallel queries, and the CPU graph increased from 40% to 60%.

Are you monitoring the responses to the queries you are sending to see how many of them are actually successful? Given that you are sending so many concurrent queries, I would expect the internal queues to fill up and some of the queries therefore to be rejected. The fact that accepted queries end up sitting in the queue for a while may be why the measured latency is longer than what is reported. You should be able to see the size of the queues using e.g. the cat thread pool API.
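For example, something along the following lines shows the search thread pool's active threads, queue size and rejection count per node (the column selection here is just one option):

GET _cat/thread_pool/search?v&h=node_name,name,active,queue,rejected,completed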

I would recommend you watch this Elastic{ON} talk, as I believe it talks about the difference between latency and service time.

Once you are monitoring the responses to the queries (if you are not already), you can start at a lower concurrency level and see how the proportion of successful queries and the latency change as concurrency increases. At the same time, track how the size of the queues in Elasticsearch changes.

If your data set is small, you may get better performance by switching to a single primary shard with 2 replicas instead, so that all nodes hold all the data. This will also have the benefit of taking up fewer slots on the internal queues compared to your current configuration.
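A sketch of what that could look like when creating a new index (the index name here is a placeholder; an existing index would need to be reindexed or shrunk, since the number of primary shards cannot be changed in place):

PUT /transactions_v2
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 2
  }
}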

I would also recommend using our benchmarking tool Rally, as it does a lot of this out of the box.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.