Elasticsearch 1.7: query time spikes, then the application processor crashes

Weizhi_Li · November 30, 2017, 2:01am

I use ES as our search engine. The request rate is high (i.e., 1500 rpm), and from time to time there is a query time spike on some nodes. I have set the timeout to 2s. Traffic gets stuck after a spike.

Any ideas on how to solve this issue? Should we cache the queries, or are there other solutions?

Thanks,

Christian_Dahlqvist · November 30, 2017, 6:19am

Do you see anything in the logs around that time, e.g. regarding long GC?

Weizhi_Li · November 30, 2017, 7:29pm

We did not see anything unusual in our ES logs; there were no errors.

We reviewed our cluster metrics for that time and did find a spike in indices.search.query_time_in_millis, but only on one of the data nodes.
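
Since `query_time_in_millis` and `query_total` in the node stats output (`GET /_nodes/stats`) are cumulative counters, a spike is easier to see as the difference between two samples. A minimal Python sketch, assuming two snapshots shaped like the `nodes` section of that response have already been fetched; the node ids, node names, and numbers below are made up for illustration:

```python
def avg_query_ms(before, after):
    """Average query time per search on each node between two snapshots.

    Both arguments map node id -> node stats, mirroring the "nodes"
    section of a node stats response (assumed shape).
    """
    result = {}
    for node_id, stats in after.items():
        if node_id not in before:
            continue  # node joined between samples; no baseline
        new = stats["indices"]["search"]
        old = before[node_id]["indices"]["search"]
        queries = new["query_total"] - old["query_total"]
        millis = new["query_time_in_millis"] - old["query_time_in_millis"]
        # Guard against idle nodes (and counter resets after a restart).
        result[stats["name"]] = millis / queries if queries > 0 else 0.0
    return result


def snapshot(rows):
    """Build a fake snapshot from (id, name, query_total, query_time_in_millis)."""
    return {
        node_id: {
            "name": name,
            "indices": {"search": {"query_total": total,
                                   "query_time_in_millis": millis}},
        }
        for node_id, name, total, millis in rows
    }


# Hypothetical samples taken one minute apart.
before = snapshot([("a1", "data-1", 1000, 20000), ("b2", "data-2", 1000, 21000)])
after = snapshot([("a1", "data-1", 1250, 25000), ("b2", "data-2", 1250, 146000)])

per_node = avg_query_ms(before, after)
for name, ms in sorted(per_node.items()):
    print(f"{name}: {ms:.1f} ms per query")
```

With these invented numbers, one node averages far more time per query than the other over the same interval, which is the kind of per-node gap described above.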

Christian_Dahlqvist · November 30, 2017, 7:32pm

It doesn't have to be an error. Anything indicating slow merging or long GC could contribute.

Weizhi_Li · November 30, 2017, 8:07pm

We found that the merge time (indices.merges.total_time_in_millis) was very high for that particular data node (max 1k for the bad node vs. 100 for the rest). So it looks like slow merging was the root cause? Is there a way to fix the load not being balanced across nodes?
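
One way to put a number on that imbalance is to compare each node's merge time with the median across nodes. A rough Python sketch; the node names are hypothetical, the values echo the 1k vs. 100 figures above, and the 3x threshold is an arbitrary assumption:

```python
from statistics import median


def merge_outliers(merge_time_by_node, factor=3.0):
    """Return nodes whose merge time exceeds `factor` times the median."""
    typical = median(merge_time_by_node.values())
    return {
        node: value / typical
        for node, value in merge_time_by_node.items()
        if typical > 0 and value > factor * typical
    }


# Hypothetical per-node values for indices.merges.total_time_in_millis.
merge_time = {
    "data-1": 100, "data-2": 100, "data-3": 1000,
    "data-4": 100, "data-5": 100, "data-6": 100,
}

outliers = merge_outliers(merge_time)
for node, ratio in outliers.items():
    print(f"{node}: {ratio:.0f}x the median merge time")
```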

Christian_Dahlqvist · November 30, 2017, 8:30pm

Look at data distribution. I would also recommend upgrading as you are running a very old version.

Weizhi_Li · November 30, 2017, 9:21pm

OK. Why do you think upgrading ES will solve this issue? Do you mean that newer versions handle slow merging or long GC better?

If we stay on ES 1.7, do you have any further suggestions for handling this issue? (As mentioned, we reviewed our cluster metrics for that time and found the spike in indices.search.query_time_in_millis on only one of the data nodes.)

Thanks

Christian_Dahlqvist · November 30, 2017, 10:09pm

What sets the problematic node apart from the others? Are the indices being indexed into spread evenly across the nodes? Are you using routing or parent-child, which could result in an uneven balance? Do all nodes have the same specification and configuration?
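
As a rough way to check how shards are spread, the text output of the cat shards API (`GET /_cat/shards`) can be tallied per node. A minimal Python sketch, assuming the default column order (index, shard, prirep, state, docs, store, ip, node); the index names, IPs, node names, and counts in the sample are hypothetical:

```python
from collections import Counter

# Hypothetical output in the default column order of the cat shards API:
# index shard prirep state docs store ip node
CAT_SHARDS = """\
products 0 p STARTED 120000 1.1gb 10.0.0.1 data-1
products 0 r STARTED 120000 1.1gb 10.0.0.2 data-2
products 1 p STARTED 118000 1gb 10.0.0.1 data-1
products 1 r STARTED 118000 1gb 10.0.0.3 data-3
reviews 0 p STARTED 9000 10mb 10.0.0.1 data-1
reviews 0 r UNASSIGNED
"""


def shards_per_node(cat_output):
    """Count started shard copies and their documents on each node."""
    shards, docs = Counter(), Counter()
    for line in cat_output.splitlines():
        # Node names may contain spaces, so keep the tail in one piece.
        fields = line.split(None, 7)
        if len(fields) < 8 or fields[3] != "STARTED":
            continue  # unassigned or moving shards carry no stable node
        node = fields[7]
        shards[node] += 1
        docs[node] += int(fields[4])
    return shards, docs


shards, docs = shards_per_node(CAT_SHARDS)
for node in sorted(shards):
    print(f"{node}: {shards[node]} shards, {docs[node]} docs")
```

A node that holds noticeably more shards or documents of the busiest indices than its peers would be a candidate explanation for uneven merge and query load.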

Weizhi_Li · November 30, 2017, 10:54pm

In our current cluster setup, we have 6 data nodes, one of which also acts as master. All nodes have the same specification and configuration regardless of node type.

We have multiple indices (ranging in size from 10M to 2G) that take different amounts of traffic. There is no specific routing setup (everything is default). Our indexed documents do not use parent-child relationships.

One potential issue we discovered is that we are using the default settings for the number of shards and number of replicas when building an index. Do you have any suggestions on optimal shard / replica assignments for this use case?
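
For context, the 1.x defaults are 5 primary shards and 1 replica per index, which gives 10 shard copies per index. A small Python sketch of how such a count divides across the 6 data nodes described above, in the best case; the alternative layout is hypothetical and this is arithmetic only, not a sizing recommendation:

```python
def copies_per_node(primaries, replicas, data_nodes):
    """Best-case spread of one index's shard copies across data nodes.

    Returns (total copies, fewest copies on a node, most copies on a node).
    """
    total = primaries * (1 + replicas)
    low, extra = divmod(total, data_nodes)
    high = low + 1 if extra else low
    return total, low, high


# 5 primaries / 1 replica are the 1.x defaults; 6 data nodes as described above.
default_layout = copies_per_node(primaries=5, replicas=1, data_nodes=6)
# A hypothetical alternative that divides evenly across 6 nodes.
even_layout = copies_per_node(primaries=3, replicas=1, data_nodes=6)

print("defaults:", default_layout)
print("3 primaries, 1 replica:", even_layout)
```

With the defaults, some nodes necessarily hold two copies of a given index while others hold one, so a heavily used index cannot be spread perfectly evenly over 6 nodes.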

Christian_Dahlqvist · December 1, 2017, 3:08am

Each nested document is represented by a number of documents behind the scenes. When you update a document, all of these are updated. If you have large and/or deeply nested documents, this can result in a lot of indexing work. How large are your largest nested documents?

system · December 29, 2017, 3:09am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
