Elasticsearch 1.7: query time spikes, followed by application crashes

We use ES as our search engine and the query rate is high (around 1,500 requests per minute). From time to time there is a query-time spike on some nodes. We have set a search timeout of 2s, but traffic gets stuck after the spike.
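For reference, here is roughly how we apply the timeout (a minimal sketch; the index name and query are placeholders):

```
# Per-request search timeout in the ES 1.x query DSL.
# Note: this timeout is best-effort per shard and returns partial results;
# it does not kill work that is already running.
curl -XPOST 'localhost:9200/our_index/_search' -d '{
  "timeout": "2s",
  "query": { "match": { "title": "example" } }
}'
```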

Any ideas how to solve this issue? Should we cache the queries, or is there another solution?

Thanks,

Do you see anything in the logs around that time, e.g. regarding long GC?
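Long collections typically show up as [gc][young] or [gc][old] lines from the monitor.jvm service; something like this would find them (adjust the log path to your install):

```
# Scan the ES logs for JVM GC entries; long collections are logged
# with their durations.
grep -E "\[gc\]\[(young|old)\]" /var/log/elasticsearch/*.log
```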

We did not see anything special in our ES logs; there were no errors.

We reviewed our cluster metrics from that time and did find a spike in indices.search.query_time_in_millis, but only on one of the data nodes.
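For context, we pulled the numbers from the node stats API (sketch; query_time_in_millis is cumulative, so we graph its rate of change per node):

```
# Per-node index stats, including indices.search.query_time_in_millis
# and indices.merges.total_time_in_millis.
curl 'localhost:9200/_nodes/stats/indices?pretty'
```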

It doesn't have to be an error. Anything indicating slow merging or long GC could contribute.

We identified that the merge time (indices.merges.total_time_in_millis) was very high for that particular data node (peaking around 1k for the bad node vs. around 100 for the rest). It looks like slow merging was the root cause? Is there a way to address this uneven load across nodes?
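In case it is relevant: would raising the store-level merge throttle help? A sketch of what we could try (the 40mb value is just an example, not a recommendation):

```
# ES 1.x throttles merge I/O at the store level (default 20mb;
# this setting was removed in 2.0). Raising it lets merges keep up,
# at the cost of more disk I/O.
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "indices.store.throttle.max_bytes_per_sec": "40mb"
  }
}'
```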

Look at data distribution. I would also recommend upgrading as you are running a very old version.
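The cat APIs give a quick picture of how shards and disk usage are spread across nodes, for example:

```
# Shard placement and sizes per node; a few hot shards landing on one
# node would explain that node doing most of the merge work.
curl 'localhost:9200/_cat/shards?v'
curl 'localhost:9200/_cat/allocation?v'
```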

OK. Why do you think upgrading ES will solve this issue? Do you mean that newer versions handle slow merging or long GC better?

For ES 1.7, could you give more suggestions on how to handle this issue? (As mentioned above, the spike in indices.search.query_time_in_millis appeared on only one of the data nodes.)

Thanks

What sets the problematic node apart from the others? Are the indices you are indexing into spread out evenly across the nodes? Are you using routing or parent-child documents, which could result in an uneven balance? Do all nodes have the same specification and configuration?

In our current cluster setup we have 6 data nodes, one of which also acts as master. All nodes have the same specification and configuration regardless of node type.

We have multiple indices (with sizes ranging from 10M to 2G) that receive different amounts of traffic. There is no specific routing setup (everything is default), and our indexed documents do not use parent-child relationships.

One potential issue we discovered is that we are using the default settings for the number of shards and replicas when creating indices. Do you have any suggestions on optimal shard/replica assignments for this use case?
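For illustration, this is how we would set them explicitly instead of relying on the defaults (sketch; the index name and counts are placeholders, not a recommendation):

```
# number_of_shards is fixed at index creation time in ES 1.x;
# number_of_replicas can be changed later on a live index.
curl -XPUT 'localhost:9200/our_index' -d '{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}'

# Adjust the replica count without reindexing:
curl -XPUT 'localhost:9200/our_index/_settings' -d '{
  "number_of_replicas": 2
}'
```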

Each nested document is represented by a number of documents behind the scenes. When you update a document, all of these are updated. If you have large and/or deep nested documents, this can result in a lot of indexing work. How large are your largest nested documents?
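To illustrate with a made-up mapping: each object under a nested field is stored as its own hidden Lucene document alongside the root, and an update rewrites all of them together.

```
# Hypothetical mapping: a product with a nested "reviews" field.
curl -XPUT 'localhost:9200/our_index/_mapping/product' -d '{
  "product": {
    "properties": {
      "reviews": {
        "type": "nested",
        "properties": {
          "user":   { "type": "string" },
          "rating": { "type": "integer" }
        }
      }
    }
  }
}'

# A product with 100 reviews is roughly 101 Lucene documents behind
# the scenes; updating the product re-indexes all of them.
```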
