We identified that the merge time (indices.merges.total_time_in_millis) was very high on that particular data node (a max of 1k for the bad node vs. 100 for the rest). It looks like slow merging was the root cause. Is there a way to fix this imbalance in node load?
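For reference, this is roughly how we compared the per-node merge times (a minimal sketch: the node names and values below are invented, and in a real cluster the data would come from the node stats API, e.g. GET /_nodes/stats/indices/merges):

```python
# Sketch: spotting a merge-time outlier from a node-stats-shaped response.
# Node names and numbers are made up for illustration only.
stats = {
    "nodes": {
        "node1": {"indices": {"merges": {"total_time_in_millis": 100}}},
        "node2": {"indices": {"merges": {"total_time_in_millis": 110}}},
        "node3": {"indices": {"merges": {"total_time_in_millis": 1000}}},
    }
}

# Pull out total merge time per node and find the worst offender.
merge_times = {
    name: node["indices"]["merges"]["total_time_in_millis"]
    for name, node in stats["nodes"].items()
}
worst = max(merge_times, key=merge_times.get)
print(worst, merge_times[worst])  # → node3 1000
```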
OK. Why do you think upgrading ES will solve this issue? Do you mean that the newer version handles slow merging or long GC pauses better?
For ES 1.7, can you offer any further suggestions for handling this issue? (We reviewed our cluster metrics from that time and did find a spike in indices.search.query_time_in_millis, but only on one of the data nodes.)
What sets the problematic node apart from the others? Are the indices being indexed into spread evenly across the nodes? Are you using routing or parent-child relationships, which could result in an uneven balance? Do all nodes have the same specification and configuration?
In our current cluster setup we have 6 data nodes, one of which is also a master. All nodes have the same specification and configuration regardless of node type.
We have multiple indices (ranging in size from 10M to 2G) that receive different amounts of traffic. There is no specific routing setup (everything is default), and our indexed documents do not use parent-child relationships.
One potential issue we discovered is that we are using the default settings for the number of shards and replicas when creating indices. Do you have any suggestions for optimal shard/replica assignments in this use case?
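To make the question concrete, here is the kind of explicit setting we are considering instead of the defaults (a hypothetical sketch; the index name and counts are placeholders, not a recommendation):

```
PUT /my_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
```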
Each nested document is represented by a number of Lucene documents behind the scenes. When you update a document, all of these are reindexed. If you have large and/or deeply nested documents, this can result in a lot of indexing work. How large is your largest nested document?