Elasticsearch 7.17.10 indexing bottleneck on i3.2xlarge and d3.2xlarge nodes in EKS

Still pretty concentrated.

Indexing rate by node, 7d:

Write queue by node, 7d:

Write queue by node, 2d (to see more recent data):
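
For a point-in-time snapshot of the same data, the write thread pool stats can also be pulled with the cat API (the columns here are just the ones I happen to find useful):

GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected,completed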

The output is too large for a comment - what's the preferred way to provide it?

Any idea why? Are the hotter nodes holding more of the shards that are seeing heavy indexing? There are still a lot of empty queues here.

Can you use https://gist.github.com?

Thanks. So again the hot threads output indicates most nodes are underutilised, with many idle write threads. There are a few over-hot nodes that could be a bottleneck though, and alerts-es-hot-8 looks to be struggling a bit with IO, spending a lot of time in force0 (i.e. fsync):

$ cat es-tasks | grep -B2 force0
   100.0% [cpu=38.2%, other=61.8%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[alerts-es-hot-8][write][T#8]'
     5/10 snapshots sharing following 27 elements
       java.base@20.0.1/sun.nio.ch.UnixFileDispatcherImpl.force0(Native Method)
--
   100.0% [cpu=33.3%, other=66.7%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[alerts-es-hot-8][write][T#6]'
     5/10 snapshots sharing following 27 elements
       java.base@20.0.1/sun.nio.ch.UnixFileDispatcherImpl.force0(Native Method)
--
       java.base@20.0.1/java.lang.Thread.run(Thread.java:1623)
     unique snapshot
       java.base@20.0.1/sun.nio.ch.UnixFileDispatcherImpl.force0(Native Method)
--
   100.0% [cpu=22.9%, other=77.1%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[alerts-es-hot-8][write][T#1]'
     5/10 snapshots sharing following 27 elements
       java.base@20.0.1/sun.nio.ch.UnixFileDispatcherImpl.force0(Native Method)
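
(For anyone following along: the es-tasks file above is hot threads output, i.e. what you get from something like

GET _nodes/hot_threads?threads=3&interval=500ms

where threads=3 and interval=500ms are the defaults - the exact parameters used here are an assumption on my part.)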

I'd suggest focussing on why there's such an imbalance in work across your nodes.

Is there a recommended way to do this with 7.17.10? I thought that shard allocation based on size or write load only became available in 8.x.

Thanks - OK, no significant holdups at the network layer, it seems.

First, work out whether it's actually needed (using e.g. GET _cat/shards). If there is a hot spot for a particular index, it's sometimes because the shards don't divide evenly across the available nodes, so it can be addressed just by changing the number of shards/replicas. Otherwise you can use the index.routing.allocation.total_shards_per_node index setting to force the shards of particular indices to spread out.
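
To make that concrete, a rough sketch (my-hot-index and the value 1 are placeholders - pick a total_shards_per_node that matches your shard/replica counts and node count, and note that setting it too low can leave shards unassigned):

GET _cat/shards?v&h=index,shard,prirep,node,docs,store&s=index

PUT my-hot-index/_settings
{
  "index.routing.allocation.total_shards_per_node": 1
}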

There are 217 empty indices that were created prematurely, but I've tested removing them without any noticeable impact on throughput - I was hoping that after rebalancing there would be fewer "lucky" nodes left holding mostly-idle shards. I also used the API to move busy shards from the busy nodes to the nodes with the lowest CPU usage, but that's obviously not a sustainable practice unless I write a service to redistribute them for me. I'd like to move to ES 8, which might help, but I'm getting some pressure to consider using OpenSearch to offload that troubleshooting effort, and upgrading past 7 makes that path a bit harder. I'd prefer to stay with ES, so thank you again for all your help in this thread - it's very much appreciated.
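
(The manual moves were via the cluster reroute API, roughly like this - the index name, shard number and destination node here are placeholders:)

POST _cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "my-hot-index",
        "shard": 3,
        "from_node": "alerts-es-hot-8",
        "to_node": "alerts-es-hot-2"
      }
    }
  ]
}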

I believe this was addressed in this part of the thread - only one index would be affected, and I've dropped that one from 5 shards to 4, which should allow it to be distributed across all nodes without any overlap.
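
(Assuming the index is created from an index template - the template name and pattern below are made up for illustration - the shard count change is just:)

PUT _index_template/alerts-hot
{
  "index_patterns": ["alerts-*"],
  "template": {
    "settings": {
      "index.number_of_shards": 4
    }
  }
}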

OpenSearch/OpenDistro are AWS run products and differ from the original Elasticsearch and Kibana products that Elastic builds and maintains. You may need to contact them directly for further assistance.

(This is an automated response from your friendly Elastic bot. Please report this post if you have any suggestions or concerns :elasticheart: )

AFAIK OpenSearch uses the same shard allocation algorithm as Elasticsearch 7.x, so I don't think that will change anything.

Since you're managing this Elasticsearch cluster via ECK, could you provide the Elasticsearch YAML file that you're using to deploy your cluster?

Yeah, the OpenSearch piece is just to make it someone else's problem with a support contract.
