ES v5.4.0 Bulk Requests Rejection

I have a cluster with 7 data nodes and 100+ indices, each with between 7 and 14 shards and 1 replica. As the indices are time-series, all of them get created at 12 AM UTC. The problem is that the primary shards of most of the indices are all allocated to one node (say N1), while only a few primary shards and the replicas are assigned to the other nodes.

Values set to:
cluster.routing.rebalance.enable: all
cluster.routing.allocation.allow_rebalance: indices_all_active
thread_pool.bulk.queue_size: 1000

During bulk indexing, all of the requests go to that node (N1) and its CPU utilization climbs. A lot of requests are also rejected because the bulk queue size is exceeded on that node, whereas the other nodes are pretty much idle.
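To put numbers on the skew, the primaries per node can be tallied from the output of `GET _cat/shards?h=index,shard,prirep,state,node`. Here is a minimal Python sketch of that tally; the function name and the sample rows are made up for illustration and are not real output from this cluster.

```python
from collections import Counter

def primaries_per_node(cat_shards_text):
    """Count STARTED primary shards per node.

    Expects whitespace-separated rows in the column order
    index, shard, prirep, state, node (the h= list above).
    """
    counts = Counter()
    for line in cat_shards_text.strip().splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # e.g. UNASSIGNED shards have no node column
        index, shard, prirep, state, node = fields[:5]
        if prirep == "p" and state == "STARTED":
            counts[node] += 1
    return counts

# Hypothetical sample rows, for illustration only.
SAMPLE = """
logs-2017.06.01 0 p STARTED N1
logs-2017.06.01 0 r STARTED N2
logs-2017.06.01 1 p STARTED N1
logs-2017.06.01 1 r STARTED N3
logs-2017.06.01 2 p STARTED N4
logs-2017.06.01 2 r UNASSIGNED
"""

if __name__ == "__main__":
    for node, count in primaries_per_node(SAMPLE).most_common():
        print(node, count)
```

If one node holds most of the primaries, that should show up clearly in the counts.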

Doubts:

  1. Is the above issue caused by all the primary shards being on one node?
  2. If yes, can I rebalance my primary shards by setting "cluster.routing.rebalance.enable" to "primaries"? Would this configuration first rebalance my primary shards and then balance the replicas? Are there any repercussions?
  3. Is there any other cause of the issue and is there a way to mitigate it?

Some reasons I could imagine for lots of new shards getting assigned to one node would be:

  1. At the time of allocation this node has the fewest number of shards. Perhaps there's a race around that decision when 100+ indices are simultaneously created?
  2. Disk watermarks or some other allocation limits are being hit on the other nodes, preventing shards from being assigned to them. One way to gain insight into how ES is making these allocation decisions is to manually move one of these primary shards to another node and see whether ES responds with some other rebalancing.
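As a rough sketch of the manual-move experiment: the request bodies below are for `POST _cluster/reroute` (a `move` command) and `GET _cluster/allocation/explain`, both of which exist in 5.x. The helper names, index name, and node names are placeholders I made up; the code only builds the JSON and does not contact a cluster.

```python
import json

def move_command(index, shard, from_node, to_node):
    """Body for POST _cluster/reroute that moves one shard copy."""
    return {
        "commands": [
            {
                "move": {
                    "index": index,
                    "shard": shard,
                    "from_node": from_node,
                    "to_node": to_node,
                }
            }
        ]
    }

def explain_body(index, shard, primary=True):
    """Body for GET _cluster/allocation/explain for one shard copy."""
    return {"index": index, "shard": shard, "primary": primary}

if __name__ == "__main__":
    # Placeholder index and node names, not taken from the real cluster.
    print(json.dumps(move_command("logs-2017.06.01", 0, "N1", "N2"), indent=2))
    print(json.dumps(explain_body("logs-2017.06.01", 0), indent=2))
```

Sending the reroute with `?dry_run` first should show the resulting cluster state without actually moving anything, and the explain output lists the per-node allocation decisions for that shard.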

If nothing jumps out at you along these lines, perhaps consider using shard allocation filtering to ensure a more widespread initial (and eventual) distribution of the shards.
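A related option, which is a different setting from allocation filtering proper, is to cap how many shards of a single index may sit on one node with `index.routing.allocation.total_shards_per_node`, applied through an index template so the midnight indices pick it up. Below is a minimal sketch that computes a cap and builds a 5.x-style template body; the function names, the `logs-*` pattern, and the headroom of 1 are my assumptions, not something from your setup.

```python
import math

def shards_per_node_limit(primaries, replicas, data_nodes, headroom=1):
    """Even share of shard copies per node, plus optional headroom.

    The setting is a hard limit, so with zero headroom shards can stay
    unassigned when a node drops out of the cluster.
    """
    total_copies = primaries * (1 + replicas)
    return math.ceil(total_copies / data_nodes) + headroom

def template_body(pattern, primaries, replicas, data_nodes, headroom=1):
    """Body for PUT _template/<name> (5.x uses the "template" key for the pattern)."""
    return {
        "template": pattern,
        "settings": {
            "index.number_of_shards": primaries,
            "index.number_of_replicas": replicas,
            "index.routing.allocation.total_shards_per_node": shards_per_node_limit(
                primaries, replicas, data_nodes, headroom
            ),
        },
    }

if __name__ == "__main__":
    # Hypothetical pattern; 14 primaries, 1 replica, 7 data nodes.
    print(template_body("logs-*", 14, 1, 7))
```

Note that this caps total shard copies (primaries and replicas together) per node for an index; it does not guarantee that the primaries themselves end up evenly spread.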

Good luck!

This seems to be a duplicate of this issue. Please do not open multiple threads for the same issue.
