Bulk Request Handling - Requests being handled by single node

FatalGlitch · July 24, 2019, 6:21pm

We have a 19 node hot/warm/master cluster. Logstash is pointed at only the hot nodes.

Recently we have been seeing that no matter which hot node receives a bulk request, only a single node handles it, and usually not the node that received it. If we restart that node, the behavior simply moves to a different node, which then handles all the bulk requests.

We have verified that shards (primaries and replicas) are distributed across all nodes properly, with no hot spots. By reviewing the network traffic, we have also verified that Logstash is load balancing across the hot nodes.

Our indices have settings for routing to ensure they only go to nodes marked as hot, and we've verified that all the hot nodes do in fact still have their hot tagging.

But we cannot determine why all the requests are still being handled by only a single node.
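For anyone wanting to repeat the hot-tagging check, here is a minimal sketch that parses the text output of `GET _cat/nodeattrs?v&h=node,attr,value` and lists the nodes carrying a given attribute. The attribute name `box_type`, the value `hot`, and the node names in the sample are assumptions for illustration; the thread does not say which attribute this cluster uses.

```python
def nodes_with_attr(cat_nodeattrs_text, attr, value):
    """Return the sorted names of nodes whose custom attribute `attr` equals `value`.

    Expects the plain-text output of `_cat/nodeattrs?v&h=node,attr,value`,
    i.e. a header row followed by one whitespace-separated row per attribute.
    """
    lines = [ln for ln in cat_nodeattrs_text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    matches = set()
    for ln in lines[1:]:
        row = dict(zip(header, ln.split()))
        if row.get("attr") == attr and row.get("value") == value:
            matches.add(row["node"])
    return sorted(matches)


# Invented sample output; node names and attribute values are hypothetical.
SAMPLE_NODEATTRS = """\
node   attr                  value
hot-1  box_type              hot
hot-2  box_type              hot
warm-1 box_type              warm
hot-1  aws_availability_zone us-east-1a
"""

hot_nodes = nodes_with_attr(SAMPLE_NODEATTRS, "box_type", "hot")
```

Comparing the returned list against the expected set of hot nodes shows quickly whether any node has lost its tag.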

Christian_Dahlqvist · July 24, 2019, 6:29pm

How do you monitor that one node handles all requests? How many hot nodes do you have? How many indices are you actively indexing into? How many primary shards do these indices have?

FatalGlitch · July 24, 2019, 6:34pm

We are monitoring the bulk queue in the thread pool, along with the active threads. Both show activity (increasing, or even just at normal levels) only on that one node.

6 hot nodes

Indexing into ~250 indices, set up as dailies. They vary in the number of primary shards (we push to keep any single shard under 40GB), and all have 1 replica.
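As an illustration of the monitoring described above, here is a rough sketch that takes the text output of `GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected` and computes each node's share of the active plus queued tasks. The pool is named `write` on Elasticsearch 6.3 and later and `bulk` on earlier versions; the thread does not state the version, so the pool name and the sample rows below are assumptions.

```python
def parse_cat_table(text):
    """Parse whitespace-separated `_cat` output that includes a header row (?v)."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def queue_share(rows):
    """Return each node's fraction of all active + queued tasks in the pool."""
    load = {}
    for r in rows:
        node = r["node_name"]
        load[node] = load.get(node, 0) + int(r["active"]) + int(r["queue"])
    total = sum(load.values())
    return {node: (v / total if total else 0.0) for node, v in load.items()}


# Invented sample output; node names and numbers are hypothetical.
SAMPLE_THREAD_POOL = """\
node_name name  active queue rejected
hot-1     write 0      0     0
hot-2     write 8      120   3
hot-3     write 0      0     0
"""

rows = parse_cat_table(SAMPLE_THREAD_POOL)
shares = queue_share(rows)
```

A share close to 1.0 on one node, with the others near zero, matches the symptom reported in this thread.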

Christian_Dahlqvist · July 24, 2019, 6:39pm

It sounds to me like you have far too many indices. 250 daily indices sounds excessive. I would recommend that you read this blog post if you have not already.

It sounds like your indices vary in size quite a lot. What is the chance you have a hot index? Have you looked at indexing statistics per index and mapped this to nodes?
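One way to map per-index indexing statistics to nodes is sketched below. It assumes the text output of `GET _cat/shards?v&h=index,shard,prirep,node,indexing.index_total` and sums the indexing counters per node and per index. The index and node names in the sample are invented. The counters are cumulative, so comparing two snapshots taken some time apart gives a better picture of current load than a single reading.

```python
from collections import defaultdict


def indexing_by_node(cat_shards_text):
    """Sum `indexing.index_total` per node and per index from `_cat/shards` output."""
    lines = [ln for ln in cat_shards_text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    per_node = defaultdict(lambda: defaultdict(int))
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) != len(header):
            # Rows for unassigned or relocating shards have a different
            # number of fields; this sketch simply skips them.
            continue
        row = dict(zip(header, fields))
        per_node[row["node"]][row["index"]] += int(row["indexing.index_total"])
    return {node: dict(indices) for node, indices in per_node.items()}


# Invented sample output; index names, node names, and counts are hypothetical.
SAMPLE_SHARDS = """\
index             shard prirep node  indexing.index_total
logs-a-2019.07.24 0     p      hot-1 1000
logs-a-2019.07.24 0     r      hot-2 1000
logs-b-2019.07.24 0     p      hot-2 50000
logs-b-2019.07.24 1     p      hot-3 48000
"""

by_node = indexing_by_node(SAMPLE_SHARDS)
node_totals = {node: sum(idx.values()) for node, idx in by_node.items()}
```

If one index dominates the totals on the busy node, that points toward a hot index; an even spread points elsewhere.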

FatalGlitch · July 24, 2019, 6:42pm

We only keep around 4000 indices total open across hot and warm. In hot, we only keep about 750 indices total across 6 nodes, and the hot nodes are very large systems (16 CPUs, 64GB memory, 30GB heap, and NVMe disks). We've done a lot of tuning as well. The rest of the indices are closed to keep the overhead down. We are well aware of the limits on the number of indices in a cluster.

This is a new issue we've never seen before, and we've been running for a couple years at this scale and larger.

FatalGlitch · July 24, 2019, 6:43pm

If the issue was a hot index, we'd still see bulk requests for the non-hot indices being handled on the other nodes, but that is not the case.

Christian_Dahlqvist · July 24, 2019, 6:45pm

Do you use dynamic mappings for any of the indices? Is it possible some index updates mappings frequently (which could take a while for a cluster that size with that many indices and shards) causing the bulk queue to build up just there?

FatalGlitch · July 24, 2019, 6:48pm

No, mappings are all static.

Christian_Dahlqvist · July 24, 2019, 7:05pm

Then I do not have any more suggestions at the moment.

FatalGlitch · July 24, 2019, 7:47pm

We have a new theory.

We are using the AWS EC2 zen discovery, and it appears that the node handling all the bulk requests is always in Availability Zone A or B. If the node in A is handling all requests, we can kill it, and all the requests now get handled by the node in B. If we then kill the node in B, they all go back to node A.

Is it possible the EC2 zen discovery has some control/policy on how the bulk requests are handled?

