Lots of bulk requests being rejected by data node

Hi all,

One of my data nodes is constantly running 4 "bulk" tasks, has a full waiting queue of 500 (the maximum queue size in our settings), and has accumulated millions of rejected requests.

GET _cat/thread_pool?v
(screenshot of the thread pool output: the bulk pool on this node shows 4 active threads, a queue of 500, and millions of rejections)
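
For reference, the same numbers can be pulled as text with a more targeted call (the column names are from the _cat/thread_pool docs; on newer versions the pool is named write instead of bulk):

GET _cat/thread_pool/bulk?v&h=node_name,name,active,queue,queue_size,rejected,completed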
If I check the tasks running there, none of them is "long running" (according to the running_time_in_nanos field), so I guess it's just a lot of small tasks arriving and overwhelming the queue.
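
Something along these lines lists those tasks with their running times (the action filter assumes the standard bulk action name; the node id is the one that appears in the task output further down):

GET _tasks?detailed&actions=indices:data/write/bulk*&nodes=NdV2wnCaTKWjNr1Skdss1g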

I also notice that some shards are still INITIALIZING (they belong to a very big index of around 4.2 TB), and they are indeed being recovered by the same problematic node that is rejecting requests. The explain API reports no errors for them. A week ago there were 10 shards in this state; I measured the progress (see the recovery call after the shard listing below) and it seems each shard takes around 15 hours to finish and become fully assigned. Only three remain now (they will probably complete in 3-4 days), but this huge duration worries me.

puma.compilation.pipeline.96f19f5b-bc84-4d4b-8694-b80a293e78e4-latest                             4     r      INITIALIZING                    10.184.95.1  data1-iil-001_data
puma.compilation.pipeline.96f19f5b-bc84-4d4b-8694-b80a293e78e4-latest                             5     r      INITIALIZING                    10.184.95.1  data1-iil-001_data
puma.compilation.pipeline.96f19f5b-bc84-4d4b-8694-b80a293e78e4-latest                             0     r      INITIALIZING                    10.184.95.1  data1-iil-001_data
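
Roughly how I've been tracking recovery progress and checking allocation for one of those shards (index and shard number are taken from the listing above; _cat/recovery and the allocation explain API are the standard calls for this):

GET _cat/recovery/puma.compilation.pipeline.96f19f5b-bc84-4d4b-8694-b80a293e78e4-latest?v&active_only=true&h=index,shard,time,stage,source_node,target_node,bytes_percent,translog_ops_percent

GET _cluster/allocation/explain
{
  "index": "puma.compilation.pipeline.96f19f5b-bc84-4d4b-8694-b80a293e78e4-latest",
  "shard": 4,
  "primary": false
}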

The tasks related to the initializations above look like this:

GET _tasks?actions=internal:*&detailed

"NdV2wnCaTKWjNr1Skdss1g": {
      "name": "data1-iil-001_data",
      "transport_address": "10.184.95.1:9300",
      "host": "icsl7074.iil.intel.com",
      "ip": "10.184.95.1:9300",
      "roles": [
        "data"
      ],
      "attributes": {
        "ibi_site": "iil",
        "box_type": "hot"
      },
      "tasks": {
        "NdV2wnCaTKWjNr1Skdss1g:299757825": {
          "node": "NdV2wnCaTKWjNr1Skdss1g",
          "id": 299757825,
          "type": "netty",
          "action": "internal:index/shard/recovery/translog_ops",
          "description": "",
          "start_time_in_millis": 1567631974145,
          "running_time_in_nanos": 95220328892,
          "cancellable": false
        },
        "NdV2wnCaTKWjNr1Skdss1g:299833445": {
          "node": "NdV2wnCaTKWjNr1Skdss1g",
          "id": 299833445,
          "type": "netty",
          "action": "internal:index/shard/recovery/translog_ops",
          "description": "",
          "start_time_in_millis": 1567632018694,
          "running_time_in_nanos": 50670944997,
          "cancellable": false
        },
        "NdV2wnCaTKWjNr1Skdss1g:299917403": {
          "node": "NdV2wnCaTKWjNr1Skdss1g",
          "id": 299917403,
          "type": "netty",
          "action": "internal:index/shard/recovery/translog_ops",
          "description": "",
          "start_time_in_millis": 1567632062311,
          "running_time_in_nanos": 7053950865,
          "cancellable": false
        }
      }
    }

Do you know the best way to keep handling incoming bulk requests while the recovery tasks above are occupying the available slots? (Our setup has 4 slots for bulk requests.)
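
In case more context is needed to answer, the recovery throttling settings and the node's thread pool configuration can be pulled with the standard cluster settings / node info APIs, roughly like this:

GET _cluster/settings?include_defaults=true&flat_settings=true

GET _nodes/NdV2wnCaTKWjNr1Skdss1g/thread_pool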
