Lots of bulk requests being rejected by data node

Hi all,

One of my data nodes is constantly running 4 "bulk" tasks, has a full waiting queue of 500 (the maximum queue size in our settings), and has accumulated millions of rejected requests.

GET _cat/thread_pool?v
(screenshot of the thread pool output: the bulk pool on this node shows 4 active threads, a queue of 500, and millions of rejections)
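
For reference, the same numbers can be pulled as text with a more targeted call (the column names are from the _cat/thread_pool docs; on newer versions the pool is named write instead of bulk):

GET _cat/thread_pool/bulk?v&h=node_name,name,active,queue,queue_size,rejected,completed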
If I check the tasks running there, none of them is "long running" (according to the running_time_in_nanos field), so I guess it's just a lot of small tasks arriving and overwhelming the queue.
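
Something along these lines lists those tasks with their running times (the action filter assumes the standard bulk action name; the node id is the one that appears in the task output further down):

GET _tasks?detailed&actions=indices:data/write/bulk*&nodes=NdV2wnCaTKWjNr1Skdss1g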

I also notice that some shards are still INITIALIZING (they belong to a very big index of around 4.2 TB), and they are indeed being recovered by the same problematic node that is rejecting requests. The explain API reports no errors for them. A week ago there were 10 shards in this state; I measured the progress (see the recovery call after the shard listing below) and it seems each shard takes around 15 hours to finish and become fully assigned. Only three remain now (they will probably complete in 3-4 days), but this huge duration worries me.

puma.compilation.pipeline.96f19f5b-bc84-4d4b-8694-b80a293e78e4-latest                             4     r      INITIALIZING                    10.184.95.1  data1-iil-001_data
puma.compilation.pipeline.96f19f5b-bc84-4d4b-8694-b80a293e78e4-latest                             5     r      INITIALIZING                    10.184.95.1  data1-iil-001_data
puma.compilation.pipeline.96f19f5b-bc84-4d4b-8694-b80a293e78e4-latest                             0     r      INITIALIZING                    10.184.95.1  data1-iil-001_data
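
Roughly how I've been tracking recovery progress and checking allocation for one of those shards (index and shard number are taken from the listing above; _cat/recovery and the allocation explain API are the standard calls for this):

GET _cat/recovery/puma.compilation.pipeline.96f19f5b-bc84-4d4b-8694-b80a293e78e4-latest?v&active_only=true&h=index,shard,time,stage,source_node,target_node,bytes_percent,translog_ops_percent

GET _cluster/allocation/explain
{
  "index": "puma.compilation.pipeline.96f19f5b-bc84-4d4b-8694-b80a293e78e4-latest",
  "shard": 4,
  "primary": false
}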

The tasks related to the initializations above look like this:

GET _tasks?actions=internal:*&detailed

"NdV2wnCaTKWjNr1Skdss1g": {
      "name": "data1-iil-001_data",
      "transport_address": "10.184.95.1:9300",
      "host": "icsl7074.iil.intel.com",
      "ip": "10.184.95.1:9300",
      "roles": [
        "data"
      ],
      "attributes": {
        "ibi_site": "iil",
        "box_type": "hot"
      },
      "tasks": {
        "NdV2wnCaTKWjNr1Skdss1g:299757825": {
          "node": "NdV2wnCaTKWjNr1Skdss1g",
          "id": 299757825,
          "type": "netty",
          "action": "internal:index/shard/recovery/translog_ops",
          "description": "",
          "start_time_in_millis": 1567631974145,
          "running_time_in_nanos": 95220328892,
          "cancellable": false
        },
        "NdV2wnCaTKWjNr1Skdss1g:299833445": {
          "node": "NdV2wnCaTKWjNr1Skdss1g",
          "id": 299833445,
          "type": "netty",
          "action": "internal:index/shard/recovery/translog_ops",
          "description": "",
          "start_time_in_millis": 1567632018694,
          "running_time_in_nanos": 50670944997,
          "cancellable": false
        },
        "NdV2wnCaTKWjNr1Skdss1g:299917403": {
          "node": "NdV2wnCaTKWjNr1Skdss1g",
          "id": 299917403,
          "type": "netty",
          "action": "internal:index/shard/recovery/translog_ops",
          "description": "",
          "start_time_in_millis": 1567632062311,
          "running_time_in_nanos": 7053950865,
          "cancellable": false
        }
      }
    }

Do you know the best way to keep handling incoming bulk requests while the recovery tasks above are occupying the available slots? (Our setup has 4 slots for bulk requests.)
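
In case more context is needed to answer, the recovery throttling settings and the node's thread pool configuration can be pulled with the standard cluster settings / node info APIs, roughly like this:

GET _cluster/settings?include_defaults=true&flat_settings=true

GET _nodes/NdV2wnCaTKWjNr1Skdss1g/thread_pool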
