Hi all,
One of my data nodes is constantly running 4 active "bulk" tasks, has a full waiting queue of 500 (the max queue size in our settings), and has accumulated millions of rejected requests.
GET _cat/thread_pool?v
If I check the tasks running there, none of them is long-running (according to the running_time_in_nanos field), so I guess it's simply a large volume of small tasks arriving and overwhelming this queue.
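For reference, this is the narrowed-down call I use to watch that pool (the column list is just the subset I happen to look at; on newer versions the pool is called write rather than bulk):

GET _cat/thread_pool/bulk?v&h=node_name,name,active,queue,queue_size,rejected,completed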
I also notice that some shards are still INITIALIZING (they belong to a very big index of around 4.2 TB), and they are indeed being processed by the same problematic node that is rejecting the other requests. The allocation explain API reports no errors for them. A week ago there were 10 shards in this state; I measured the progress (roughly as shown after the shard listing below) and each shard takes around 15 hours to finish and become fully assigned. Only three remain now (they will probably complete in around 3-4 days), but this huge duration worries me.
puma.compilation.pipeline.96f19f5b-bc84-4d4b-8694-b80a293e78e4-latest 4 r INITIALIZING 10.184.95.1 data1-iil-001_data
puma.compilation.pipeline.96f19f5b-bc84-4d4b-8694-b80a293e78e4-latest 5 r INITIALIZING 10.184.95.1 data1-iil-001_data
puma.compilation.pipeline.96f19f5b-bc84-4d4b-8694-b80a293e78e4-latest 0 r INITIALIZING 10.184.95.1 data1-iil-001_data
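This is roughly how I follow the ongoing recoveries and re-check the allocation (the explain body just targets one of the replicas listed above):

GET _cat/recovery?v&active_only=true&h=index,shard,time,stage,files_percent,bytes_percent,translog_ops_percent

GET _cluster/allocation/explain
{
  "index": "puma.compilation.pipeline.96f19f5b-bc84-4d4b-8694-b80a293e78e4-latest",
  "shard": 4,
  "primary": false
}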
The tasks related to the initializations above look like this:
GET _tasks?actions=internal:*&detailed
"NdV2wnCaTKWjNr1Skdss1g": {
"name": "data1-iil-001_data",
"transport_address": "10.184.95.1:9300",
"host": "icsl7074.iil.intel.com",
"ip": "10.184.95.1:9300",
"roles": [
"data"
],
"attributes": {
"ibi_site": "iil",
"box_type": "hot"
},
"tasks": {
"NdV2wnCaTKWjNr1Skdss1g:299757825": {
"node": "NdV2wnCaTKWjNr1Skdss1g",
"id": 299757825,
"type": "netty",
"action": "internal:index/shard/recovery/translog_ops",
"description": "",
"start_time_in_millis": 1567631974145,
"running_time_in_nanos": 95220328892,
"cancellable": false
},
"NdV2wnCaTKWjNr1Skdss1g:299833445": {
"node": "NdV2wnCaTKWjNr1Skdss1g",
"id": 299833445,
"type": "netty",
"action": "internal:index/shard/recovery/translog_ops",
"description": "",
"start_time_in_millis": 1567632018694,
"running_time_in_nanos": 50670944997,
"cancellable": false
},
"NdV2wnCaTKWjNr1Skdss1g:299917403": {
"node": "NdV2wnCaTKWjNr1Skdss1g",
"id": 299917403,
"type": "netty",
"action": "internal:index/shard/recovery/translog_ops",
"description": "",
"start_time_in_millis": 1567632062311,
"running_time_in_nanos": 7053950865,
"cancellable": false
}
}
}
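The same list can presumably be narrowed down to just the recovery actions with a wildcard on actions=, e.g.:

GET _tasks?actions=internal:index/shard/recovery/*&detailed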
Do you know the best known method for continuing to handle incoming bulk requests while the recovery/init tasks above occupy the available slots? (The maximum number of concurrent bulk tasks is 4 in our setup.)
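For completeness, this is how I confirmed the pool size of 4 on that node (the node name is taken from the shard listing above; again, on newer versions the pool shows up as write rather than bulk):

GET _nodes/data1-iil-001_data/thread_pool?filter_path=nodes.*.thread_pool.bulk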