Rejections in Thread_Pool v6.2.4

I have a problem with a node (a different one each time) that stops responding. When I check _cat/thread_pool while a node is in this hanging state, the get queue is full and has started rejecting requests.
The log file, though, indicates that it is the bulk queue that is full:
"bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@46a91dcd[Running, pool size = 12, active threads = 12, queued tasks = 200, completed tasks = 13766271]"
There are no errors in the log file preceding these, but I found plenty of entries similar to this one (the last before the error):
[DATEandTIME][DEBUG][o.e.c.u.c.QueueResizingEsThreadPoolExecutor] [NODENAME/search]: there were [2000] tasks in [14.8s], avg task time [10.6ms], EWMA task execution [4ms], [93.99 tasks/s], optimal queue is [93], current capacity [1000]
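For context, that DEBUG line comes from the auto-resizing executor that backs the search pool in 6.x: the "optimal queue" it reports is roughly the measured task rate multiplied by the target response time (here ~94 tasks/s x 1s ≈ 93). If I read the 6.2 docs correctly, the resizing bounds are controlled by the settings below in elasticsearch.yml; the values shown are only the defaults as I understand them, not something we have changed:

thread_pool.search.min_queue_size: 1000         # lower bound for the auto-resized queue
thread_pool.search.max_queue_size: 1000         # upper bound for the auto-resized queue
thread_pool.search.auto_queue_frame_size: 2000  # number of tasks measured per adjustment window
thread_pool.search.target_response_time: 1s     # target used to derive the optimal queue size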

/_cat/thread_pool
node      name  active  queue  rejected
nodename  bulk  0       0      0
nodename  get   12      1000   28304
# other thread pools removed; they were all zeros
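For completeness, the snapshot above can be requested with explicit columns so that the configured capacity shows up next to the current depth; the header names are the ones I believe _cat/thread_pool accepts in 6.x:

GET _cat/thread_pool?v&h=node_name,name,type,active,queue,queue_size,rejected,completed

Here "queue" is the point-in-time depth while "rejected" is a counter accumulated since the node started, so a pool can show an empty queue and still carry a large rejected count.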

Q: Why this discrepancy? Is there somewhere else I can look? And why is there no mention in the log file (at debug level) of the GET thread pool rejecting?

We are using 6.2.4, and yes, I am aware that this is fixed in later releases, but I still want to understand what is happening and be able to point somewhere and say, "Yes, this makes sense".
We have 5 nodes, 5 primary shards with 1 replica each, a 26 GB heap per node, and more than twice that in RAM.

Top lines from error message:

[DATEandTIME][DEBUG][r.suppressed             ] path: /index-listresults/worklistresults/a49670d4-1c31-ec11-8f2f-005056a48c06, params: {index=index-listresults, id=a49670d4-1c31-ec11-8f2f-005056a48c06, type=listresults}
org.elasticsearch.transport.RemoteTransportException: [NODENAME][IP:9300][indices:data/write/bulk[s][p]]
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.transport.TransportService$7@21b7ccd0 on EsThreadPoolExecutor[name = NODENAME/bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@46a91dcd[Running, pool size = 12, active threads = 12, queued tasks = 200, completed tasks = 13766271]]
	at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:48) ~[elasticsearch-6.2.4.jar:6.2.4]
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) ~[?:1.8.0_265]
	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) ~[?:1.8.0_265]
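To cross-check the log against the live counters, a per-node view of just the bulk and get pools can be pulled with a response filter; the filter_path expression is my own sketch against the 6.x node stats layout, so treat it as an assumption:

GET _nodes/stats/thread_pool?filter_path=nodes.*.name,nodes.*.thread_pool.bulk,nodes.*.thread_pool.get&pretty

Each pool in that response reports threads, queue, active, rejected, largest and completed, with "rejected" again being cumulative since node start.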

Thank you in advance.

6.2 is very old and well past EOL, please upgrade :slight_smile:

What is the output from the _cluster/stats?pretty&human API?

I would upgrade if I could, I promise. :slight_smile:

We have 7 indices, but only the main index has any real size; the others are only a few MB each and contain relatively few documents, so I assume they don't really matter. Note that the stats are fresh and not from "the event".

{
  "_nodes" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "cluster_name" : "CLUSTER_NAME",
  "timestamp" : 1635313484184,
  "status" : "green",
  "indices" : {
    "count" : 7,
    "shards" : {
      "total" : 70,
      "primaries" : 35,
      "replication" : 1.0,
      "index" : {
        "shards" : {
          "min" : 10,
          "max" : 10,
          "avg" : 10.0
        },
        "primaries" : {
          "min" : 5,
          "max" : 5,
          "avg" : 5.0
        },
        "replication" : {
          "min" : 1.0,
          "max" : 1.0,
          "avg" : 1.0
        }
      }
    },
    "docs" : {
      "count" : 55173326,
      "deleted" : 20068581
    },
    "store" : {
      "size_in_bytes" : 198764766069
    },
    "fielddata" : {
      "memory_size_in_bytes" : 0,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 9171749944,
      "total_count" : 17385272448,
      "hit_count" : 11384077626,
      "miss_count" : 6001194822,
      "cache_size" : 43212,
      "cache_count" : 46960464,
      "evictions" : 46917252
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 367,
      "memory_in_bytes" : 415248745,
      "terms_memory_in_bytes" : 338595885,
      "stored_fields_memory_in_bytes" : 65508128,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 303936,
      "points_memory_in_bytes" : 8569848,
      "doc_values_memory_in_bytes" : 2270948,
      "index_writer_memory_in_bytes" : 1548096,
      "version_map_memory_in_bytes" : 137352,
      "fixed_bit_set_memory_in_bytes" : 18826160,
      "max_unsafe_auto_id_timestamp" : -1,
      "file_sizes" : { }
    }
  },
  "nodes" : {
    "count" : {
      "total" : 5,
      "data" : 5,
      "coordinating_only" : 0,
      "master" : 5,
      "ingest" : 5
    },
    "versions" : [
      "6.2.4"
    ],
    "os" : {
      "available_processors" : 60,
      "allocated_processors" : 60,
      "names" : [
        {
          "name" : "Windows Server 2016",
          "count" : 1
        },
        {
          "name" : "Windows Server 2019",
          "count" : 4
        }
      ],
      "mem" : {
        "total_in_bytes" : 343594700800,
        "free_in_bytes" : 138332061696,
        "used_in_bytes" : 205262639104,
        "free_percent" : 40,
        "used_percent" : 60
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 11
      },
      "open_file_descriptors" : {
        "min" : -1,
        "max" : -1,
        "avg" : 0
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 1363259788,
      "versions" : [
        {
          "version" : "1.8.0_265",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "25.265-b01",
          "vm_vendor" : "Amazon.com Inc.",
          "count" : 5
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 72765756608,
        "heap_max_in_bytes" : 139150295040
      },
      "threads" : 846
    },
    "fs" : {
      "total_in_bytes" : 1878898225152,
      "free_in_bytes" : 1654923526144,
      "available_in_bytes" : 1654923526144
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "netty4" : 5
      },
      "http_types" : {
        "netty4" : 5
      }
    }
  }
}

You've got an extremely high percentage of deleted docs: roughly 36% (20,068,581 deleted against 55,173,326 live documents in the stats above).

Also why do you have so many shards for such a small data set?

Can you make them all the same OS?

Thanks, Mark, for the fast reply.

It is the customer's environment, so I cannot change their OS, but I have remarked upon it. It was originally the same OS everywhere, but when the customer rebuilt this particular node (thinking it was the only node with the problem), they picked the wrong OS by mistake.
Our use case is to store documents and update them, hence the many deleted documents. The number of shards is simply because the defaults were used when the indices were created; this is corrected in future releases (see the index-creation sketch below).
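For reference, a minimal sketch of creating an index with an explicit shard count instead of the 6.x default of 5; the index name and numbers here are only illustrative, not what we actually ship:

PUT /index-listresults-v2
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}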
Any ideas on the discrepancy between the logs and the stats, i.e. why _cat/thread_pool shows GET rejections while the logs say BULK? Or even better, any reasons why this error occurs, and ideas on how it can be prevented (other than upgrading to a later release :slight_smile: )?
