Cat API stuck if some data nodes are overloaded

Hi!

I am running a very large Elasticsearch cluster (JDK 13, Elasticsearch 6.8.5) with ~200 TB of data and a lot of nodes spread over several data centers. We index ~6-7 TB of logs daily.

To be short:

curl -X GET "master.elasticsearch.ec.odkl.ru:9200/_cluster/health?pretty=true"
{
  "cluster_name" : "graylog",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 460,
  "number_of_data_nodes" : 360,
  "active_primary_shards" : 3960,
  "active_shards" : 7920,
  "relocating_shards" : 3,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

From time to time I face problems with various APIs, such as the cat API, when 2-3 data nodes are very busy and don't respond to the master in time.

curl -X GET "localhost:9200/_cat/shards?pretty" <- this request just hangs forever.
curl -X GET "localhost:9200/_cluster/health?pretty=true" <- this one goes well.
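As a stopgap, the hang can at least be bounded on the client side; a sketch (curl's `--max-time` is standard, while `master_timeout` is a guess and may not help if the hang is in the stats fan-out to data nodes rather than on the master):

```shell
# Client-side bound: give up after 10 seconds instead of hanging forever.
curl --max-time 10 -X GET "localhost:9200/_cat/shards?pretty"

# Server-side bound (if your version honours it for this API): ask the
# master to stop waiting after 10 seconds.
curl --max-time 30 -X GET "localhost:9200/_cat/shards?master_timeout=10s&pretty"
```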

The log on the master while the problem is happening:

Nov 29 16:46:10 6.master.elasticsearch.domain.name elasticsearch[47]: [2019-11-29T16:46:10,073][WARN ][o.e.t.TransportService   ] [6.master.elasticsearch.domain.name] Received response for a request that has timed out, sent [527048ms] ago, timed out [512041ms] ago, action [cluster:monitor/nodes/stats[n]], node [{76.data.elasticsearch.domain.name}{r4n33INBTGiomOSI0lLu3w}{S9CFoSH6Swe-mvDVXtCMuA}{10.21.131.208}{10.21.131.208:9300}{zone=dc, xpack.installed=true}], id [50696273]
Nov 29 16:46:19 6.master.elasticsearch.domain.name elasticsearch[47]: [2019-11-29T16:46:19,866][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [6.master.elasticsearch.domain.name] failed to execute on node [r4n33INBTGiomOSI0lLu3w]
Nov 29 16:46:19 6.master.elasticsearch.domain.name elasticsearch[47]: org.elasticsearch.transport.ReceiveTimeoutTransportException: [76.data.elasticsearch.domain.name][10.21.131.208:9300][cluster:monitor/nodes/stats[n]] request_id [50967043] timed out after [15009ms]
Nov 29 16:46:19 6.master.elasticsearch.domain.name elasticsearch[47]: at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1016) [elasticsearch-6.8.5.jar:6.8.5]
Nov 29 16:46:19 6.master.elasticsearch.domain.name elasticsearch[47]: at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.8.5.jar:6.8.5]
Nov 29 16:46:19 6.master.elasticsearch.domain.name elasticsearch[47]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
Nov 29 16:46:19 6.master.elasticsearch.domain.name elasticsearch[47]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
Nov 29 16:46:19 6.master.elasticsearch.domain.name elasticsearch[47]: at java.lang.Thread.run(Thread.java:830) [?:?]
Nov 29 16:46:20 6.master.elasticsearch.domain.name elasticsearch[47]: [2019-11-29T16:46:20,009][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [6.master.elasticsearch.domain.name] failed to execute on node [IdrbFd5bQ6mr-Niia6ve3A]
Nov 29 16:46:20 6.master.elasticsearch.domain.name elasticsearch[47]: org.elasticsearch.transport.ReceiveTimeoutTransportException: [30.data.elasticsearch.domain.name][10.21.131.157:9300][cluster:monitor/nodes/stats[n]] request_id [50967277] timed out after [15008ms]
Nov 29 16:46:20 6.master.elasticsearch.domain.name elasticsearch[47]: at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1016) [elasticsearch-6.8.5.jar:6.8.5]
Nov 29 16:46:20 6.master.elasticsearch.domain.name elasticsearch[47]: at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.8.5.jar:6.8.5]
Nov 29 16:46:20 6.master.elasticsearch.domain.name elasticsearch[47]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
Nov 29 16:46:20 6.master.elasticsearch.domain.name elasticsearch[47]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
Nov 29 16:46:20 6.master.elasticsearch.domain.name elasticsearch[47]: at java.lang.Thread.run(Thread.java:830) [?:?]
Nov 29 16:46:35 6.master.elasticsearch.domain.name elasticsearch[47]: [2019-11-29T16:46:35,010][WARN ][o.e.c.InternalClusterInfoService] [6.master.elasticsearch.domain.name] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
Nov 29 16:46:55 6.master.elasticsearch.domain.name elasticsearch[47]: [2019-11-29T16:46:55,530][WARN ][o.e.t.TransportService   ] [6.master.elasticsearch.domain.name] Received response for a request that has timed out, sent [532450ms] ago, timed out [517444ms] ago, action [cluster:monitor/nodes/stats[n]], node [{76.data.elasticsearch.domain.name}{r4n33INBTGiomOSI0lLu3w}{S9CFoSH6Swe-mvDVXtCMuA}{10.21.131.208}{10.21.131.208:9300}{zone=dc, xpack.installed=true}], id [50717235]

Actually, a request to the master should not freeze just because 1-2 nodes are stuck for some reason.

Timeouts configured for the cluster:

discovery.zen.publish_timeout 5s
discovery.zen.commit_timeout 5s
transport.tcp.connect_timeout: 3s
discovery.zen.fd.ping_timeout: 10s
discovery.zen.fd.ping_retries: 3
discovery.zen.fd.ping_interval: 1s
discovery.zen.join_timeout: 120s
http.tcp.keep_alive: true
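To double-check which of these settings are actually in effect (including defaults you haven't overridden), the cluster settings API can be queried with defaults included; a sketch, assuming your version supports `include_defaults`:

```shell
# flat_settings makes the keys grep-friendly; filter for timeout-related ones.
curl -s "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" \
  | grep -E 'timeout|ping'
```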

It seems like a bug to me. I would greatly appreciate any help with this case, since the stability of this API is mission-critical for our business (we need to know which nodes can be turned off for any reason without the cluster going "red").

Did a bit more research on that issue.

The problematic data nodes are all stuck in a very special way:

node_name                                                    name                           active rejected size t       max
25.data.elasticsearch.domain.name          fetch_shard_started      0        0    1 scaling               112
25.data.elasticsearch.domain.name          fetch_shard_store        0        0    1 scaling               112
25.data.elasticsearch.domain.name          flush                    0        0    2 scaling                 5
25.data.elasticsearch.domain.name          force_merge              0        0    1 fixed                   1
25.data.elasticsearch.domain.name          generic                  0        0    5 scaling                 5
25.data.elasticsearch.domain.name          management               5        0    5 scaling                 5   <-- exhausted
25.data.elasticsearch.domain.name          refresh                  0        0   10 scaling                10
25.data.elasticsearch.domain.name          search                   0        0   80 fixed_auto_queue_size  80
25.data.elasticsearch.domain.name          write                    0        0   56 fixed                  56
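The thread-pool table above can be reproduced with the cat thread_pool API; the column list here is an assumption reconstructed from the headers shown:

```shell
# ?v prints the header row; h= selects the columns to display.
curl -s "localhost:9200/_cat/thread_pool?v&h=node_name,name,active,rejected,size,type,max"
```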

Actually, the node isn't stuck searching/indexing; it's stuck on something else.
The docs don't describe the management thread pool at all.

The management threadpool is used to serve some cluster management actions like _cat/shards; if it is busy then that would indeed explain why those APIs do not return.

Normally I'd recommend looking at the hot threads API, but this API also uses the management threadpool so it's likely that this won't work. Instead, I recommend getting a thread dump from the process directly using something like jstack.
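A sketch of capturing such thread dumps without going through the (possibly blocked) management threadpool; it assumes the JDK tools `jps`/`jstack` are on the PATH and that you run them as the same user as the Elasticsearch process:

```shell
# Find the Elasticsearch JVM's PID via jps (main class is "Elasticsearch").
ES_PID=$(jps | awk '/Elasticsearch/ {print $1}')

# Take a few dumps spaced apart: threads that appear in the same place in
# every dump are genuinely stuck, not just momentarily busy.
for i in 1 2 3; do
  jstack "$ES_PID" > "/tmp/es-threaddump-$(date +%s).txt"
  sleep 5
done
```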

We had decreased the generic thread count.

I'm going to revert these settings to the defaults:
thread_pool.generic.core: 3
thread_pool.generic.max: 5
thread_pool.generic.keep_alive: 5m

I spent some time investigating and found this on the hung node:

{
  "tasks" : {
    "0Wd3U0XoQGeB4rnOvQKwVA:32897339" : {
      "node" : "0Wd3U0XoQGeB4rnOvQKwVA",
      "id" : 32897339,
      "type" : "transport",
      "action" : "indices:monitor/stats",
      "start_time_in_millis" : 1575359174649,
      "running_time_in_nanos" : 139075012452840,
      "cancellable" : false,
      "headers" : { }
    },

Isn't that supposed to be killed by some kind of timeout?
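For what it's worth, stuck tasks like this can be surfaced with the task management API and a filter on their running time; a sketch (jq is an assumption, not something Elasticsearch requires). Here it is applied to a sample response shaped like the excerpt above; against a live cluster you would pipe `curl -s "localhost:9200/_tasks?detailed"` instead of the heredoc:

```shell
# Keep only tasks that have been running for more than 60 seconds
# (running_time_in_nanos > 6e10) and print their id, action and runtime.
cat <<'EOF' | jq '.nodes[].tasks | to_entries[]
    | select(.value.running_time_in_nanos > 60000000000)
    | {task: .key, action: .value.action, running_s: (.value.running_time_in_nanos / 1e9)}'
{
  "nodes" : {
    "0Wd3U0XoQGeB4rnOvQKwVA" : {
      "tasks" : {
        "0Wd3U0XoQGeB4rnOvQKwVA:32897339" : {
          "action" : "indices:monitor/stats",
          "running_time_in_nanos" : 139075012452840
        }
      }
    }
  }
}
EOF
```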

jstack: https://controlc.com/3acbf4a6

This node looks overloaded with stats requests. All five management threads are serving indices stats requests; three of them are busy getting the last-modified time of a translog file and the other two look busy doing something with per-segment stats.

It's possible that you're overloading your cluster with stats calls. Maybe this is a consequence of hitting GET _cat/shards too hard, or maybe you have IO issues. I'm also suspicious about your choice of JVM and GC, since I don't think they're supported.