Hi!
I am running a very large Elasticsearch cluster (JDK 13, Elasticsearch 6.8.5) with ~200 TB of data and many nodes spread across several data centers. We index ~6-7 TB of logs daily.
In short:
curl -X GET "master.elasticsearch.ec.odkl.ru:9200/_cluster/health?pretty=true"
{
"cluster_name" : "graylog",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 460,
"number_of_data_nodes" : 360,
"active_primary_shards" : 3960,
"active_shards" : 7920,
"relocating_shards" : 3,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}
From time to time I face problems with various APIs, such as the cat APIs, when 2-3 data nodes are very busy and don't respond to the master in time.
curl -X GET "localhost:9200/_cat/shards?pretty" <- this request just hangs forever.
curl -X GET "localhost:9200/_cluster/health?pretty=true" <- this one goes well.
The master's log while the problem is happening:
Nov 29 16:46:10 6.master.elasticsearch.domain.name elasticsearch[47]: [2019-11-29T16:46:10,073][WARN ][o.e.t.TransportService ] [6.master.elasticsearch.domain.name] Received response for a request that has timed out, sent [527048ms] ago, timed out [512041ms] ago, action [cluster:monitor/nodes/stats[n]], node [{76.data.elasticsearch.domain.name}{r4n33INBTGiomOSI0lLu3w}{S9CFoSH6Swe-mvDVXtCMuA}{10.21.131.208}{10.21.131.208:9300}{zone=dc, xpack.installed=true}], id [50696273]
Nov 29 16:46:19 6.master.elasticsearch.domain.name elasticsearch[47]: [2019-11-29T16:46:19,866][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [6.master.elasticsearch.domain.name] failed to execute on node [r4n33INBTGiomOSI0lLu3w]
Nov 29 16:46:19 6.master.elasticsearch.domain.name elasticsearch[47]: org.elasticsearch.transport.ReceiveTimeoutTransportException: [76.data.elasticsearch.domain.name][10.21.131.208:9300][cluster:monitor/nodes/stats[n]] request_id [50967043] timed out after [15009ms]
Nov 29 16:46:19 6.master.elasticsearch.domain.name elasticsearch[47]: at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1016) [elasticsearch-6.8.5.jar:6.8.5]
Nov 29 16:46:19 6.master.elasticsearch.domain.name elasticsearch[47]: at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.8.5.jar:6.8.5]
Nov 29 16:46:19 6.master.elasticsearch.domain.name elasticsearch[47]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
Nov 29 16:46:19 6.master.elasticsearch.domain.name elasticsearch[47]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
Nov 29 16:46:19 6.master.elasticsearch.domain.name elasticsearch[47]: at java.lang.Thread.run(Thread.java:830) [?:?]
Nov 29 16:46:20 6.master.elasticsearch.domain.name elasticsearch[47]: [2019-11-29T16:46:20,009][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [6.master.elasticsearch.domain.name] failed to execute on node [IdrbFd5bQ6mr-Niia6ve3A]
Nov 29 16:46:20 6.master.elasticsearch.domain.name elasticsearch[47]: org.elasticsearch.transport.ReceiveTimeoutTransportException: [30.data.elasticsearch.domain.name][10.21.131.157:9300][cluster:monitor/nodes/stats[n]] request_id [50967277] timed out after [15008ms]
Nov 29 16:46:20 6.master.elasticsearch.domain.name elasticsearch[47]: at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1016) [elasticsearch-6.8.5.jar:6.8.5]
Nov 29 16:46:20 6.master.elasticsearch.domain.name elasticsearch[47]: at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-6.8.5.jar:6.8.5]
Nov 29 16:46:20 6.master.elasticsearch.domain.name elasticsearch[47]: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
Nov 29 16:46:20 6.master.elasticsearch.domain.name elasticsearch[47]: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
Nov 29 16:46:20 6.master.elasticsearch.domain.name elasticsearch[47]: at java.lang.Thread.run(Thread.java:830) [?:?]
Nov 29 16:46:35 6.master.elasticsearch.domain.name elasticsearch[47]: [2019-11-29T16:46:35,010][WARN ][o.e.c.InternalClusterInfoService] [6.master.elasticsearch.domain.name] Failed to update shard information for ClusterInfoUpdateJob within 15s timeout
Nov 29 16:46:55 6.master.elasticsearch.domain.name elasticsearch[47]: [2019-11-29T16:46:55,530][WARN ][o.e.t.TransportService ] [6.master.elasticsearch.domain.name] Received response for a request that has timed out, sent [532450ms] ago, timed out [517444ms] ago, action [cluster:monitor/nodes/stats[n]], node [{76.data.elasticsearch.domain.name}{r4n33INBTGiomOSI0lLu3w}{S9CFoSH6Swe-mvDVXtCMuA}{10.21.131.208}{10.21.131.208:9300}{zone=dc, xpack.installed=true}], id [50717235]
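When this happens, the log above tells me which data nodes are not answering; checking what such a node is busy with looks roughly like this (the node name is taken from the log above, the thread count is arbitrary):
curl -X GET "localhost:9200/_nodes/76.data.elasticsearch.domain.name/hot_threads?threads=5"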
In my opinion, a request to the master should not freeze just because 1-2 data nodes are stuck for some reason.
Timeouts configured for the cluster:
discovery.zen.publish_timeout: 5s
discovery.zen.commit_timeout: 5s
transport.tcp.connect_timeout: 3s
discovery.zen.fd.ping_timeout: 10s
discovery.zen.fd.ping_retries: 3
discovery.zen.fd.ping_interval: 1s
discovery.zen.join_timeout: 120s
http.tcp.keep_alive: true
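For reference, the effective values can be double-checked against the live settings with something along these lines (include_defaults also returns values that were not set explicitly):
curl -s -X GET "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" | grep -E "discovery\.zen|transport\.tcp|http\.tcp"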
This looks like a bug to me. I would greatly appreciate any help with this case, since the stability of this API is mission-critical for our business (we need to know which nodes can be turned off for any reason without turning the cluster "red").
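For context, the check we would like to rely on before switching a node off looks roughly like this: exclude the node from allocation and wait for the cluster to go green (the node name is just an example taken from the log above, and 60s is an arbitrary wait):
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "76.data.elasticsearch.domain.name"
  }
}
'
curl -X GET "localhost:9200/_cluster/health?wait_for_status=green&timeout=60s&pretty=true"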