Failed to execute on node after the cluster has been running for about 20 minutes

Hi, I'm new to Elastic. My Elasticsearch master nodes start throwing ReceiveTimeoutTransportException after the cluster has been running for about 20 minutes.

Elasticsearch version - 6.3.2
RAM - 8 GB
Heap size - 4 GB
Documents - 10(max)
Java 1.8
OS Ubuntu Server 18.04 VM in Azure

This is the config:

master-node-1

cluster.name: de-test
network.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: ["104.41.161.123:9300","13.78.36.124:9300","40.115.154.30:9300"]
discovery.zen.minimum_master_nodes: 2
node.name: master-node-1

master-node-2

cluster.name: de-test
network.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: ["104.41.161.123:9300","13.78.36.124:9300","40.115.154.30:9300"]
discovery.zen.minimum_master_nodes: 2
node.name: master-node-2

master-node-3

cluster.name: de-test
network.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: ["104.41.161.123:9300","13.78.36.124:9300","40.115.154.30:9300"]
discovery.zen.minimum_master_nodes: 2
node.name: master-node-3

Network interfaces of master-node-1:
eth0:
inet 172.20.0.4 netmask 255.255.255.0 broadcast 172.20.0.4
eth0:0:
inet 104.41.161.123 netmask 255.255.255.255 broadcast 104.41.161.123

This is the error:

......
    [2019-01-03T06:32:29,164][INFO ][o.e.c.s.MasterService    ] [master-node-2] zen-disco-elected-as-master ([1] nodes joined)[, ], reason: new_master {master-node-2}{-b8ZPb9DTcarvOYfHdgtHg}{2gVzr0YSQBGl9GcPW6XRug}{40.115.154.30}{40.115.154.30:9300}{ml.machine_memory=8365367296, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, added {{master-node-1}{PUGB8is1SJm9XEeRJ6bVjA}{xzjwGTefRSaxINpyi4_-Qw}{104.41.161.123}{104.41.161.123:9300}{ml.machine_memory=8365367296, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}
    [2019-01-03T06:32:29,245][INFO ][o.e.c.s.ClusterApplierService] [master-node-2] new_master {master-node-2}{-b8ZPb9DTcarvOYfHdgtHg}{2gVzr0YSQBGl9GcPW6XRug}{40.115.154.30}{40.115.154.30:9300}{ml.machine_memory=8365367296, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, added {{master-node-1}{PUGB8is1SJm9XEeRJ6bVjA}{xzjwGTefRSaxINpyi4_-Qw}{104.41.161.123}{104.41.161.123:9300}{ml.machine_memory=8365367296, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}, reason: apply cluster state (from master [master {master-node-2}{-b8ZPb9DTcarvOYfHdgtHg}{2gVzr0YSQBGl9GcPW6XRug}{40.115.154.30}{40.115.154.30:9300}{ml.machine_memory=8365367296, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} committed version [1] source [zen-disco-elected-as-master ([1] nodes joined)[, ]]])


    ......


[2019-01-03T06:32:33,515][INFO ][o.e.c.r.a.AllocationService] [master-node-2] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[sg6-auditlog-2019.01.03][2]] ...]).
[2019-01-03T06:43:14,485][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [master-node-2] failed to execute on node [VANTBBkPR_qZsNUCWttXGA]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [master-node-3][13.78.36.124:9300][cluster:monitor/nodes/stats[n]] request_id [1515] timed out after [15000ms]
	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:979) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:625) [elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]


[2019-01-03T06:43:14,485][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [master-node-2] failed to execute on node [PUGB8is1SJm9XEeRJ6bVjA]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [master-node-1][104.41.161.123:9300][cluster:monitor/nodes/stats[n]] request_id [1514] timed out after [15000ms]
	at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:979) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:625) [elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]

I reckon it's about 10 minutes between starting up your cluster and the first error:

[2019-01-03T06:32:33,515][INFO ][o.e.c.r.a.AllocationService] [master-node-2] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[sg6-auditlog-2019.01.03][2]] ...]).
[2019-01-03T06:43:14,485][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [master-node-2] failed to execute on node [VANTBBkPR_qZsNUCWttXGA]

(please use the </> button to format logs, it makes them much easier to read)

This suggests a connectivity issue - see the note in the manual about long-lived connections. Are you running on GCP? There is an idle connection timeout of 10 minutes by default there.

Sorry about the formatting. I'm running on Azure without a firewall.

These are the sysctl settings:

ming@mix-search-de-master-1:~$ sudo sysctl -p
kernel.sysrq = 1
vm.max_map_count = 655360
net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 20

It seems that Azure's default timeout is 4 minutes.

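Your keepalive time is 600 seconds, i.e. 10 minutes. That is longer than Azure's 4-minute idle timeout, so an idle connection can be dropped before the first keepalive probe is ever sent. I think you need to set it much lower in such a network.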
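
As a rough sketch of what "much lower" could look like, assuming the 4-minute (240-second) idle timeout mentioned above is what is dropping the connections: the fragment below keeps the other settings from the posted sysctl output and only lowers net.ipv4.tcp_keepalive_time. The value 120 is an illustrative assumption, not a tested recommendation; anything comfortably below the idle timeout should serve the same purpose.

```
# /etc/sysctl.conf fragment (sketch; 120 is an assumed example value)
# Send the first keepalive probe after 120 s of idle time, well under
# the assumed 240 s idle timeout, so the connection never looks idle.
net.ipv4.tcp_keepalive_time = 120
# Unchanged from the settings posted earlier in this thread.
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 20
```

After editing the file on each node, `sudo sysctl -p` (as used earlier in the thread) reloads it. Restarting Elasticsearch afterwards is probably the simplest way to make sure every transport connection is opened with the new value.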
