ES Prod cluster Receive Timeout Transport Exception

Hi All,

In our production cluster (ES 1.7.2) we have 3 dedicated master nodes, 3 data nodes and 1 client node, and in the dedicated master's log I can see the warning messages below:

Master log details

[2016-01-12 23:39:58,436][DEBUG][action.admin.cluster.node.stats] [dayrhecfdm009_PROD_MASTER] failed to execute on node [a0lfkseaTMCtDxPSUAHiaw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [dayrhecfdm008_PROD_DATA][inet[dayrhecfdm008.enterprisenet.org/10.7.157.12:9260]][cluster:monitor/nodes/stats[n]] request_id [8091207] timed out after [15001ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[2016-01-12 23:40:08,351][WARN ][shield.transport ] [dayrhecfdm009_PROD_MASTER] Received response for a request that has timed out, sent [24916ms] ago, timed out [9915ms] ago, action [cluster:monitor/nodes/stats[n]], node [[dayrhecfdm008_PROD_DATA][a0lfkseaTMCtDxPSUAHiaw][dayrhecfdm008.enterprisenet.org][inet[dayrhecfdm008.enterprisenet.org/10.7.157.12:9260]]{max_local_storage_nodes=1, master=false}], id [8091207]
[2016-01-12 23:40:28,436][DEBUG][action.admin.cluster.node.stats] [dayrhecfdm009_PROD_MASTER] failed to execute on node [QWPVOoLqQo6m3gud_yMaRQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [dayrhecfdm009_PROD_DATA][inet[dayrhecfdm009.enterprisenet.org/10.7.157.13:9260]][cluster:monitor/nodes/stats[n]] request_id [8091834] timed out after [15001ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[2016-01-12 23:40:31,155][WARN ][shield.transport ] [dayrhecfdm009_PROD_MASTER] Received response for a request that has timed out, sent [30564ms] ago, timed out [563ms] ago, action [internal:discovery/zen/fd/ping], node [[dayrhecfdm009_PROD_DATA][QWPVOoLqQo6m3gud_yMaRQ][dayrhecfdm009.enterprisenet.org][inet[dayrhecfdm009.enterprisenet.org/10.7.157.13:9260]]{max_local_storage_nodes=1, master=false}], id [8091772]
[2016-01-12 23:40:31,161][WARN ][shield.transport ] [dayrhecfdm009_PROD_MASTER] Received response for a request that has timed out, sent [17726ms] ago, timed out [2725ms] ago, action [cluster:monitor/nodes/stats[n]], node [[dayrhecfdm009_PROD_DATA][QWPVOoLqQo6m3gud_yMaRQ][dayrhecfdm009.enterprisenet.org][inet[dayrhecfdm009.enterprisenet.org/10.7.157.13:9260]]{max_local_storage_nodes=1, master=false}], id [8091834]
[2016-01-12 23:43:58,437][DEBUG][action.admin.cluster.node.stats] [dayrhecfdm009_PROD_MASTER] failed to execute on node [a0lfkseaTMCtDxPSUAHiaw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [dayrhecfdm008_PROD_DATA][inet[dayrhecfdm008.enterprisenet.org/10.7.157.12:9260]][cluster:monitor/nodes/stats[n]] request_id [8096901] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Data log details

[2016-01-12 18:21:27,034][WARN ][monitor.jvm ] [dayrhecfdm009_PROD_MASTER] [gc][young][247709][10906] duration [1.5s], collections [1]/[2s], total [1.5s]/[12.3m], memory [26.9gb]->[25.8gb]/[29.7gb], all_pools {[young] [2gb]->[14mb]/[2.1gb]}{[survivor] [17.8mb]->[274.5mb]/[274.5mb]}{[old] [24.9gb]->[25.5gb]/[27.3gb]}
[2016-01-12 23:39:27,536][INFO ][monitor.jvm ] [dayrhecfdm009_PROD_MASTER] [gc][young][266779][12003] duration [806ms], collections [1]/[1.2s], total [806ms]/[13.4m], memory [27.4gb]->[26.2gb]/[29.7gb], all_pools {[young] [2.1gb]->[3mb]/[2.1gb]}{[survivor] [274.5mb]->[274.5mb]/[274.5mb]}{[old] [25.1gb]->[25.9gb]/[27.3gb]}
[2016-01-12 23:40:31,155][WARN ][monitor.jvm ] [dayrhecfdm009_PROD_MASTER] [gc][old][266812][14854] duration [30.9s], collections [1]/[31.5s], total [30.9s]/[20m], memory [28.4gb]->[25.7gb]/[29.7gb], all_pools {[young] [1.6gb]->[44.1mb]/[2.1gb]}{[survivor] [273.5mb]->[0b]/[274.5mb]}{[old] [26.5gb]->[25.6gb]/[27.3gb]}
[2016-01-12 23:43:12,285][WARN ][monitor.jvm ] [dayrhecfdm009_PROD_MASTER] [gc][young][266972][12011] duration [1.2s], collections [1]/[2s], total [1.2s]/[13.4m], memory [27.1gb]->[26.1gb]/[29.7gb], all_pools {[young] [2gb]->[17.4mb]/[2.1gb]}{[survivor] [207mb]->[274.5mb]/[274.5mb]}{[old] [24.8gb]->[25.8gb]/[27.3gb]}

Can someone help me understand why I am getting these warnings?

Please let us know your suggestions.

Thanks,
Ganeshbabu R

Probably. Nodes in ES can time out if GC runs for more than 30 seconds.
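For context, the GC log above shows a 30.9s old-generation collection at 23:40:31, which lines up with the timed-out requests in the master log. A quick way to check heap and GC pressure across the nodes is the nodes stats API; this is only a sketch, the host and port are placeholders, and you may need to pass credentials if Shield protects the HTTP layer:

# Sketch: per-node JVM heap usage and cumulative GC time (host/port are placeholders).
curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty'
# Quick per-node heap summary via the cat API (available in ES 1.7).
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max'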

Hi @warkolm

Is 30 seconds the default value in Elasticsearch, or can we change it manually to avoid the timeout exception?

Regards
Ganeshbabu R

You can change it (check the zen discovery docs), but that doesn't fix the underlying cause, so chances are you will run into the same problem again.
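For reference, the zen fault-detection defaults in 1.7 are discovery.zen.fd.ping_interval: 1s, ping_timeout: 30s and ping_retries: 3, and they are set per node in elasticsearch.yml. A minimal sketch of raising the timeout (the path and the 60s value are illustrative, not a recommendation, and a node restart is needed for the change to take effect):

# Sketch: append zen fault-detection overrides to a node's config
# (path and values are illustrative; repeat on each node, then restart it).
cat >> /etc/elasticsearch/elasticsearch.yml <<'EOF'
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 3
EOF
# After the restart, confirm the override was picked up.
curl -s 'http://localhost:9200/_nodes/settings?pretty' | grep ping_timeout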

Okay..

So the timeout happened because it took more than 30s to get a ping response from the other nodes during discovery fault detection.
@warkolm, could this warning also be caused by network latency?

Regards
Ganeshbabu R

Possibly, yes.
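If you want to rule network latency in or out, a rough check while the warnings are occurring is to look at round-trip time and transport-port reachability between the master host and a data node; the hostname and port below are taken from the logs above:

# Rough sketch, run from the master host: latency to a data node and
# reachability of its transport port (9260 per the log lines).
ping -c 5 dayrhecfdm008.enterprisenet.org
nc -vz dayrhecfdm008.enterprisenet.org 9260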

Thanks for your response @warkolm

If possible, can you please check the below request?

Please let us know your suggestions.

Regards,
Ganeshbabu R