ES Prod cluster ReceiveTimeoutTransportException


(ganeshbabu) #1

Hi All,

In our production cluster (ES 1.7.2) we have 3 master nodes, 3 data nodes, and 1 client node. In the dedicated master's log I can see the warning messages below.

Master Log details

[2016-01-12 23:39:58,436][DEBUG][action.admin.cluster.node.stats] [dayrhecfdm009_PROD_MASTER] failed to execute on node [a0lfkseaTMCtDxPSUAHiaw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [dayrhecfdm008_PROD_DATA][inet[dayrhecfdm008.enterprisenet.org/10.7.157.12:9260]][cluster:monitor/nodes/stats[n]] request_id [8091207] timed out after [15001ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[2016-01-12 23:40:08,351][WARN ][shield.transport ] [dayrhecfdm009_PROD_MASTER] Received response for a request that has timed out, sent [24916ms] ago, timed out [9915ms] ago, action [cluster:monitor/nodes/stats[n]], node [[dayrhecfdm008_PROD_DATA][a0lfkseaTMCtDxPSUAHiaw][dayrhecfdm008.enterprisenet.org][inet[dayrhecfdm008.enterprisenet.org/10.7.157.12:9260]]{max_local_storage_nodes=1, master=false}], id [8091207]
[2016-01-12 23:40:28,436][DEBUG][action.admin.cluster.node.stats] [dayrhecfdm009_PROD_MASTER] failed to execute on node [QWPVOoLqQo6m3gud_yMaRQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [dayrhecfdm009_PROD_DATA][inet[dayrhecfdm009.enterprisenet.org/10.7.157.13:9260]][cluster:monitor/nodes/stats[n]] request_id [8091834] timed out after [15001ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[2016-01-12 23:40:31,155][WARN ][shield.transport ] [dayrhecfdm009_PROD_MASTER] Received response for a request that has timed out, sent [30564ms] ago, timed out [563ms] ago, action [internal:discovery/zen/fd/ping], node [[dayrhecfdm009_PROD_DATA][QWPVOoLqQo6m3gud_yMaRQ][dayrhecfdm009.enterprisenet.org][inet[dayrhecfdm009.enterprisenet.org/10.7.157.13:9260]]{max_local_storage_nodes=1, master=false}], id [8091772]
[2016-01-12 23:40:31,161][WARN ][shield.transport ] [dayrhecfdm009_PROD_MASTER] Received response for a request that has timed out, sent [17726ms] ago, timed out [2725ms] ago, action [cluster:monitor/nodes/stats[n]], node [[dayrhecfdm009_PROD_DATA][QWPVOoLqQo6m3gud_yMaRQ][dayrhecfdm009.enterprisenet.org][inet[dayrhecfdm009.enterprisenet.org/10.7.157.13:9260]]{max_local_storage_nodes=1, master=false}], id [8091834]
[2016-01-12 23:43:58,437][DEBUG][action.admin.cluster.node.stats] [dayrhecfdm009_PROD_MASTER] failed to execute on node [a0lfkseaTMCtDxPSUAHiaw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [dayrhecfdm008_PROD_DATA][inet[dayrhecfdm008.enterprisenet.org/10.7.157.12:9260]][cluster:monitor/nodes/stats[n]] request_id [8096901] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Data log details

[2016-01-12 18:21:27,034][WARN ][monitor.jvm ] [dayrhecfdm009_PROD_MASTER] [gc][young][247709][10906] duration [1.5s], collections [1]/[2s], total [1.5s]/[12.3m], memory [26.9gb]->[25.8gb]/[29.7gb], all_pools {[young] [2gb]->[14mb]/[2.1gb]}{[survivor] [17.8mb]->[274.5mb]/[274.5mb]}{[old] [24.9gb]->[25.5gb]/[27.3gb]}
[2016-01-12 23:39:27,536][INFO ][monitor.jvm ] [dayrhecfdm009_PROD_MASTER] [gc][young][266779][12003] duration [806ms], collections [1]/[1.2s], total [806ms]/[13.4m], memory [27.4gb]->[26.2gb]/[29.7gb], all_pools {[young] [2.1gb]->[3mb]/[2.1gb]}{[survivor] [274.5mb]->[274.5mb]/[274.5mb]}{[old] [25.1gb]->[25.9gb]/[27.3gb]}
[2016-01-12 23:40:31,155][WARN ][monitor.jvm ] [dayrhecfdm009_PROD_MASTER] [gc][old][266812][14854] duration [30.9s], collections [1]/[31.5s], total [30.9s]/[20m], memory [28.4gb]->[25.7gb]/[29.7gb], all_pools {[young] [1.6gb]->[44.1mb]/[2.1gb]}{[survivor] [273.5mb]->[0b]/[274.5mb]}{[old] [26.5gb]->[25.6gb]/[27.3gb]}
[2016-01-12 23:43:12,285][WARN ][monitor.jvm ] [dayrhecfdm009_PROD_MASTER] [gc][young][266972][12011] duration [1.2s], collections [1]/[2s], total [1.2s]/[13.4m], memory [27.1gb]->[26.1gb]/[29.7gb], all_pools {[young] [2gb]->[17.4mb]/[2.1gb]}{[survivor] [207mb]->[274.5mb]/[274.5mb]}{[old] [24.8gb]->[25.8gb]/[27.3gb]}

Can someone help me understand why I am getting these warnings?

Please let us know your suggestions.

Thanks,
Ganeshbabu R


(Mark Walkom) #2

Probably the GC. Nodes in ES can time out if a garbage collection pause runs longer than 30 seconds, and your log shows an old-gen collection that took 30.9s at 23:40:31, the same moment the master logged responses arriving after the timeout.
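
For reference, a quick way to keep an eye on heap usage and GC activity is the node stats API. A minimal sketch, assuming HTTP is served on the default port 9200 (your logs show transport on 9260, so adjust if your HTTP port differs):

    # Fetch JVM stats (heap used, GC collection counts and times) from all nodes.
    # Replace localhost with any node reachable over HTTP.
    curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty'

Watching heap_used_percent and the old-gen collection times in that output will tell you how often these long pauses happen.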


(ganeshbabu) #3

Hi @warkolm

Is 30 seconds the default value in Elasticsearch, or can we change it manually to avoid the timeout exception?

Regards
Ganeshbabu R


(Mark Walkom) #4

You can change it (check the zen discovery docs), but that's not really fixing the problem causing it, so chances are you will run into the same problem again.
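
For what it's worth, the settings in question are the zen fault-detection ones. A minimal elasticsearch.yml sketch, with the 1.x defaults shown (raising ping_timeout only hides the GC pauses rather than fixing them):

    # Fault detection between master and nodes (1.x defaults shown)
    discovery.zen.fd.ping_interval: 1s   # how often nodes are pinged
    discovery.zen.fd.ping_timeout: 30s   # how long to wait for each ping response
    discovery.zen.fd.ping_retries: 3     # timeouts/failures before a node is treated as failed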


(ganeshbabu) #5

Okay.

So the timeout means ES waited more than 30s for a fault-detection ping response from the other nodes.
@warkolm could this warning be because of network latency?

Regards
Ganeshbabu R


(Mark Walkom) #6

Possibly, yes.
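
If you want to rule latency in or out, a rough check between the hosts from your logs (transport port 9260, as shown there) could look like this, assuming nc is installed:

    # Round-trip latency from the master host to a data node
    ping -c 10 dayrhecfdm008.enterprisenet.org
    # Confirm the transport port answers promptly
    time nc -vz dayrhecfdm008.enterprisenet.org 9260

Given the 30.9s old-gen pause in your GC log, though, GC still looks like the more likely cause.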


(ganeshbabu) #7

Thanks for your response @warkolm

If possible, can you please check the below request?

Please let us know your suggestions.

Regards,
Ganeshbabu R

