A few days ago we started to receive a lot of timeouts across our cluster.
This is causing shard allocation to fail and leaving the cluster in a
perpetual red/yellow state.
Examples:
[2015-04-16 15:04:50,970][DEBUG][action.admin.cluster.node.stats] [coordinator02] failed to execute on node [1rfWT-mXTZmF_NzR_h1IZw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [search01][inet[ip-172-30-11-161.ec2.internal/172.30.11.161:9300]][cluster:monitor/nodes/stats[n]] request_id [3680727] timed out after [15001ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

[2015-04-16 15:03:26,105][WARN ][gateway.local] [coordinator02] [global.y2014m01d30.v2][0]: failed to list shard stores on node [1rfWT-mXTZmF_NzR_h1IZw]
org.elasticsearch.action.FailedNodeException: Failed node [1rfWT-mXTZmF_NzR_h1IZw]
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.onFailure(TransportNodesOperationAction.java:206)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$1000(TransportNodesOperationAction.java:97)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction$4.handleException(TransportNodesOperationAction.java:178)
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [search01][inet[ip-172-30-11-161.ec2.internal/172.30.11.161:9300]][internal:cluster/nodes/indices/shard/store[n]] request_id [3677537] timed out after [30001ms]
    ... 4 more
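For anyone following along, the red/yellow state and the stuck shards are easy to confirm with the stock cluster APIs ("localhost:9200" here just stands in for any node's HTTP address):

    curl -s 'localhost:9200/_cluster/health?pretty'
    # list every shard that isn't in the STARTED state
    curl -s 'localhost:9200/_cat/shards?v' | grep -v STARTED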
I believe I have tracked this down to the management thread pool being
saturated on our data nodes, leaving them unable to respond to these
requests. Our cluster has 3 master nodes (no data) and 3 worker nodes (no
master). I increased the maximum pool size from 5 to 20, and the active
worker count immediately jumped to 20. I'm still seeing the errors.
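In case it helps anyone else chasing the same symptom, this is roughly how to watch the management pool and bump its size at runtime. A minimal sketch, assuming the 1.x _cat column names and 1.x's dynamically updatable thread pool settings:

    # per-node view of the management pool: active threads, current size,
    # queue depth, and rejections
    curl -s 'localhost:9200/_cat/thread_pool?v&h=host,management.active,management.size,management.queue,management.rejected'

    # raise the pool's maximum size cluster-wide without a restart
    curl -s -XPUT 'localhost:9200/_cluster/settings' -d '{
      "transient": { "threadpool.management.size": 20 }
    }'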
This was tracked down to a problem with Ubuntu 14.04 running under Xen (in
AWS). The latest Ubuntu kernel fixes the bug, so I did a rolling
"apt-get update; apt-get dist-upgrade; reboot" across all nodes. This
appears to have resolved the issue.
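For reference, the rolling upgrade was the usual disable-allocation / upgrade / re-enable dance. A minimal sketch (the host list is illustrative, cluster.routing.allocation.enable is the stock 1.x setting, and the curls should target a node that isn't being rebooted):

    for host in search01 search02 search03; do
      # keep the master from re-replicating shards while the node is down
      curl -s -XPUT 'localhost:9200/_cluster/settings' \
        -d '{"transient":{"cluster.routing.allocation.enable":"none"}}'
      ssh "$host" 'sudo apt-get update && sudo apt-get -y dist-upgrade && sudo reboot'
      # crude poll: wait for the rebooted node to rejoin the cluster
      until curl -s 'localhost:9200/_cat/nodes' | grep -q "$host"; do sleep 10; done
      # let shards flow back, then wait for green before touching the next node
      curl -s -XPUT 'localhost:9200/_cluster/settings' \
        -d '{"transient":{"cluster.routing.allocation.enable":"all"}}'
      until curl -s 'localhost:9200/_cluster/health' | grep -q '"status":"green"'; do sleep 10; done
    done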