Saturating the management thread pool

A few days ago we started to receive a lot of timeouts across our cluster.
This is causing shard allocation to fail and leaving the cluster in a
perpetual red/yellow state.

Examples:
[2015-04-16 15:04:50,970][DEBUG][action.admin.cluster.node.stats] [coordinator02] failed to execute on node [1rfWT-mXTZmF_NzR_h1IZw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [search01][inet[ip-172-30-11-161.ec2.internal/172.30.11.161:9300]][cluster:monitor/nodes/stats[n]] request_id [3680727] timed out after [15001ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

[2015-04-16 15:03:26,105][WARN ][gateway.local ] [coordinator02] [global.y2014m01d30.v2][0]: failed to list shard stores on node [1rfWT-mXTZmF_NzR_h1IZw]
org.elasticsearch.action.FailedNodeException: Failed node [1rfWT-mXTZmF_NzR_h1IZw]
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.onFailure(TransportNodesOperationAction.java:206)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$1000(TransportNodesOperationAction.java:97)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction$4.handleException(TransportNodesOperationAction.java:178)
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [search01][inet[ip-172-30-11-161.ec2.internal/172.30.11.161:9300]][internal:cluster/nodes/indices/shard/store[n]] request_id [3677537] timed out after [30001ms]
    ... 4 more

I believe I have tracked this down to the management thread pool on our data
nodes being saturated and not responding to requests. Our cluster has 3
master-only nodes (no data) and 3 worker nodes (no master). I increased the
maximum pool size from 5 to 20, and the pools on the worker nodes immediately
grew to 20 threads, but I'm still seeing the errors.

Management thread pool stats per node (queueSize is empty for these scaling pools):

host           type     active  size  queue  queueSize  rejected  largest  completed  min  max  keepAlive
coordinator01  scaling  1       2     0                 0         2        37884      1    20   5m
search02       scaling  1       20    0                 0         20       1945337    1    20   5m
search01       scaling  1       20    0                 0         20       2034838    1    20   5m
search03       scaling  1       20    0                 0         20       1862848    1    20   5m
coordinator03  scaling  1       2     0                 0         2        37875      1    20   5m
coordinator02  scaling  2       5     0                 0         5        44127      1    20   5m
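
For reference, this is the management.* set of columns from the cat thread pool API; a request along these lines (with localhost:9200 standing in for any node in the cluster) prints the same table:

# Dump the management thread pool stats for every node.
# "localhost:9200" is a placeholder for any node in the cluster.
curl -s 'http://localhost:9200/_cat/thread_pool?v&h=host,management.type,management.active,management.size,management.queue,management.queueSize,management.rejected,management.largest,management.completed,management.min,management.max,management.keepAlive'

Polling this every few seconds while the timeouts are happening makes it easy to see whether active ever reaches max, or whether the pool looks idle while requests still time out.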

How can I address this problem?

Thanks,
Charlie

This was tracked down to a problem with Ubuntu 14.04 running under Xen (in
AWS). The latest Ubuntu kernel fixes it, so I did a rolling "apt-get update;
apt-get dist-upgrade; reboot" across all nodes, which appears to have
resolved the issue.

For reference: Bug #1317811 “Dropped packets on EC2, “xen_netfront: xennet: skb...” : Bugs : linux package : Ubuntu
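
In case it helps anyone hitting the same bug, here is a rough sketch of the rolling procedure, one node at a time. It is a sketch rather than a verbatim transcript: the dmesg check and the allocation disable/re-enable steps are standard rolling-restart practice rather than anything specific to this fix, and the hostnames are placeholders.

# Rough sketch only; NODE is a placeholder for each node's hostname in turn.
NODE=search01

# Optional: confirm the node is actually logging xen_netfront errors.
ssh "$NODE" 'dmesg | grep -i xen_netfront | tail'

# Standard rolling-restart practice: pause shard allocation while the node is down.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# Upgrade the kernel and reboot the node.
ssh "$NODE" 'sudo apt-get update && sudo apt-get dist-upgrade -y && sudo reboot'

# Once the node has rejoined, re-enable allocation and wait for green before
# moving on to the next node.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'
curl 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=10m'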

Also related: Hanging transport connection thread on EC2 · Issue #10447 · elastic/elasticsearch · GitHub
