Saturating the management thread pool

A few days ago we started to receive a lot of timeouts across our cluster.
This is causing shard allocation to fail and leaving the cluster in a
perpetual red/yellow state.

Examples:
[2015-04-16 15:04:50,970][DEBUG][action.admin.cluster.node.stats] [coordinator02] failed to execute on node [1rfWT-mXTZmF_NzR_h1IZw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [search01][inet[ip-172-30-11-161.ec2.internal/172.30.11.161:9300]][cluster:monitor/nodes/stats[n]] request_id [3680727] timed out after [15001ms]
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

[2015-04-16 15:03:26,105][WARN ][gateway.local ] [coordinator02] [global.y2014m01d30.v2][0]: failed to list shard stores on node [1rfWT-mXTZmF_NzR_h1IZw]
org.elasticsearch.action.FailedNodeException: Failed node [1rfWT-mXTZmF_NzR_h1IZw]
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.onFailure(TransportNodesOperationAction.java:206)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction.access$1000(TransportNodesOperationAction.java:97)
    at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$AsyncAction$4.handleException(TransportNodesOperationAction.java:178)
    at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [search01][inet[ip-172-30-11-161.ec2.internal/172.30.11.161:9300]][internal:cluster/nodes/indices/shard/store[n]] request_id [3677537] timed out after [30001ms]
    ... 4 more

I believe I have tracked this down to the management thread pool on our data
nodes being saturated and not responding to requests. Our cluster has 3
master-only nodes (no data) and 3 worker nodes (no master). I increased the
maximum pool size from 5 to 20, and the pools on the worker nodes immediately
grew to 20 threads, but I'm still seeing the errors.

Management thread pool stats per node (queueSize is empty for these scaling pools):

host           type     active  size  queue  queueSize  rejected  largest  completed  min  max  keepAlive
coordinator01  scaling  1       2     0                 0         2        37884      1    20   5m
search02       scaling  1       20    0                 0         20       1945337    1    20   5m
search01       scaling  1       20    0                 0         20       2034838    1    20   5m
search03       scaling  1       20    0                 0         20       1862848    1    20   5m
coordinator03  scaling  1       2     0                 0         2        37875      1    20   5m
coordinator02  scaling  2       5     0                 0         5        44127      1    20   5m
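
For reference, this is the management.* set of columns from the cat thread pool API; a request along these lines (with localhost:9200 standing in for any node in the cluster) prints the same table:

# Dump the management thread pool stats for every node.
# "localhost:9200" is a placeholder for any node in the cluster.
curl -s 'http://localhost:9200/_cat/thread_pool?v&h=host,management.type,management.active,management.size,management.queue,management.queueSize,management.rejected,management.largest,management.completed,management.min,management.max,management.keepAlive'

Polling this every few seconds while the timeouts are happening makes it easy to see whether active ever reaches max, or whether the pool looks idle while requests still time out.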

How can I address this problem?

Thanks,
Charlie

This was tracked down to a problem with Ubuntu 14.04 running under Xen (in
AWS). The latest Ubuntu kernel fixes it, so I did a rolling "apt-get update;
apt-get dist-upgrade; reboot" across all nodes, which appears to have
resolved the issue.

For reference: Bug #1317811 “Dropped packets on EC2, “xen_netfront: xennet: skb...” : Bugs : linux package : Ubuntu
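
In case it helps anyone hitting the same bug, here is a rough sketch of the rolling procedure, one node at a time. It is a sketch rather than a verbatim transcript: the dmesg check and the allocation disable/re-enable steps are standard rolling-restart practice rather than anything specific to this fix, and the hostnames are placeholders.

# Rough sketch only; NODE is a placeholder for each node's hostname in turn.
NODE=search01

# Optional: confirm the node is actually logging xen_netfront errors.
ssh "$NODE" 'dmesg | grep -i xen_netfront | tail'

# Standard rolling-restart practice: pause shard allocation while the node is down.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# Upgrade the kernel and reboot the node.
ssh "$NODE" 'sudo apt-get update && sudo apt-get dist-upgrade -y && sudo reboot'

# Once the node has rejoined, re-enable allocation and wait for green before
# moving on to the next node.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'
curl 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=10m'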

Also related: Hanging transport connection thread on EC2 · Issue #10447 · elastic/elasticsearch · GitHub
