Hi,
I've been seeing some strange behavior from the ES client when the
server cluster is struggling under heavy load. It looks like it gets stuck
in an infinite retry loop, spawning new threads until the host runs out of
processes (ulimit) or memory.
I've attached a thread dump (jstack.11.03.txt), and the logs from one of
the cluster nodes (es.log.11.03.txt).
In the thread dump you can see over 2000 elasticsearch-threads.
First of all, how does this happen in the client? It seems like a bug to
me. The conditions are hard to debug, as we cannot trigger the problem on
demand, and as soon as it arises the client application spawns so many
threads that the machine is rendered useless.
Also, with 2 nodes and 3 clients, how can the following stack trace occur?
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:259)
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
How does the random index at line 221 of TransportClientNodesService work
in conjunction with the linear probe on line 262? Seems weird.
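To make the question concrete, here is a minimal Java sketch of a "random start index plus linear probe" retry. It is not the actual TransportClientNodesService code: the class, the method names, and the failure predicate are all made up for illustration, and it assumes each failure advances a counter that is bounded by the size of the node list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

/**
 * Simplified model of a retry that starts at a random node and probes linearly.
 * NOT the Elasticsearch source: every name in this class is hypothetical.
 */
final class RetryProbeSketch {

    /** Tries nodes starting at startIndex, wrapping around; returns the nodes attempted, in order. */
    static List<String> probe(List<String> nodes, int startIndex, Predicate<String> nodeFails) {
        List<String> attempted = new ArrayList<>();
        probeFrom(nodes, startIndex, 0, nodeFails, attempted);
        return attempted;
    }

    // One recursive call per failed node, loosely mirroring nested onFailure frames.
    private static void probeFrom(List<String> nodes, int startIndex, int i,
                                  Predicate<String> nodeFails, List<String> attempted) {
        if (i >= nodes.size()) {
            return; // every node was tried once: give up instead of retrying forever
        }
        String node = nodes.get(Math.floorMod(startIndex + i, nodes.size()));
        attempted.add(node);
        if (nodeFails.test(node)) {
            probeFrom(nodes, startIndex, i + 1, nodeFails, attempted); // linear probe to the next node
        }
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node-a", "node-b");
        int start = new Random().nextInt(nodes.size()); // random starting point
        List<String> attempted = probe(nodes, start, node -> true); // assume every node fails
        System.out.println("start=" + start + " attempted=" + attempted);
    }
}
```

Under these assumptions a 2-node list allows at most 2 attempts per request, so a chain of 11 nested onFailure frames would need either a longer node list or a counter that does not advance the way it is modeled here.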
We have tried running with a very simple setup of:
- 2 cluster nodes (one of which provides the es.log.11.03.txt).
- 1 indexing node (which provides the jstack.11.03.txt).
- 1 search API (didn't fork-bomb in this particular case).
We've experienced similar symptoms in the client during a controlled
shutdown of an Elasticsearch node, during an OOM in the cluster, and on a
queue overrun as in this example log.
In this case we're running 45 threads in the indexer, which do both
searching and indexing. Every request is made synchronous by calling
.actionGet() on the returned future, so we should never have reached a
queue size of 1000.
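The reasoning behind that expectation can be sketched as follows. This is a rough Java model, not Elasticsearch client code: a plain ExecutorService stands in for the cluster, Future.get() stands in for .actionGet(), and all names are hypothetical. It only shows that callers which block on each response can never have more requests in flight than there are caller threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model: N caller threads, each blocking on every async request it issues. */
final class BlockingCallerSketch {

    /** Returns the highest number of requests that were in flight at the same time. */
    static int peakInFlight(int workers, int requestsPerWorker) {
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService backend = Executors.newCachedThreadPool();       // stands in for the cluster
        ExecutorService callers = Executors.newFixedThreadPool(workers); // stands in for the indexer threads
        try {
            List<Future<?>> done = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                done.add(callers.submit(() -> {
                    for (int r = 0; r < requestsPerWorker; r++) {
                        int now = inFlight.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        Future<?> response = backend.submit(() -> { inFlight.decrementAndGet(); });
                        response.get(); // block until the response arrives, like .actionGet()
                    }
                    return null;
                }));
            }
            for (Future<?> f : done) {
                f.get(); // wait for every caller thread to finish
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            callers.shutdown();
            backend.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        // 45 caller threads, matching the indexer setup described above.
        System.out.println("peak in-flight requests: " + peakInFlight(45, 100));
    }
}
```

This bounds only the number of client-side requests. It does not model anything the server does with a request after receiving it.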
The search api only does searches, and it experienced negligible traffic
when this happened.
Any ideas what could cause it?