Client fork-bombs on server error

Hi,

I've been seeing some odd behaviour from the ES client when the server
cluster is struggling under heavy load. It looks like it gets stuck
in an infinite retry loop, spawning new threads until the host runs out of
processes (ulimit) or memory.
I've attached a thread dump (jstack.11.03.txt), and the logs from one of
the cluster nodes (es.log.11.03.txt).
In the thread dump you can see over 2000 elasticsearch-threads.

First of all, how does this happen in the client? It seems like a bug to
me. The conditions are hard to debug: we cannot trigger the problem on
demand, and as soon as it arises the client application spawns so many
threads that the machine is rendered useless.

Also, with 2 nodes and 3 clients, how can the following stack trace occur?
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:259)
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)
    at org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:262)

How does the random index at line 221 of TransportClientNodesService work
in conjunction with the linear probe on line 262? Seems weird.
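
To make the question concrete, here is a minimal sketch of how a random
start index combined with a linear probe would be expected to behave. It
is not the Elasticsearch source: ProbeSketch, probeOrder and
visitedWhenAllNodesFail are hypothetical names, and the give-up rule (stop
once every node has been tried once) is an assumption. Under that
assumption a request against 2 nodes would nest at most 2 failure
callbacks, which is why a trace with more than 2 nested onFailure frames
looks odd.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class ProbeSketch {

    /** Order in which node indices are tried: begin at startIndex, then probe linearly with wrap-around. */
    static List<Integer> probeOrder(int nodeCount, int startIndex) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < nodeCount; i++) {
            order.add((startIndex + i) % nodeCount);
        }
        return order;
    }

    /** Simulates a request for which every node fails; returns the nodes that were tried, in order. */
    static List<Integer> visitedWhenAllNodesFail(int nodeCount, int startIndex) {
        List<Integer> visited = new ArrayList<>();
        onFailure(nodeCount, startIndex, 0, visited);
        return visited;
    }

    // Recursive on purpose: mimics a failure callback that invokes itself for the next node.
    private static void onFailure(int nodeCount, int startIndex, int attempt, List<Integer> visited) {
        if (attempt >= nodeCount) {
            return; // assumption: give up after one attempt per node
        }
        visited.add((startIndex + attempt) % nodeCount); // this node "fails" too
        onFailure(nodeCount, startIndex, attempt + 1, visited);
    }

    public static void main(String[] args) {
        int nodeCount = 2;
        int start = new Random().nextInt(nodeCount); // random start index
        System.out.println("probe order: " + probeOrder(nodeCount, start));
        System.out.println("nodes tried: " + visitedWhenAllNodesFail(nodeCount, start));
    }
}
```

With this termination rule the recursion depth is bounded by the node
count, whatever the random start index is.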

We have tried running with a very simple setup of:

  • 2 cluster nodes (one of which provides the es.log.11.03.txt).
  • 1 indexing node (which provides the jstack.11.03.txt).
  • 1 search api (didn't fork-bomb in this particular case).

We've experienced similar symptoms in the client after a controlled
shutdown of an elasticsearch node, after an OOM in the cluster, and after a
queue overrun as in this example log.
In this case we're running 45 threads in the indexer, which do both
searching and indexing. Every request is synchronous, since we call
.actionGet() on the Futures, so we should never have reached a queue size
of 1000.
The search api only does searches, and it experienced negligible traffic
when this happened.
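
The reasoning behind that expectation can be sketched in plain Java. This
is only an illustration under stated assumptions: InFlightSketch and
peakInFlight are hypothetical names, Future.get() stands in for
.actionGet(), and a local thread pool stands in for the cluster. It does
not model anything the client may do internally, such as retries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class InFlightSketch {

    /** Runs `workers` threads that each issue blocking requests one at a time; returns the peak number in flight. */
    static int peakInFlight(int workers, int requestsPerWorker) {
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService server = Executors.newCachedThreadPool();      // stand-in for the cluster
        ExecutorService clients = Executors.newFixedThreadPool(workers); // stand-in for the indexer threads
        try {
            List<Future<?>> running = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                running.add(clients.submit(() -> {
                    for (int r = 0; r < requestsPerWorker; r++) {
                        int now = inFlight.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max); // remember the highest value seen
                        Future<?> response = server.submit(() -> {
                            Thread.sleep(1); // pretend the server does some work
                            return null;
                        });
                        response.get(); // block until done, like calling .actionGet()
                        inFlight.decrementAndGet();
                    }
                    return null;
                }));
            }
            for (Future<?> f : running) {
                f.get(); // wait for every worker to finish
            }
            return peak.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            clients.shutdown();
            server.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("peak in flight: " + peakInFlight(45, 20));
    }
}
```

Because each worker blocks on its own request before sending the next one,
the peak can never exceed the number of workers (45 here), far below a
queue size of 1000.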

Any ideas what could cause it?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/d3eeff91-2118-4c5d-8aa2-6dd32fb7c70f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

I understand from "Java HotSpot(TM) 64-Bit Server VM (24.51-b03 mixed mode)"
that this is Java 7u51 on OS X Mavericks? If so, can you downgrade to 7u25?

Jörg

Yes I can try downgrading as an experiment, getting to it ASAP.

System info:

Elasticsearch 0.90.7.

$ cat /etc/issue
Red Hat Enterprise Linux Server release 6.4 (Santiago)
Kernel \r on an \m

$ cat /etc/lsb-release
LSB_VERSION=base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch

$ java -version
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

--
Magnus Haug

On Saturday, March 15, 2014 11:49:51 AM UTC+1, Jörg Prante wrote:

I understand from "Java HotSpot(TM) 64-Bit Server VM (24.51-b03 mixed mode)"
that this is Java 7u51 on OS X Mavericks? If so, can you downgrade to 7u25?

Jörg

Downgrading to Java 7u25 did not improve the situation. The Java client
still fork-bombed.

Any other ideas for what we can try? Any ideas why it might happen?

--
Magnus Haug

On Monday, March 17, 2014 10:52:37 AM UTC+1, mag...@stack.no wrote:

Yes I can try downgrading as an experiment, getting to it ASAP.

