TransportClient behavior when the server node is not available

We have a cluster with a single server node and a single client node, running
1.0.0.Beta1. We recently noticed that, a couple of times when the server
node was not available (shut down for upgrade, maintenance, etc.), the
elasticsearch client on the client node kept creating client threads until
the system ran out of memory. The thread dump shows 10K+ elasticsearch
client threads in the JVM. At that point, even when the server node comes
back up, the client doesn't recover gracefully; we have to manually kill
the JVM and restart.

Looking at the thread dump, I see a lot of threads stuck like this:

"elasticsearch[Anole][generic][T#2]" daemon prio=5 tid=0x00007fbebd903800
nid=0x10403 waiting on condition [0x000000011b696000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)

  • parking to wait for <0x00000007006163d0> (a
    java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    at
    java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
    at
    java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
    at
    java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
    at
    java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
    at
    java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:927)
    at
    java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
    at
    org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:203)
    at
    org.elasticsearch.action.TransportActionNodeProxy.execute(TransportActionNodeProxy.java:68)
    at
    org.elasticsearch.client.transport.support.InternalTransportClient$2.doWithNode(InternalTransportClient.java:109)
    at
    org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:252)
    at
    org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:255)

..... hundreds of frames on RetryListener.onFailure ......

at
org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:255)
at
org.elasticsearch.action.TransportActionNodeProxy$1.handleException(TransportActionNodeProxy.java:89)
at
org.elasticsearch.transport.TransportService$2.run(TransportService.java:206)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)

I see a couple of problems here. The executor appears to be configured to
create an unbounded number of threads, hence the risk of running out of
memory. Also, each thread blows up its stack by retrying the request
through deeply nested RetryListener.onFailure calls.
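
To illustrate what I mean by "unbounded": I haven't checked exactly how the
generic pool is built in the ES source, so take this as an assumption, but
if it is a plain cached-style executor along these lines, every submission
that finds all workers busy spawns a brand new thread, which would match
the 10K+ threads we see:

    import java.util.concurrent.SynchronousQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class UnboundedPoolDemo {
        public static void main(String[] args) throws InterruptedException {
            // Cached-style pool: no core threads, no real upper bound, and a
            // SynchronousQueue that never buffers tasks. Every task submitted
            // while all workers are busy forces addWorker() to spawn a thread.
            ThreadPoolExecutor pool = new ThreadPoolExecutor(
                    0, Integer.MAX_VALUE,
                    60L, TimeUnit.SECONDS,
                    new SynchronousQueue<Runnable>());

            for (int i = 0; i < 10000; i++) {
                pool.execute(new Runnable() {
                    public void run() {
                        try {
                            // Simulate a handler that blocks, e.g. waiting on a dead node.
                            Thread.sleep(60000);
                        } catch (InterruptedException ignored) {
                            // exit quietly when the pool is shut down
                        }
                    }
                });
            }
            Thread.sleep(1000);
            // Roughly one thread per blocked task (until the OS refuses to create more).
            System.out.println("threads created: " + pool.getPoolSize());
            pool.shutdownNow();
        }
    }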

Is there any way to avoid running into this problem? Any help would be
appreciated.

Jong


Hi Jong,

I can confirm this. We are running 0.90.7 and having problems with an OOM
("failed to create native thread").
In essence the ES client starts to create thousands of threads and we end up
hitting the limit defined in the OS, in our case 16384. This happens in a
matter of minutes!

Regards,
Serge

On Wednesday, December 4, 2013 7:27:54 PM UTC+1, Jongyoon Lee wrote:


Hi Jong, Serge,

I suspect this was fixed in a later version of ES (see the discussion on
elasticsearch issue #5151: https://github.com/elastic/elasticsearch/issues/5151).
Can you upgrade to 1.0.0 GA or 0.90.11 and check?

@Jong - when upgrading from 1.0.0.Beta1 to 1.0.0 GA you'd have to reindex;
we had to change some aspects of the data during the beta period.
There is no problem with upgrading from 0.90.x to 1.0.0.
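
If you can't upgrade right away, a stopgap on the caller side is to catch
NoNodeAvailableException yourself and back off, so you don't keep piling
requests onto a client whose only node is down. A rough sketch (the
indexWithBackoff helper, attempt count, and sleeps are purely illustrative;
this only limits your own call rate and doesn't change the client's internal
retry behavior):

    import org.elasticsearch.action.index.IndexResponse;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.client.transport.NoNodeAvailableException;

    public class BackoffIndexer {

        // Illustrative helper: pause between attempts and give up after a few,
        // instead of hammering the client while no node is reachable.
        public static IndexResponse indexWithBackoff(Client client, String index,
                                                     String type, String json)
                throws InterruptedException {
            int attempts = 0;
            while (true) {
                try {
                    return client.prepareIndex(index, type)
                            .setSource(json)
                            .execute()
                            .actionGet();
                } catch (NoNodeAvailableException e) {
                    attempts++;
                    if (attempts >= 5) {
                        throw e; // let the caller decide what to do
                    }
                    Thread.sleep(2000L * attempts); // crude linear backoff
                }
            }
        }
    }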

Cheers,
Boaz

On Tuesday, February 18, 2014 8:56:04 AM UTC+1, Teppo Kurki wrote:

TransportClient behavior when the server node is not available - https://github.com/elastic/elasticsearch/issues/5151
Unbound threadpools considered harmful - https://github.com/elastic/elasticsearch/issues/5152
