We have a cluster with a single server node and a single client node running 1.0.0.Beta1. We recently noticed that, a couple of times when the server node was not available (shut down for upgrade, maintenance, etc.), the elasticsearch client on the client node would keep creating client threads until the system ran out of memory. The thread dump shows 10K+ elasticsearch client threads in the JVM. At that point, even when the server node comes back up, the client doesn't recover gracefully; we have to manually kill the JVM and restart.
When I looked at the thread dump, I saw a lot of threads stuck like this:
"elasticsearch[Anole][generic][T#2]" daemon prio=5 tid=0x00007fbebd903800
nid=0x10403 waiting on condition [0x000000011b696000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
parking to wait for <0x00000007006163d0> (a
java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
at
java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
at
java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:927)
at
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
at
org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:203)
at
org.elasticsearch.action.TransportActionNodeProxy.execute(TransportActionNodeProxy.java:68)
at
org.elasticsearch.client.transport.support.InternalTransportClient$2.doWithNode(InternalTransportClient.java:109)
at
org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:252)
at
org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:255)
..... hundreds of frames on RetryListener.onFailure ......
at
org.elasticsearch.client.transport.TransportClientNodesService$RetryListener.onFailure(TransportClientNodesService.java:255)
at
org.elasticsearch.action.TransportActionNodeProxy$1.handleException(TransportActionNodeProxy.java:89)
at
org.elasticsearch.transport.TransportService$2.run(TransportService.java:206)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
I see a couple of problems here. It appears the executors are configured to create an unbounded number of threads, hence the risk of running out of memory. Also, each thread is blowing up its stack space by recursively retrying the request (the RetryListener.onFailure frames above).
Is there any way to avoid running into this problem? Any help would be appreciated.
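In case it matters, our client is created roughly like the sketch below. The cluster name, host, and the timeout/sampler values are placeholders, and I don't know whether tuning these settings actually prevents the runaway retries; I'm including it only for reference.

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// Illustrative TransportClient setup; "mycluster" and "es-server" are placeholders.
Settings settings = ImmutableSettings.settingsBuilder()
        .put("cluster.name", "mycluster")
        .put("client.transport.sniff", false)                  // don't discover additional nodes
        .put("client.transport.ping_timeout", "5s")            // how long to wait for a ping reply
        .put("client.transport.nodes_sampler_interval", "5s")  // how often node liveness is re-checked
        .build();
TransportClient client = new TransportClient(settings)
        .addTransportAddress(new InetSocketTransportAddress("es-server", 9300));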
I can confirm this. We are running 0.90.7 and having problems with an OOM, "failed to create native thread".
In essence, the ES client starts to create thousands of threads and we end up hitting the limit defined in the OS, in our case 16384. This happens in a matter of minutes!
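For what it's worth, since the failure mode is a runaway thread count inside the client JVM, a crude in-process watchdog can at least alert before the OS limit is hit. This is only a monitoring sketch, not a fix, and the 2000 threshold is an arbitrary placeholder:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Crude watchdog started inside the same JVM as the ES client.
public final class ThreadCountWatchdog {
    public static void start(final int threshold) {
        final ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Thread watcher = new Thread(new Runnable() {
            public void run() {
                while (true) {
                    int live = threads.getThreadCount();
                    if (live > threshold) {
                        System.err.println("WARNING: " + live + " live threads (threshold " + threshold + ")");
                    }
                    try { Thread.sleep(10000L); } catch (InterruptedException e) { return; }
                }
            }
        }, "thread-count-watchdog");
        watcher.setDaemon(true);
        watcher.start();
    }
}

// e.g. call ThreadCountWatchdog.start(2000); once at application startup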
Regards,
Serge
@Jong - when upgrading from 1.0.0.Beta1 to 1.0.0 GA you'll have to reindex; we had to change some aspects of the data during the beta period. There is no problem with upgrading from 0.90.x to 1.0.0.
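For anyone who does need to reindex with the Java client, a rough scan/scroll plus bulk sketch would look something like the following. The index names, page size, and scroll timeout are placeholders, and real code should also check the bulk responses for failures:

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

public final class ReindexSketch {
    // Copy every document from "old_index" into "new_index" (names are placeholders).
    public static void reindex(Client client) {
        SearchResponse scroll = client.prepareSearch("old_index")
                .setSearchType(SearchType.SCAN)
                .setScroll(TimeValue.timeValueMinutes(2))
                .setQuery(QueryBuilders.matchAllQuery())
                .setSize(100)                        // docs per shard per scroll round-trip
                .execute().actionGet();
        while (true) {
            scroll = client.prepareSearchScroll(scroll.getScrollId())
                    .setScroll(TimeValue.timeValueMinutes(2))
                    .execute().actionGet();
            if (scroll.getHits().getHits().length == 0) {
                break;                               // no more documents
            }
            BulkRequestBuilder bulk = client.prepareBulk();
            for (SearchHit hit : scroll.getHits()) {
                bulk.add(client.prepareIndex("new_index", hit.getType(), hit.getId())
                        .setSource(hit.getSourceAsString()));
            }
            bulk.execute().actionGet();              // check the response for failures in real code
        }
    }
}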
Cheers,
Boaz