ES has gone into a hung state and our production is down. Please help!

Hi,
Our website kept going down within 5 minutes of an Apache restart, and after hours
of debugging it turned out that the calls to ES were hanging forever, which
prevented Apache from serving any more requests.
I noticed that even my indexer was hung on a call to ES (actionGet). I did a
jstack of ES, and almost all threads look like the one below. It seems like it
went into a deadlocked state.

Has anyone seen this before? Can you please help? What needs to be done
(other than restarting ES, which I don't want to do until I have extracted all
the debugging info required)? How can this be prevented in the future?

ES process stack -
"elasticsearch[search]-pool-3-thread-33287" daemon prio=10
tid=0x00007f4efba4d00
0 nid=0x1593 waiting on condition [0x00007f4f45160000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000778483358> (a
java.util.concurrent.lock
s.ReentrantReadWriteLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInt
errupt(AbstractQueuedSynchronizer.java:838)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared
(AbstractQueuedSynchronizer.java:968)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(A
bstractQueuedSynchronizer.java:1284)
at
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(Reent
rantReadWriteLock.java:627)
at
org.elasticsearch.index.engine.robin.RobinEngine.get(RobinEngine.java
:296)
at
org.elasticsearch.index.shard.service.InternalIndexShard.get(Internal
IndexShard.java:378)
at
org.elasticsearch.index.get.ShardGetService.innerGet(ShardGetService.
java:133)
at
org.elasticsearch.index.get.ShardGetService.get(ShardGetService.java:
100)
at
org.elasticsearch.action.get.TransportGetAction.shardOperation(Transp
ortGetAction.java:104)
at
org.elasticsearch.action.get.TransportGetAction.shardOperation(Transp
ortGetAction.java:47)
at
org.elasticsearch.action.support.single.shard.TransportShardSingleOpe
rationAction$AsyncSingleAction$1.run(TransportShardSingleOperationAction.java:15
1)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1110)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:603)
at java.lang.Thread.run(Thread.java:636)

My indexer's stack, which has been hung for many hours now -
"ESThread-1" prio=10 tid=0x00007f75742b2000 nid=0x67ae waiting on condition
[0x00007f756860e000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000008b5b1c88> (a
org.elasticsearch.common.util.concurrent.AbstractFuture$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:838)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at
org.elasticsearch.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:249)
at
org.elasticsearch.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:78)
at
org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:44)
at
com.akanksh.information.searchindexer.messageconsumer.ESMessageProcessor.createUpdateDocument(ESMessageProcessor.java:128)
at
com.akanksh.information.searchindexer.messageconsumer.ESMessageProcessor.processMessage(ESMessageProcessor.java:110)
at
com.akanksh.information.searchindexer.messageconsumer.ESMessageProcessor.run(ESMessageProcessor.java:95)
at java.lang.Thread.run(Thread.java:636)
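
One client-side stopgap, independent of whatever is holding the lock inside ES, is to bound the actionGet calls so worker threads cannot park forever when the cluster stops responding. A minimal sketch, assuming the standard Java Client; the index, type and id are placeholders, and exact builder method names may differ slightly on 0.18.x:

    import org.elasticsearch.action.index.IndexResponse;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.unit.TimeValue;

    public class BoundedIndexCall {
        // Index one document, waiting at most 10 seconds for the response instead of
        // letting the calling thread park indefinitely on actionGet().
        public static IndexResponse indexWithTimeout(Client client, String jsonSource) {
            return client.prepareIndex("myindex", "mytype", "1")   // placeholder index/type/id
                    .setSource(jsonSource)
                    .execute()
                    .actionGet(TimeValue.timeValueSeconds(10));    // throws a timeout exception if no response arrives in time
        }
    }

A request that times out this way still fails, but the thread is returned to Apache or the indexer instead of hanging, which keeps the site up while the ES node is investigated.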

thanks

Also, I am using ES 0.18.7.

thanks

Hi,

That indeed looks like some ugly lock. I would:

  1. Grab the latest version of ES - the one you have is N months old, and most
     people don't like debugging bugs that may already have been fixed.
  2. Grab SPM for ES or Bigdesk to monitor your ES cluster (see sematext.com).
     Maybe this lock is a manifestation of issues that would have been visible
     earlier in a monitoring tool; a minimal health-check sketch follows this list.
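
If setting up a full monitoring tool takes a while, even a tiny poller against the standard _cluster/health REST endpoint would catch a node that has stopped responding. A rough sketch using only JDK classes; the host and port are assumptions, so adjust them to your node:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class HealthPoller {
        // Poll the _cluster/health endpoint once a minute and log the response.
        // A node that is wedged on a lock typically stops answering (or answers
        // very slowly), which shows up here long before every Apache worker is stuck.
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://localhost:9200/_cluster/health"); // assumed host and port
            while (true) {
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.setConnectTimeout(5000); // never let the poller itself hang
                conn.setReadTimeout(5000);
                BufferedReader in = null;
                try {
                    in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
                    System.out.println(System.currentTimeMillis() + " " + in.readLine()); // health JSON is one line
                } catch (Exception e) {
                    System.err.println(System.currentTimeMillis() + " health check failed: " + e);
                } finally {
                    if (in != null) in.close();
                    conn.disconnect();
                }
                Thread.sleep(60 * 1000L);
            }
        }
    }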

Otis

Can you gist the full jstack? Once you do that, the quickest way to recover
is to do a restart.
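
For capturing the data before restarting, jstack -l <pid> against the ES process gives the full dump with lock information to gist. If attaching with jstack is awkward (for example on the hung indexer JVM), a watchdog thread inside the process can produce a similar dump with the JDK's ThreadMXBean. A rough sketch, not specific to Elasticsearch:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class ThreadDumper {
        // Builds a jstack-like dump of every live thread in this JVM, including the
        // monitors and synchronizers they are blocked on, so it can be logged or
        // gisted even while the worker threads themselves are parked.
        public static String dumpAllThreads() {
            ThreadMXBean bean = ManagementFactory.getThreadMXBean();
            StringBuilder sb = new StringBuilder();
            for (ThreadInfo info : bean.dumpAllThreads(true, true)) {
                sb.append('"').append(info.getThreadName()).append("\" state=")
                  .append(info.getThreadState()).append('\n');
                for (StackTraceElement frame : info.getStackTrace()) {
                    sb.append("    at ").append(frame).append('\n');
                }
                sb.append('\n');
            }
            return sb.toString();
        }
    }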
