One of my shards is stuck in INITIALIZING

Hi All,

I have a cluster with 2 nodes (elasticsearch-2.3.4, java version "1.8.0_73"), and my index has 8 shards.

The first error occurred when the JVM heap usage hit 99% while I was indexing around 1000 messages every 10 to 15 seconds.

I tried to recover by restarting, but noticed that shard[7] is always stuck in INITIALIZING. I also tried deleting shard[7], and even the whole index, on the affected node and restarting Elasticsearch, but shard[7] keeps getting stuck; the other shards do not seem to have this problem.
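Would it also make sense to force a fresh copy of the replica by dropping it and adding it back through the index settings? Something like this (just a sketch based on the docs, assuming the default HTTP port 9200; I'm not sure it is the right approach on 2.3.4):

curl -XPUT 'http://10.0.3.169:9200/chatlogs_prd1/_settings' -d '{"index": {"number_of_replicas": 0}}'
# wait for the cluster to settle, then add the replica back
curl -XPUT 'http://10.0.3.169:9200/chatlogs_prd1/_settings' -d '{"index": {"number_of_replicas": 1}}'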

Here is the current status
chatlogs_prd1 1 r STARTED 1265795 10.2gb 10.0.3.169 sgrlelstica01
chatlogs_prd1 1 p STARTED 1265795 10.2gb 10.0.4.232 sgrlelsticc01
chatlogs_prd1 3 r STARTED 1358479 11.2gb 10.0.3.169 sgrlelstica01
chatlogs_prd1 3 p STARTED 1358479 11.2gb 10.0.4.232 sgrlelsticc01
chatlogs_prd1 5 r STARTED 1342258 11gb 10.0.3.169 sgrlelstica01
chatlogs_prd1 5 p STARTED 1342258 11gb 10.0.4.232 sgrlelsticc01
chatlogs_prd1 6 r STARTED 1360696 11.2gb 10.0.3.169 sgrlelstica01
chatlogs_prd1 6 p STARTED 1360696 11.2gb 10.0.4.232 sgrlelsticc01
chatlogs_prd1 2 r STARTED 1334285 10.9gb 10.0.3.169 sgrlelstica01
chatlogs_prd1 2 p STARTED 1334285 10.9gb 10.0.4.232 sgrlelsticc01
chatlogs_prd1 7 r INITIALIZING 10.0.3.169 sgrlelstica01
chatlogs_prd1 7 p STARTED 1377614 11.3gb 10.0.4.232 sgrlelsticc01
chatlogs_prd1 4 r STARTED 1238838 10gb 10.0.3.169 sgrlelstica01
chatlogs_prd1 4 p STARTED 1238838 10gb 10.0.4.232 sgrlelsticc01
chatlogs_prd1 0 r STARTED 1252498 10.1gb 10.0.3.169 sgrlelstica01
chatlogs_prd1 0 p STARTED 1252498 10.1gb 10.0.4.232 sgrlelsticc01
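(That listing is the output of _cat/shards. On 2.3.4 the same information, plus the recovery progress of the stuck replica, should also be visible with something like the following, assuming the default HTTP port 9200:

curl 'http://10.0.3.169:9200/_cat/shards/chatlogs_prd1?v'
curl 'http://10.0.3.169:9200/_cat/recovery/chatlogs_prd1?v'
)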

The log mentions OOM, but when I checked jvm.mem.heap_used_percent it was around 95%, and GC was running every second for both the old and the young generations. The number of segments is not changing either; it stays at 453 segments.
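(In case it is useful, this is roughly how I am checking those numbers, assuming the default HTTP port 9200:

curl 'http://10.0.3.169:9200/_nodes/stats/jvm?pretty'        # jvm.mem.heap_used_percent per node
curl 'http://10.0.3.169:9200/_cat/segments/chatlogs_prd1?v'  # segment count per shard
)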

What could be the cause of the OOM?

Thanks!

Here is the error in the logs
[2016-08-15 07:47:52,603][WARN ][cluster.action.shard ] [sgrlelstica01] [chatlogs_prd1][7] received shard failed for target shard [[chatlogs_prd1][7], node[j4J7Q7a9T1SQoP683EC9Bw], [R], v[40230], s[INITIALIZING], a[id=6-MSlILeTgytvhftl_CNeQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2016-08-15T07:47:51.728Z], details[failed recovery, failure RecoveryFailedException[[chatlogs_prd1][7]: Recovery failed from {sgrlelsticc01}{iZwCFE9IQbqmIetGNKAqQw}{10.0.4.232}{10.0.4.232:9300}{master=false} into {sgrlelstica01}{j4J7Q7a9T1SQoP683EC9Bw}{10.0.3.169}{10.0.3.169:9300}{master=true}]; nested: RemoteTransportException[[sgrlelsticc01][10.0.4.232:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: RemoteTransportException[[sgrlelstica01][10.0.3.169:9300][internal:index/shard/recovery/prepare_translog]]; nested: OutOfMemoryError[Java heap space]; ]]], indexUUID [7hxG06lwS6iUiC8PJOfirw], message [failed recovery], failure [RecoveryFailedException[[chatlogs_prd1][7]: Recovery failed from {sgrlelsticc01}{iZwCFE9IQbqmIetGNKAqQw}{10.0.4.232}{10.0.4.232:9300}{master=false} into {sgrlelstica01}{j4J7Q7a9T1SQoP683EC9Bw}{10.0.3.169}{10.0.3.169:9300}{master=true}]; nested: RemoteTransportException[[sgrlelsticc01][10.0.4.232:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: RemoteTransportException[[sgrlelstica01][10.0.3.169:9300][internal:index/shard/recovery/prepare_translog]]; nested: OutOfMemoryError[Java heap space]; ]
RecoveryFailedException[[chatlogs_prd1][7]: Recovery failed from {sgrlelsticc01}{iZwCFE9IQbqmIetGNKAqQw}{10.0.4.232}{10.0.4.232:9300}{master=false} into {sgrlelstica01}{j4J7Q7a9T1SQoP683EC9Bw}{10.0.3.169}{10.0.3.169:9300}{master=true}]; nested: RemoteTransportException[[sgrlelsticc01][10.0.4.232:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: RemoteTransportException[[sgrlelstica01][10.0.3.169:9300][internal:index/shard/recovery/prepare_translog]]; nested: OutOfMemoryError[Java heap space];
at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:258)
at org.elasticsearch.indices.recovery.RecoveryTarget.access$1100(RecoveryTarget.java:69)
at org.elasticsearch.indices.recovery.RecoveryTarget$RecoveryRunner.doRun(RecoveryTarget.java:508)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Here is the continuation

Caused by: RemoteTransportException[[sgrlelsticc01][10.0.4.232:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: RemoteTransportException[[sgrlelstica01][10.0.3.169:9300][internal:index/shard/recovery/prepare_translog]]; nested: OutOfMemoryError[Java heap space];
Caused by: [chatlogs_prd1][[chatlogs_prd1][7]] RecoveryEngineException[Phase[1] phase1 failed]; nested: RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: RemoteTransportException[[sgrlelstica01][10.0.3.169:9300][internal:index/shard/recovery/prepare_translog]]; nested: OutOfMemoryError[Java heap space];
at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:135)
at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:126)
at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:52)
at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:135)
at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:132)
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: [chatlogs_prd1][[chatlogs_prd1][7]] RecoverFilesRecoveryException[Failed to transfer [0] files with total size of [0b]]; nested: RemoteTransportException[[sgrlelstica01][10.0.3.169:9300][internal:index/shard/recovery/prepare_translog]]; nested: OutOfMemoryError[Java heap space];
at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:453)
at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:133)
... 11 more
Caused by: RemoteTransportException[[sgrlelstica01][10.0.3.169:9300][internal:index/shard/recovery/prepare_translog]]; nested: OutOfMemoryError[Java heap space];
Caused by: java.lang.OutOfMemoryError: Java heap space

It looks like you are running out of heap space. How much heap do you have configured?

I am using a small AWS instance with only 4GB of RAM, so I configured 2GB of heap for Elasticsearch.
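(The heap is set through the ES_HEAP_SIZE environment variable, e.g. in /etc/default/elasticsearch for the Debian package, if I understood the 2.x docs correctly:

ES_HEAP_SIZE=2g
)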

I have another setup with much bigger data and I don't get this problem there. Also, jvm.mem shows I am using 95% of the heap. I know I should be worried once it reaches 85%, but the cluster is only recovering right now; no new data is being inserted and no one is searching.
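Would it help to throttle the recovery while the heap is this tight? I was thinking of a transient setting like the one below (just a guess from the recovery settings docs, assuming the default HTTP port 9200; I have not tried it yet):

curl -XPUT 'http://10.0.3.169:9200/_cluster/settings' -d '{"transient": {"indices.recovery.max_bytes_per_sec": "20mb"}}'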