[v1.5.1] Replica shard stuck initializing and can't read stats for primary shard


(Nick Pentreath) #1

Hi,

[Using Elasticsearch 1.5.1]

I currently have an issue where the replicas of one of the 5 primary shards in an index are stuck in the INITIALIZING state (for well over 24 hours now). The primary shard is marked as STARTED, but I cannot retrieve stats for that shard.

Output of cat health:

epoch      timestamp cluster       status node.total node.data shards pri relo init unassign pending_tasks
1442473639 07:07:19  BBB           yellow          3         3     63  30    0    2        0             0

Output of cat shards:

index       shard prirep state       docs store ip            node
AAA_1       1     p      STARTED                _____________ es-live-1
AAA_1       1     r      INITIALIZING           _____________ es-live-3
AAA_1       1     r      INITIALIZING           _____________ es-live-2

Output for another index, which is fine and where I can see the shard stats:

index       shard prirep state       docs store ip            node
graphflow_1 4     p      STARTED  5071499 1.9gb _____________ es-live-1
graphflow_1 4     r      STARTED  5071499 1.9gb _____________ es-live-2
graphflow_1 0     p      STARTED  4620643 1.6gb _____________ es-live-1
graphflow_1 0     r      STARTED  4620643 1.6gb _____________ es-live-2
... 

I also get this:

[2015-09-17 07:17:53,082][DEBUG][action.admin.cluster.stats] [es-live-1] failed to execute on node [-gLPPrH_R4i5RFKYoeXO3w]
org.elasticsearch.index.engine.EngineClosedException: [AAA_1][1] CurrentState[CLOSED]

Originally I was getting a lot of timeouts and some GC errors on the node that held the PRIMARY of the relevant shard. The node was unresponsive and I had to restart it. Since then the cluster has been yellow with this issue.

Search & aggregations seem to be working. But when I try to run a scan-scroll (using elasticsearch-hadoop for bulk analytics jobs), I get

SearchPhaseExecutionException[Failed to execute phase [init_scan], all shards failed]

Any help appreciated.


(Mark Walkom) #2

Try dropping the replica and then adding it back.
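
A rough sketch of what dropping and re-adding the replica can look like through the index update-settings API (`PUT /<index>/_settings` with `index.number_of_replicas`). The base URL, the helper names, and the original replica count are assumptions for illustration and not taken from the thread; check the index's current setting before running anything like this. Setting the count to 0 removes every replica of every shard in the index, not just the stuck one, so all replicas are rebuilt from the primaries afterwards.

```python
import json
import urllib.request


def replica_settings_body(count):
    """Build the JSON body that sets index.number_of_replicas."""
    if count < 0:
        raise ValueError("replica count cannot be negative")
    return json.dumps({"index": {"number_of_replicas": count}}).encode("utf-8")


def set_replicas(base_url, index, count):
    """Send PUT /<index>/_settings with the requested replica count."""
    request = urllib.request.Request(
        url="{}/{}/_settings".format(base_url.rstrip("/"), index),
        data=replica_settings_body(count),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def reset_replicas(base_url, index, original_count):
    """Drop all replicas of the index, then restore the original count."""
    set_replicas(base_url, index, 0)               # discards the stuck replica copies
    set_replicas(base_url, index, original_count)  # replicas are reallocated and recovered
```

For example, `reset_replicas("http://localhost:9200", "AAA_1", original_count)` with `original_count` set to whatever the index was configured with before; the host and port here are placeholders.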


(Nick Pentreath) #3

Thanks - that worked. Any idea on the cause for this, and is it something fixed in later versions?


(Mark Walkom) #4

It could have been a few things. Turning up logging will give you a better idea of the cause if it happens again.
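
One possible way to turn up logging without a restart is a transient setting sent to the cluster update-settings API (`PUT /_cluster/settings`). The sketch below only builds the request body. The choice of the `indices.recovery` logger is an assumption on my part, not something stated in this thread, and the helper name is hypothetical; verify against the documentation for your version that logger settings can be updated dynamically.

```python
import json

# Log levels accepted by this helper; the list is deliberately conservative.
ALLOWED_LEVELS = {"TRACE", "DEBUG", "INFO", "WARN", "ERROR"}


def logging_settings_body(logger_name, level="DEBUG"):
    """Build a transient cluster-settings body that changes one logger's level."""
    level = level.upper()
    if level not in ALLOWED_LEVELS:
        raise ValueError("unknown log level: {}".format(level))
    # Transient settings are lost on a full cluster restart, which suits
    # temporary debugging.
    return {"transient": {"logger." + logger_name: level}}


# Assumed logger name, chosen because the problem involves shard recovery.
body = logging_settings_body("indices.recovery", "debug")
print(json.dumps(body))
```

The printed JSON is what would be sent as the body of `PUT /_cluster/settings`. Setting the same logger back to `INFO` afterwards keeps the logs from growing unnecessarily.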

