Data corruption in a few Elasticsearch indices

Hi all,

We've encountered data corruption in our main production Elasticsearch cluster. We started getting the following response when querying the corrupted index:
{"error":{"root_cause":[{"type":"no_shard_available_action_exception","reason":"No shard available for [get [hotels3][points][4185]: routing [null]]"}],"type":"no_shard_available_action_exception","reason":"No shard available for [get [hotels3][points][4185]: routing [null]]"},"status":503}

_cat/shards returns the following state (we didn't change any settings):
hotels3 0 p UNASSIGNED
hotels3 0 r UNASSIGNED
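
For reference, this is roughly how we've been asking the cluster why the shard is unassigned (a minimal sketch; http://escl01:9200 is a placeholder for one of our cluster's HTTP endpoints, and the index/shard values match the _cat output above):

import json
import requests

# Ask the allocation explain API why the primary of hotels3 shard 0 is unassigned.
resp = requests.get(
    "http://escl01:9200/_cluster/allocation/explain",
    json={"index": "hotels3", "shard": 0, "primary": True},
)
print(json.dumps(resp.json(), indent=2))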

The replicated cluster instances ended up with the same error and lost all their data as well, which was another big hit.
Restarting Elasticsearch did not recover the indices.

I've seen many topics with the same error, but none had a cause that applied to us or a solution that worked for us.

  1. What could cause this issue, and how can we prevent it?
  2. How can we fix the indices? (We found one candidate fix in other topics; see the sketch after this list.)
  3. How can we keep the replicated cluster nodes from getting corrupted as well?
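
For (2), the closest thing we found elsewhere is the reroute API's allocate_empty_primary command. A sketch of what we believe that call looks like follows; we haven't dared run it, since as far as we understand accept_data_loss wipes whatever the shard still holds, and the node name here is just a placeholder. Can someone confirm whether this is the right direction?

import requests

# Last-resort reroute command seen in other topics; NOT run yet.
# "escl01" is a placeholder node name, and accept_data_loss discards the
# shard's existing contents, so this only makes sense once recovery is ruled out.
resp = requests.post(
    "http://escl01:9200/_cluster/reroute",
    json={
        "commands": [
            {
                "allocate_empty_primary": {
                    "index": "hotels3",
                    "shard": 0,
                    "node": "escl01",
                    "accept_data_loss": True,
                }
            }
        ]
    },
)
print(resp.json())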

Any help will be greatly appreciated.

Thank you,
Ami

Cluster logs:
[2017-10-23T09:57:22,773][DEBUG][o.e.a.s.TransportSearchAction] [escl01] All shards failed for phase: [query]
org.elasticsearch.action.NoShardAvailableActionException: null
at org.elasticsearch.action.search.AbstractSearchAsyncAction.start(AbstractSearchAsyncAction.java:122) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.search.TransportSearchAction.executeSearch(TransportSearchAction.java:240) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:146) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:67) ~
...

...
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:441) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.7.Final.jar:4.1.7.Final]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
[2017-10-23T09:57:22,777][WARN ][r.suppressed ] path: /hotels3/_count, params: {index=hotels3}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
at org.elasticsearch.action.search.AbstractSearchAsyncAction.onInitialPhaseResult(AbstractSearchAsyncAction.java:223) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.start(AbstractSearchAsyncAction.java:122) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.search.TransportSearchAction.executeSearch(TransportSearchAction.java:240) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:146) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:67) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:170) ~[elasticsearch-5.3.0.jar:5.3.0]
...

...
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:481) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:441) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.7.Final.jar:4.1.7.Final]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: org.elasticsearch.action.NoShardAvailableActionException
... 58 more

You need to search the logs of all your nodes for this particular shard, as this shard is not part of your cluster anymore.

Have you maybe lost nodes that held that shard?
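
Something like this quick scan, run on every node, can surface those log lines (a rough sketch; /var/log/elasticsearch is the default path on many installs, so adjust to yours):

import glob

# Scan this node's Elasticsearch logs for mentions of the lost shard,
# which appears in log lines as [hotels3][0].
for path in glob.glob("/var/log/elasticsearch/*.log*"):
    with open(path, errors="ignore") as f:
        for lineno, line in enumerate(f, start=1):
            if "[hotels3][0]" in line:
                print(f"{path}:{lineno}: {line.rstrip()}")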

Thanks for your reply. The index is configured with number_of_shards: 1, and the other nodes are replicas of the main one ("main" for us in the sense that it's the only node we push data to; the other nodes are read-only).
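
To double-check that, this is roughly how we confirmed the index settings (a sketch; the host is a placeholder for our write node, and _settings still answers even while the shards are unassigned, since the metadata lives in the cluster state):

import requests

# Confirm the index's configured shard and replica counts.
settings = requests.get("http://escl01:9200/hotels3/_settings").json()
index_settings = settings["hotels3"]["settings"]["index"]
print("shards:", index_settings["number_of_shards"])
print("replicas:", index_settings["number_of_replicas"])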

On the two other nodes I found these errors:
node2:
[internal:index/shard/recovery/start_recovery] sent error response
org.elasticsearch.indices.recovery.DelayRecoveryException: source node does not have the shard listed in its state as allocated on the node

node3:
[2017-10-18T20:10:22,074][TRACE][o.e.t.T.tracer ] [escl03] [14671845][internal:index/shard/recovery/start_recovery] received request
[2017-10-18T20:10:22,075][TRACE][o.e.t.T.tracer ] [escl03] [14671845][internal:index/shard/recovery/start_recovery] sent error response
org.elasticsearch.index.shard.ShardNotFoundException: no such shard
at org.elasticsearch.index.IndexService.getShard(IndexService.java:208) ~[elasticsearch-5.3.0.jar:5.3.0]
