Hi all,
We've encountered data corruption in our main production Elasticsearch cluster (5.3.0). We started getting the following response when querying the affected index:
{"error":{"root_cause":[{"type":"no_shard_available_action_exception","reason":"No shard available for [get [hotels3][points][4185]: routing [null]]"}],"type":"no_shard_available_action_exception","reason":"No shard available for [get [hotels3][points][4185]: routing [null]]"},"status":503}
_cat/shards returns the following state (we didn't change any settings):
hotels3 0 p UNASSIGNED
hotels3 0 r UNASSIGNED
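In case it helps with the diagnosis, this is roughly what we are running to find out why the shards are unassigned (the allocation explain API should be available on 5.3 as far as I know; localhost:9200 is just our local node, adjust as needed). I can post the output if useful:

# Show shard state together with the unassigned reason
curl -XGET 'localhost:9200/_cat/shards/hotels3?v&h=index,shard,prirep,state,unassigned.reason'

# Ask the cluster why shard 0 of hotels3 is not being allocated
curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d '
{
  "index": "hotels3",
  "shard": 0,
  "primary": true
}'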
Our replicated cluster instances ended up in the same state and lost all their data as well, which was another big hit.
Restarting Elasticsearch did not recover the indices.
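For completeness, here is how we are checking recovery progress and shard-level health after the restart (again assuming localhost:9200; both are standard cat/health APIs on 5.x as far as I know):

# Recovery status of the affected index
curl -XGET 'localhost:9200/_cat/recovery/hotels3?v'

# Shard-level health for the affected index
curl -XGET 'localhost:9200/_cluster/health/hotels3?level=shards&pretty'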
I saw many topics with the same error, but none described a cause that applied to us or a solution that worked for us. Our questions:
- What could cause this issue, and how can we prevent it?
- How can we fix the indices? (we've sketched below what we're considering)
- How can we prevent the replicated cluster nodes from getting corrupted as well?
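For the second point, what we are considering (but have not run yet) is a manual reroute with allocate_stale_primary, assuming a stale copy of the shard still exists on disk on one of the nodes; escl01 below is just our node name from the logs. Since this explicitly accepts data loss, we would like to confirm it is the right approach before trying it:

# Force-allocate the stale primary copy of shard 0 on node escl01
# (accept_data_loss acknowledges that any writes newer than this copy are lost)
curl -XPOST 'localhost:9200/_cluster/reroute?pretty' -H 'Content-Type: application/json' -d '
{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "hotels3",
        "shard": 0,
        "node": "escl01",
        "accept_data_loss": true
      }
    }
  ]
}'

For prevention, we assume regular snapshots to a shared repository are the recommended safety net, but we would appreciate confirmation.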
Any help will be greatly appreciated.
Thank you,
Ami
Cluster logs:
[2017-10-23T09:57:22,773][DEBUG][o.e.a.s.TransportSearchAction] [escl01] All shards failed for phase: [query]
org.elasticsearch.action.NoShardAvailableActionException: null
at org.elasticsearch.action.search.AbstractSearchAsyncAction.start(AbstractSearchAsyncAction.java:122) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.search.TransportSearchAction.executeSearch(TransportSearchAction.java:240) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:146) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:67) ~
...
...
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:441) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.7.Final.jar:4.1.7.Final]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
[2017-10-23T09:57:22,777][WARN ][r.suppressed ] path: /hotels3/_count, params: {index=hotels3}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
at org.elasticsearch.action.search.AbstractSearchAsyncAction.onInitialPhaseResult(AbstractSearchAsyncAction.java:223) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.start(AbstractSearchAsyncAction.java:122) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.search.TransportSearchAction.executeSearch(TransportSearchAction.java:240) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:146) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:67) ~[elasticsearch-5.3.0.jar:5.3.0]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:170) ~[elasticsearch-5.3.0.jar:5.3.0]
...
...
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:481) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:441) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.7.Final.jar:4.1.7.Final]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: org.elasticsearch.action.NoShardAvailableActionException
... 58 more