Elasticsearch on Kubernetes - cluster red, one node cannot start

Hello,

Could you please help me get my cluster and shards back to green?
I am running Elasticsearch on Kubernetes pods, in a cluster with 3 nodes. After some restarts of the pods/nodes, the third node does not start, the cluster is red, and 2 important shards are unassigned. I see the following kinds of log entries on this node (failed to read local state, and CorruptIndexException: codec footer mismatch):

[2019-12-21T11:33:23,329][ERROR][o.e.g.GatewayMetaState ] [elasticsearch-2] failed to read local state, exiting...
org.elasticsearch.ElasticsearchException: java.io.IOException: failed to read [id:80, file:/usr/share/elasticsearch/data/nodes/0/_state/global-80.st]
    at org.elasticsearch.ExceptionsHelper.maybeThrowRuntimeAndSuppress(ExceptionsHelper.java:165) ~[elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestStateWithGeneration(MetaDataStateFormat.java:306) ~[elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:324) ~[elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.gateway.MetaStateService.loadGlobalState(MetaStateService.java:112) ~[elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.gateway.MetaStateService.loadFullState(MetaStateService.java:57) ~[elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.gateway.GatewayMetaState.<init>(GatewayMetaState.java:88) [elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.node.Node.<init>(Node.java:499) [elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.node.Node.<init>(Node.java:266) [elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:212) [elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:212) [elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:333) [elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) [elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:150) [elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) [elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124) [elasticsearch-cli-6.8.2.jar:6.8.2]
    at org.elasticsearch.cli.Command.main(Command.java:90) [elasticsearch-cli-6.8.2.jar:6.8.2]
    at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:116) [elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:93) [elasticsearch-6.8.2.jar:6.8.2]
Caused by: java.io.IOException: failed to read [id:80, file:/usr/share/elasticsearch/data/nodes/0/_state/global-80.st]
    at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestStateWithGeneration(MetaDataStateFormat.java:300) ~[elasticsearch-6.8.2.jar:6.8.2]
    ... 16 more
Caused by: org.elasticsearch.gateway.CorruptStateException: org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=-470888448 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/usr/share/elasticsearch/data/nodes/0/_state/global-80.st")))
    at org.elasticsearch.gateway.MetaDataStateFormat.read(MetaDataStateFormat.java:202) ~[elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestStateWithGeneration(MetaDataStateFormat.java:296) ~[elasticsearch-6.8.2.jar:6.8.2]
    ... 16 more
Caused by: org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=-470888448 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/usr/share/elasticsearch/data/nodes/0/_state/global-80.st")))
    at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:502) ~[lucene-core-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 - jimczi - 2019-02-04 23:16:28]
    at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:414) ~[lucene-core-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 - jimczi - 2019-02-04 23:16:28]
    at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:526) ~[lucene-core-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 - jimczi - 2019-02-04 23:16:28]
    at org.elasticsearch.gateway.MetaDataStateFormat.read(MetaDataStateFormat.java:185) ~[elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestStateWithGeneration(MetaDataStateFormat.java:296) ~[elasticsearch-6.8.2.jar:6.8.2]
    ... 16 more

[2019-12-21T11:33:23,390][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [elasticsearch-2] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: ElasticsearchException[java.io.IOException: failed to read [id:80, file:/usr/share/elasticsearch/data/nodes/0/_state/global-80.st]];
    nested: IOException[failed to read [id:80, file:/usr/share/elasticsearch/data/nodes/0/_state/global-80.st]];
    nested: CorruptStateException[org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=-470888448 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/usr/share/elasticsearch/data/nodes/0/_state/global-80.st")))];
    nested: CorruptIndexException[codec footer mismatch (file truncated?): actual footer=-470888448 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/usr/share/elasticsearch/data/nodes/0/_state/global-80.st")))];

There is a lot of noise in that stack trace, but this is the fundamental problem:

org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=-470888448 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/usr/share/elasticsearch/data/nodes/0/_state/global-80.st")))

The metadata on disk is corrupt, or at least it is different from the metadata that Elasticsearch wrote. This indicates something wrong with your storage or filesystem. Elasticsearch presents a workload to the underlying disks that's quite good at exposing bugs (e.g. in the kernel or in the filesystem) as well as faulty hardware.
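
For the curious: the check that failed is Lucene verifying a fixed magic value in the 16-byte footer it writes at the end of every file, and -1071082520 in the message is exactly that magic as a signed 32-bit int. A minimal Python sketch of the same check, assuming you can read a copy of the file from the log while the node is stopped:

    import struct

    # Lucene's footer magic (~CODEC_MAGIC = 0xC02893E8), i.e. -1071082520 as a
    # signed 32-bit int -- the "expected footer" in the log above.
    FOOTER_MAGIC = -1071082520

    def read_footer_magic(path):
        """Return the big-endian int at the start of the 16-byte Lucene footer
        (layout: 4-byte magic, 4-byte algorithm ID, 8-byte checksum)."""
        with open(path, "rb") as f:
            f.seek(-16, 2)  # seek to 16 bytes before end of file
            return struct.unpack(">i", f.read(4))[0]

    # Path taken from the log; run against a copy, with the node stopped.
    path = "/usr/share/elasticsearch/data/nodes/0/_state/global-80.st"
    actual = read_footer_magic(path)
    print("footer OK" if actual == FOOTER_MAGIC else f"corrupt: actual footer={actual}")

A mismatch here means the bytes on disk are simply not what was written, which is why this points at the storage layer rather than at Elasticsearch itself.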

After addressing that, I think it's best to start this node afresh by wiping its data directory and restoring any red indices from a recent snapshot.
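
If you do have a snapshot repository registered, the restore is a single API call. A rough sketch with Python's requests (repository, snapshot, and index names are hypothetical; assumes the cluster is reachable on localhost without security):

    import requests

    ES = "http://localhost:9200"  # assumption: local endpoint, no auth

    # List available snapshots first (repository name is hypothetical).
    print(requests.get(f"{ES}/_snapshot/my_repo/_all").json())

    # The existing red index must be deleted or closed before restoring over it.
    resp = requests.post(
        f"{ES}/_snapshot/my_repo/snapshot_1/_restore",
        json={"indices": "my-red-index", "include_global_state": False},
    )
    print(resp.json())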

Hi David,

Thank you very much for your answer. I highly appreciate it!

Could you please share your opinion and help me clarify a few more things:

Because I do not have a snapshot of the data, and I do not fully understand how Elasticsearch manages its data internally, I am trying to figure out our best options:

  1. If we resolve the issues with the underlying storage volume (yes, it is GlusterFS), is there a chance the data from these 2 shards can be joined back into the Elasticsearch data? What approach can we use to join the data back?

  2. If we move the data directory to another location, is there an option, after the cluster is up again, to restore the 2 shards from the moved old data?

Unfortunately the only other way forward is to create the index from the source data again. If you no longer have the source data then I don't have any recovery suggestions.
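
If the source data does still exist, re-creating the index is a matter of bulk-indexing it again. A sketch, with a hypothetical index name and documents standing in for your real data (6.x bulk requests still carry a type, and the body is newline-delimited JSON):

    import json
    import requests

    ES = "http://localhost:9200"  # assumption: local endpoint, no auth

    # Hypothetical documents -- replace with an iterator over your source data.
    docs = [{"id": 1, "msg": "hello"}, {"id": 2, "msg": "world"}]

    # Bulk body: an action line then a document line per doc, trailing newline.
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": "my-index", "_type": "_doc", "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    body = "\n".join(lines) + "\n"

    resp = requests.post(f"{ES}/_bulk", data=body,
                         headers={"Content-Type": "application/x-ndjson"})
    print(resp.json()["errors"])  # False means every document was indexed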

Elasticsearch works pretty hard to protect the integrity of your data, but of course there's a limit to how resilient it is reasonable to be. Your daring choice of filesystem plus the lack of replicas or snapshots is just too much to handle.
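
For next time, both gaps are cheap to close: replicas are just an index setting, and a snapshot repository plus periodic snapshots guards against exactly this failure. A sketch, again with hypothetical names, using an fs-type repository whose location must be listed under path.repo in elasticsearch.yml on every node:

    import requests

    ES = "http://localhost:9200"  # assumption: local endpoint, no auth

    # Give each primary shard a replica on another node.
    requests.put(f"{ES}/my-index/_settings",
                 json={"index": {"number_of_replicas": 1}})

    # Register a shared-filesystem repository and take a first snapshot.
    requests.put(f"{ES}/_snapshot/my_repo",
                 json={"type": "fs", "settings": {"location": "/mnt/es-backups"}})
    requests.put(f"{ES}/_snapshot/my_repo/snapshot_1?wait_for_completion=true")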

Hi David,

Do you mean the whole index? Could I use the other 3 healthy shards of this index?

You can, but I don't see how it helps since each shard contains some random fraction of the documents in an index. The reference manual contains instructions for accepting this kind of data loss by allocating a stale or empty primary.
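
Concretely, those instructions boil down to a cluster reroute command. A sketch (index, shard number, and node name are hypothetical; accept_data_loss must be set explicitly, and allocate_empty_primary discards whatever the lost shard held):

    import requests

    ES = "http://localhost:9200"  # assumption: local endpoint, no auth

    # Force-allocate an empty primary for a lost shard, accepting the data loss.
    # allocate_stale_primary instead tries to promote an out-of-date copy.
    resp = requests.post(f"{ES}/_cluster/reroute", json={
        "commands": [{
            "allocate_empty_primary": {
                "index": "my-index",
                "shard": 0,
                "node": "elasticsearch-0",
                "accept_data_loss": True,
            }
        }]
    })
    print(resp.status_code)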
