Elasticsearch on Kubernetes - cluster red, one node cannot start

Hello,

Could you please help me get my cluster and shards back to green?
I am running Elasticsearch on Kubernetes pods, in a cluster with 3 nodes. After some restarts of the pods/nodes, the third node does not start, the cluster is red, and 2 important shards are unassigned. I see the following kinds of log entries on this node (failed to read local state, and CorruptIndexException: codec footer mismatch):

[2019-12-21T11:33:23,329][ERROR][o.e.g.GatewayMetaState ] [elasticsearch-2] failed to read local state, exiting...
org.elasticsearch.ElasticsearchException: java.io.IOException: failed to read [id:80, file:/usr/share/elasticsearch/data/nodes/0/_state/global-80.st]
    at org.elasticsearch.ExceptionsHelper.maybeThrowRuntimeAndSuppress(ExceptionsHelper.java:165) ~[elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestStateWithGeneration(MetaDataStateFormat.java:306) ~[elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:324) ~[elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.gateway.MetaStateService.loadGlobalState(MetaStateService.java:112) ~[elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.gateway.MetaStateService.loadFullState(MetaStateService.java:57) ~[elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.gateway.GatewayMetaState.<init>(GatewayMetaState.java:88) [elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.node.Node.<init>(Node.java:499) [elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.node.Node.<init>(Node.java:266) [elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:212) [elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:212) [elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:333) [elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) [elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:150) [elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) [elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124) [elasticsearch-cli-6.8.2.jar:6.8.2]
    at org.elasticsearch.cli.Command.main(Command.java:90) [elasticsearch-cli-6.8.2.jar:6.8.2]
    at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:116) [elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:93) [elasticsearch-6.8.2.jar:6.8.2]
Caused by: java.io.IOException: failed to read [id:80, file:/usr/share/elasticsearch/data/nodes/0/_state/global-80.st]
    at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestStateWithGeneration(MetaDataStateFormat.java:300) ~[elasticsearch-6.8.2.jar:6.8.2]
    ... 16 more
Caused by: org.elasticsearch.gateway.CorruptStateException: org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=-470888448 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/usr/share/elasticsearch/data/nodes/0/_state/global-80.st")))
    at org.elasticsearch.gateway.MetaDataStateFormat.read(MetaDataStateFormat.java:202) ~[elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestStateWithGeneration(MetaDataStateFormat.java:296) ~[elasticsearch-6.8.2.jar:6.8.2]
    ... 16 more
Caused by: org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=-470888448 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/usr/share/elasticsearch/data/nodes/0/_state/global-80.st")))
    at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:502) ~[lucene-core-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 - jimczi - 2019-02-04 23:16:28]
    at org.apache.lucene.codecs.CodecUtil.checkFooter(CodecUtil.java:414) ~[lucene-core-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 - jimczi - 2019-02-04 23:16:28]
    at org.apache.lucene.codecs.CodecUtil.checksumEntireFile(CodecUtil.java:526) ~[lucene-core-7.7.0.jar:7.7.0 8c831daf4eb41153c25ddb152501ab5bae3ea3d5 - jimczi - 2019-02-04 23:16:28]
    at org.elasticsearch.gateway.MetaDataStateFormat.read(MetaDataStateFormat.java:185) ~[elasticsearch-6.8.2.jar:6.8.2]
    at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestStateWithGeneration(MetaDataStateFormat.java:296) ~[elasticsearch-6.8.2.jar:6.8.2]
    ... 16 more

[2019-12-21T11:33:23,390][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [elasticsearch-2] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: ElasticsearchException[java.io.IOException: failed to read [id:80, file:/usr/share/elasticsearch/data/nodes/0/_state/global-80.st]];
    nested: IOException[failed to read [id:80, file:/usr/share/elasticsearch/data/nodes/0/_state/global-80.st]];
    nested: CorruptStateException[org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=-470888448 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/usr/share/elasticsearch/data/nodes/0/_state/global-80.st")))];
    nested: CorruptIndexException[codec footer mismatch (file truncated?): actual footer=-470888448 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/usr/share/elasticsearch/data/nodes/0/_state/global-80.st")))];

There is a lot of noise in that stack trace, but this is the fundamental problem:

org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=-470888448 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/usr/share/elasticsearch/data/nodes/0/_state/global-80.st")))

The metadata on disk is corrupt, or at least it is different from the metadata that Elasticsearch wrote. This indicates something wrong with your storage or filesystem. Elasticsearch presents a workload to the underlying disks that's quite good at exposing bugs (e.g. in the kernel or in the filesystem) as well as faulty hardware.
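
For the curious: the check that failed is Lucene verifying a fixed magic value in the 16-byte footer it writes at the end of every file, and -1071082520 in the message is exactly that magic as a signed 32-bit int. A minimal Python sketch of the same check, assuming you can read a copy of the file from the log while the node is stopped:

    import struct

    # Lucene's footer magic (~CODEC_MAGIC = 0xC02893E8), i.e. -1071082520 as a
    # signed 32-bit int -- the "expected footer" in the log above.
    FOOTER_MAGIC = -1071082520

    def read_footer_magic(path):
        """Return the big-endian int at the start of the 16-byte Lucene footer
        (layout: 4-byte magic, 4-byte algorithm ID, 8-byte checksum)."""
        with open(path, "rb") as f:
            f.seek(-16, 2)  # seek to 16 bytes before end of file
            return struct.unpack(">i", f.read(4))[0]

    # Path taken from the log; run against a copy, with the node stopped.
    path = "/usr/share/elasticsearch/data/nodes/0/_state/global-80.st"
    actual = read_footer_magic(path)
    print("footer OK" if actual == FOOTER_MAGIC else f"corrupt: actual footer={actual}")

A mismatch here means the bytes on disk are simply not what was written, which is why this points at the storage layer rather than at Elasticsearch itself.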

After addressing that, I think it's best to start this node afresh by wiping its data directory and restoring any red indices from a recent snapshot.
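
If you do have a snapshot repository registered, the restore is a single API call. A rough sketch with Python's requests (repository, snapshot, and index names are hypothetical; assumes the cluster is reachable on localhost without security):

    import requests

    ES = "http://localhost:9200"  # assumption: local endpoint, no auth

    # List available snapshots first (repository name is hypothetical).
    print(requests.get(f"{ES}/_snapshot/my_repo/_all").json())

    # The existing red index must be deleted or closed before restoring over it.
    resp = requests.post(
        f"{ES}/_snapshot/my_repo/snapshot_1/_restore",
        json={"indices": "my-red-index", "include_global_state": False},
    )
    print(resp.json())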

Hi David,

Thank you very much for your answer. I highly appreciate it!

Could you please share your opinion and help me clarify a few more things:

Because I do not have a snapshot of the data, and I do not fully understand how Elasticsearch manages its data internally, I am trying to figure out our best options:

  1. If we resolve the issues with the underlying storage volume (yes, it is GlusterFS), is there a chance the data from these 2 shards can be joined back into the Elasticsearch data? What approach can we use to join the data back?

  2. If we move the data directory to another location, is there an option, after the cluster is up again, to restore the 2 shards from the moved old data?

Unfortunately the only other way forward is to create the index from the source data again. If you no longer have the source data then I don't have any recovery suggestions.
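
If the source data does still exist, re-creating the index is a matter of bulk-indexing it again. A sketch, with a hypothetical index name and documents standing in for your real data (6.x bulk requests still carry a type, and the body is newline-delimited JSON):

    import json
    import requests

    ES = "http://localhost:9200"  # assumption: local endpoint, no auth

    # Hypothetical documents -- replace with an iterator over your source data.
    docs = [{"id": 1, "msg": "hello"}, {"id": 2, "msg": "world"}]

    # Bulk body: an action line then a document line per doc, trailing newline.
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": "my-index", "_type": "_doc", "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    body = "\n".join(lines) + "\n"

    resp = requests.post(f"{ES}/_bulk", data=body,
                         headers={"Content-Type": "application/x-ndjson"})
    print(resp.json()["errors"])  # False means every document was indexed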

Elasticsearch works pretty hard to protect the integrity of your data, but of course there's a limit to how resilient it is reasonable to be. Your daring choice of filesystem plus the lack of replicas or snapshots is just too much to handle.
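
For next time, both gaps are cheap to close: replicas are just an index setting, and a snapshot repository plus periodic snapshots guards against exactly this failure. A sketch, again with hypothetical names, using an fs-type repository whose location must be listed under path.repo in elasticsearch.yml on every node:

    import requests

    ES = "http://localhost:9200"  # assumption: local endpoint, no auth

    # Give each primary shard a replica on another node.
    requests.put(f"{ES}/my-index/_settings",
                 json={"index": {"number_of_replicas": 1}})

    # Register a shared-filesystem repository and take a first snapshot.
    requests.put(f"{ES}/_snapshot/my_repo",
                 json={"type": "fs", "settings": {"location": "/mnt/es-backups"}})
    requests.put(f"{ES}/_snapshot/my_repo/snapshot_1?wait_for_completion=true")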

Hi David,

Do you mean the whole index? Could I use the other 3 healthy shards of this index?

You can, but I don't see how it helps since each shard contains some random fraction of the documents in an index. The reference manual contains instructions for accepting this kind of data loss by allocating a stale or empty primary.
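
Concretely, those instructions boil down to a cluster reroute command. A sketch (index, shard number, and node name are hypothetical; accept_data_loss must be set explicitly, and allocate_empty_primary discards whatever the lost shard held):

    import requests

    ES = "http://localhost:9200"  # assumption: local endpoint, no auth

    # Force-allocate an empty primary for a lost shard, accepting the data loss.
    # allocate_stale_primary instead tries to promote an out-of-date copy.
    resp = requests.post(f"{ES}/_cluster/reroute", json={
        "commands": [{
            "allocate_empty_primary": {
                "index": "my-index",
                "shard": 0,
                "node": "elasticsearch-0",
                "accept_data_loss": True,
            }
        }]
    })
    print(resp.status_code)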
